AI and human scoring for postgraduate writing: evaluating score reliability, variability, and rater behaviours
Han, Turgay
e7fe202c-a0dd-426b-81e9-5beaf4458d7a
Zheng, Ying
abc38a5e-a4ba-460e-92e2-b766d11d2b29
11 February 2026
Han, Turgay and Zheng, Ying (2026) AI and human scoring for postgraduate writing: evaluating score reliability, variability, and rater behaviours. Studies in Educational Evaluation, 88, [101572]. (doi:10.1016/j.stueduc.2026.101572).
Abstract
This study examines the reliability and consistency of AutoMarkGPT, a customized version of ChatGPT-4.0, in scoring postgraduate writing assignments across multiple time intervals. While Automated Writing Evaluation (AWE) tools and AI models are increasingly utilized in educational contexts, prior research has largely relied on standard models and one-time scoring sessions. Addressing this gap, the study compares AutoMarkGPT’s performance with that of four human raters who assessed the same 97 assignments. Employing a convergent parallel mixed-methods design, the research integrates quantitative analysis, including t-tests, correlations, and Many-Facet Rasch Measurement via Facets, with qualitative data from post-rating interviews. Results reveal that AutoMarkGPT provided more consistent and generally higher scores than human raters, who demonstrated stricter grading and greater variability due to subjective factors, such as rubric interpretation and professional background. However, AI showed mild fluctuations in scores over time. Findings suggest that blending AI and human input could enhance assessment reliability, provided continuous rater training is ensured.
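As an illustrative aside, the quantitative comparison named in the abstract (t-tests and correlations between AI and human scores) can be sketched in a few lines of Python. The data below are hypothetical placeholders, not the study's scores, and the Many-Facet Rasch Measurement step (run in Facets) is not reproduced here; this is only a minimal sketch of the kind of paired comparison described.

```python
# Illustrative sketch with hypothetical data: paired t-test and Pearson
# correlation between AI and human scores for the same set of assignments.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_essays = 97                                 # sample size reported in the abstract
human = rng.normal(70, 8, n_essays)           # hypothetical human scores
ai = human + rng.normal(3, 4, n_essays)       # hypothetical AI scores, slightly higher

t_stat, p_value = stats.ttest_rel(ai, human)  # paired t-test: AI vs. human means
r, r_p = stats.pearsonr(ai, human)            # correlation between the two score sets

print(f"Paired t-test: t = {t_stat:.2f}, p = {p_value:.3f}")
print(f"Pearson r = {r:.2f} (p = {r_p:.3f})")
print(f"Mean AI score = {ai.mean():.1f}, mean human score = {human.mean():.1f}")
```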
Text
Han & Zheng 2026 (accepted version)
- Accepted Manuscript
Restricted to Repository staff only until 11 August 2027.
More information
Accepted/In Press date: 23 January 2026
e-pub ahead of print date: 11 February 2026
Published date: 11 February 2026
Keywords:
AI in scoring, Automated Writing Evaluation (AWE), Rater behaviour, Score reliability, Score variability
Identifiers
Local EPrints ID: 510517
URI: http://eprints.soton.ac.uk/id/eprint/510517
ISSN: 1879-2529
PURE UUID: 2a93941e-428a-4c9e-a0e2-b2b5418c7fee
Catalogue record
Date deposited: 13 Apr 2026 14:38
Last modified: 14 Apr 2026 01:49
Contributors
Author:
Turgay Han
Author:
Ying Zheng