AI and human scoring for postgraduate writing: evaluating score reliability, variability, and rater behaviours
Han, Turgay
e7fe202c-a0dd-426b-81e9-5beaf4458d7a
Zheng, Ying
abc38a5e-a4ba-460e-92e2-b766d11d2b29
11 February 2026
Han, Turgay and Zheng, Ying (2026) AI and human scoring for postgraduate writing: evaluating score reliability, variability, and rater behaviours. Studies in Educational Evaluation, 88, [101572]. (doi:10.1016/j.stueduc.2026.101572).
Abstract
This study examines the reliability and consistency of AutoMarkGPT, a customized version of ChatGPT-4.0, in scoring postgraduate writing assignments across multiple time intervals. While Automated Writing Evaluation (AWE) tools and AI models are increasingly utilized in educational contexts, prior research has largely relied on standard models and one-time scoring sessions. Addressing this gap, the study compares AutoMarkGPT’s performance with that of four human raters who assessed the same 97 assignments. Employing a convergent parallel mixed-methods design, the research integrates quantitative analysis, including t-tests, correlations, and Many-Facet Rasch Measurement via Facets, with qualitative data from post-rating interviews. Results reveal that AutoMarkGPT provided more consistent and generally higher scores than human raters, who demonstrated stricter grading and greater variability due to subjective factors, such as rubric interpretation and professional background. However, AI showed mild fluctuations in scores over time. Findings suggest that blending AI and human input could enhance assessment reliability, provided continuous rater training is ensured.
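As an illustrative aside, the quantitative comparison named in the abstract (t-tests and correlations between AI and human scores) can be sketched in a few lines of Python. The data below are hypothetical placeholders, not the study's scores, and the Many-Facet Rasch Measurement step (run in Facets) is not reproduced here; this is only a minimal sketch of the kind of paired comparison described.

```python
# Illustrative sketch with hypothetical data: paired t-test and Pearson
# correlation between AI and human scores for the same set of assignments.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_essays = 97                                 # sample size reported in the abstract
human = rng.normal(70, 8, n_essays)           # hypothetical human scores
ai = human + rng.normal(3, 4, n_essays)       # hypothetical AI scores, slightly higher

t_stat, p_value = stats.ttest_rel(ai, human)  # paired t-test: AI vs. human means
r, r_p = stats.pearsonr(ai, human)            # correlation between the two score sets

print(f"Paired t-test: t = {t_stat:.2f}, p = {p_value:.3f}")
print(f"Pearson r = {r:.2f} (p = {r_p:.3f})")
print(f"Mean AI score = {ai.mean():.1f}, mean human score = {human.mean():.1f}")
```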
Text
Han & Zheng 2026 (accepted version)
- Accepted Manuscript
Restricted to Repository staff only until 11 August 2027.
More information
Accepted/In Press date: 23 January 2026
e-pub ahead of print date: 11 February 2026
Published date: 11 February 2026
Keywords:
AI in scoring, Automated Writing Evaluation (AWE), Rater behaviour, Score reliability, Score variability
Identifiers
Local EPrints ID: 510517
URI: http://eprints.soton.ac.uk/id/eprint/510517
ISSN: 1879-2529
PURE UUID: 2a93941e-428a-4c9e-a0e2-b2b5418c7fee
Catalogue record
Date deposited: 13 Apr 2026 14:38
Last modified: 14 Apr 2026 01:49
Contributors
Author:
Turgay Han
Author:
Ying Zheng