University of Southampton Institutional Repository

Artificial intelligence-generated and human expert-designed vocabulary tests: a comparative study

Luo, Yunjiu, Wei, Wei and Zheng, Ying (2022) Artificial intelligence-generated and human expert-designed vocabulary tests: a comparative study. SAGE Open, 12 (1), 1-12. (doi:10.1177/21582440221082130).

Record type: Article

Abstract

Artificial intelligence (AI) technologies have the potential to reduce the workload of second language (L2) teachers and test developers. We propose two AI distractor-generating methods for creating Chinese vocabulary items: semantic similarity and visual similarity. Semantic similarity refers to antonyms and synonyms, while visual similarity refers to two phrases sharing one or more characters. This study explores the construct validity of two types of selected-response vocabulary tests (AI-generated items and human expert-designed items) and compares their item difficulty and item discrimination. Both quantitative and qualitative data were collected. Seventy-eight students from Beijing Language and Culture University responded to both the AI-generated and the human expert-designed items. Students’ scores were analyzed using the two-parameter item response theory (2PL-IRT) model. Thirteen students were then invited to report their test-taking strategies in a think-aloud session. The students’ item responses revealed that the human expert-designed items were easier but more discriminating than the AI-generated items. The think-aloud data indicated that the AI-generated and expert-designed items might assess different constructs: the former elicited test takers’ bottom-up test-taking strategies, while the latter seemed more likely to trigger test takers’ rote memorization.
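
For context, the 2PL-IRT model named in the abstract is the standard two-parameter logistic model: the probability that a test taker with latent ability \theta answers item i correctly is

P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}

where a_i is the item’s discrimination and b_i its difficulty, the two parameters the study compares across AI-generated and expert-designed items.

The visual-similarity method (distractors that share characters with the target phrase) can be sketched in a few lines of Python. This is a minimal illustration under assumed names; the function, the candidate lexicon, and the ranking rule are hypothetical, not the authors’ implementation:

def visual_distractors(target, lexicon, k=3):
    """Rank candidate phrases by the number of characters shared with the
    target. In the paper's sense, a candidate sharing one or more
    characters is 'visually similar'. Illustrative sketch only."""
    scored = [(len(set(target) & set(w)), w) for w in lexicon if w != target]
    scored.sort(reverse=True)  # most shared characters first
    return [w for score, w in scored[:k] if score > 0]

# Hypothetical usage: candidate distractors for the target 学习 ("to study")
print(visual_distractors("学习", ["学生", "练习", "老师", "学校"]))
# Returns the three phrases that each share a character with 学习;
# 老师 shares none and is excluded.

A semantic-similarity generator would instead draw candidates from a synonym/antonym resource; the abstract does not specify which lexicon the authors used.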

Text
Accepted Manuscript_Sage open 2022
Available under License Creative Commons Attribution.
Download (529kB)

More information

Published date: 1 January 2022
Additional Information: Publisher Copyright: © The Author(s) 2022.
Keywords: Artificial intelligence, Computerised test, construct validity, vocabulary test

Identifiers

Local EPrints ID: 454476
URI: http://eprints.soton.ac.uk/id/eprint/454476
ISSN: 2158-2440
PURE UUID: c4d38f2f-436e-4688-88ca-cd1a28a12614
ORCID for Ying Zheng: orcid.org/0000-0003-2574-0358

Catalogue record

Date deposited: 10 Feb 2022 17:44
Last modified: 17 Mar 2024 07:07

Contributors

Author: Yunjiu Luo
Author: Wei Wei
Author: Ying Zheng

