Benchmark evaluation for tasks with highly subjective crowdsourced annotations: Case study in argument mining of political debates
Mestre, Rafael
Ryan, Matt
Middleton, Stuart E
Gomer, Richard
Gheasi, Masood
Zhu, Jiatong
Norman, Timothy
1 June 2023
Mestre, Rafael, Ryan, Matt, Middleton, Stuart E, Gomer, Richard, Gheasi, Masood, Zhu, Jiatong and Norman, Timothy (2023) Benchmark evaluation for tasks with highly subjective crowdsourced annotations: Case study in argument mining of political debates. In Workshop Proceedings of the 17th International AAAI Conference on Web and Social Media. AAAI Press. 12 pp. (doi:10.36190/2023.52).
Record type: Conference or Workshop Item (Paper)
Abstract
This paper assesses the feasibility of using crowdsourcing techniques for subjective tasks, such as the identification of argumentative relations in political debates, and analyses the resulting inter-annotator agreement metrics, common sources of error, and disagreements. We aim to address how best to evaluate subjective crowdsourced annotations, which often exhibit significant annotator disagreement and contribute to a "quality crisis" in crowdsourcing. To do this, we compare two datasets of crowd annotations for argumentation mining: one produced by an open crowd with quality control settings, and the other by a small group of master annotators without these settings but with several rounds of feedback. Our results show high levels of disagreement between annotators, reflected in a rather low Krippendorff's alpha, a commonly used inter-annotator agreement metric. This metric also fluctuates greatly and is highly sensitive to the amount of overlap between annotators, whereas other common metrics such as Cohen's and Fleiss' kappa are not suitable for this task due to their underlying assumptions. We evaluate the appropriateness of Krippendorff's alpha for this type of annotation and find that it may not be suitable when many annotators each code only small subsets of the data. This highlights the need for more robust evaluation metrics for subjective crowdsourcing tasks. Our datasets provide a benchmark for future research in this area and can be used to increase data quality, inform the design of further work, and mitigate common errors in subjective coding, particularly in argumentation mining.
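To make the agreement computation discussed above concrete, the sketch below implements Krippendorff's alpha for nominal labels over a sparse annotator-by-item matrix of the kind an open crowd produces, where each worker codes only a small subset of items and only items with at least two codings are pairable. The function, label scheme and toy matrix are illustrative assumptions for this record and are not taken from the paper's datasets or code.

import numpy as np

def krippendorff_alpha_nominal(data):
    """Krippendorff's alpha for nominal labels with missing data.

    data: array of shape (annotators, units); np.nan marks items an
    annotator did not code. Only units coded by at least two
    annotators ("pairable" units) contribute to the estimate.
    """
    data = np.asarray(data, dtype=float)
    data = data[:, (~np.isnan(data)).sum(axis=0) >= 2]  # keep pairable units only

    values = np.unique(data[~np.isnan(data)])
    # n_uc[u, c]: number of annotators who assigned value c to unit u.
    n_uc = np.stack([(data == v).sum(axis=0) for v in values], axis=1)
    m_u = n_uc.sum(axis=1)   # codings per unit
    n = m_u.sum()            # total pairable codings

    # Observed disagreement: mismatched coding pairs within each unit.
    disagree_u = m_u ** 2 - (n_uc ** 2).sum(axis=1)
    d_o = (disagree_u / (m_u - 1)).sum() / n

    # Expected disagreement from the pooled label distribution.
    n_c = n_uc.sum(axis=0)
    d_e = (n ** 2 - (n_c ** 2).sum()) / (n * (n - 1))

    return 1.0 - d_o / d_e

# Hypothetical sparse crowd matrix: 6 annotators, 8 argument pairs,
# labels 0 = no relation, 1 = support, 2 = attack; np.nan = item not
# shown to that annotator (toy data, not from the paper's datasets).
nan = np.nan
annotations = np.array([
    [1,   1,   nan, 2,   nan, nan, 0,   nan],
    [1,   2,   0,   nan, nan, 1,   nan, nan],
    [nan, 1,   0,   2,   1,   nan, nan, 0  ],
    [nan, nan, 0,   2,   1,   1,   0,   nan],
    [1,   nan, nan, nan, 2,   1,   nan, 0  ],
    [nan, 1,   0,   nan, nan, nan, 0,   0  ],
])
print(f"Krippendorff's alpha = {krippendorff_alpha_nominal(annotations):.3f}")

Because only pairable codings enter the estimate, adding or removing a few overlapping items in a sparse matrix like this can shift alpha noticeably, which is the sensitivity to annotator overlap that the abstract describes.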
Text: 2023_52 - Version of Record (Restricted to Repository staff only)
More information
e-pub ahead of print date: 1 June 2023
Published date: 1 June 2023
Identifiers
Local EPrints ID: 479080
URI: http://eprints.soton.ac.uk/id/eprint/479080
PURE UUID: 29d67e97-8e6d-4b82-b6c4-a20301d4f867
Catalogue record
Date deposited: 19 Jul 2023 17:11
Last modified: 18 Mar 2024 03:14