University of Southampton Institutional Repository

Benchmark evaluation for tasks with highly subjective crowdsourced annotations: Case study in argument mining of political debates

AAAI Press

Mestre, Rafael, Ryan, Matt, Middleton, Stuart E, Gomer, Richard, Gheasi, Masood, Zhu, Jiatong and Norman, Timothy (2023) Benchmark evaluation for tasks with highly subjective crowdsourced annotations: Case study in argument mining of political debates. In Workshop Proceedings of the 17th International AAAI Conference on Web and Social Media. AAAI Press. 12 pp. (doi:10.36190/2023.52).

Record type: Conference or Workshop Item (Paper)

Abstract

This paper assesses the feasibility of using crowdsourcing techniques for subjective tasks, such as the identification of argumentative relations in political debates, and analyses their inter-annotator metrics, common sources of error and disagreements. We aim to address how best to evaluate subjective crowdsourced annotations, which often exhibit significant annotator disagreement and contribute to a "quality crisis" in crowdsourcing. To do this, we compare two datasets of crowd annotations for argumentation mining: one produced by an open crowd with quality-control settings, and one by a small group of master annotators without these settings but with several rounds of feedback. Our results show high levels of disagreement between annotators, with a rather low Krippendorff's alpha, a commonly used inter-annotator metric. This metric also fluctuates greatly and is highly sensitive to the amount of overlap between annotators, whereas other common metrics, such as Cohen's and Fleiss' kappa, are not suitable for this task because of their underlying assumptions. We evaluate the appropriateness of Krippendorff's alpha for this type of annotation and find that it may not be suitable when many annotators each code only a small subset of the data. This highlights the need for more robust evaluation metrics for subjective crowdsourcing tasks. Our datasets provide a benchmark for future research in this area and can be used to increase data quality, inform the design of further work, and mitigate common errors in subjective coding, particularly in argumentation mining.
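
The abstract's central technical point, that Krippendorff's alpha is computed only over items coded by two or more annotators and therefore becomes unstable when many annotators each label small subsets, can be illustrated with the standard nominal-data formulation of the coefficient. The sketch below is a minimal illustration rather than the paper's released code: the function name, the toy relation labels (attack/support/neither) and the example annotations are assumptions of ours.

from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels (illustrative sketch).

    `units` maps each annotated item to the list of labels it received,
    one per annotator. Items with fewer than two labels are ignored,
    which is exactly what makes the metric sensitive to annotator overlap.
    """
    # Coincidence counts, built only from items coded by >= 2 annotators.
    coincidences = Counter()
    for labels in units.values():
        m = len(labels)
        if m < 2:
            continue
        for c, k in permutations(labels, 2):     # ordered pairs of values
            coincidences[(c, k)] += 1.0 / (m - 1)

    n = sum(coincidences.values())               # number of pairable values
    if n <= 1:
        return float("nan")                      # not enough overlap to estimate

    marginals = Counter()
    for (c, _), v in coincidences.items():
        marginals[c] += v

    # Observed vs. expected disagreement; nominal distance is 0 or 1.
    d_o = sum(v for (c, k), v in coincidences.items() if c != k) / n
    d_e = sum(marginals[c] * marginals[k]
              for c, k in permutations(marginals, 2)) / (n * (n - 1))
    return 1.0 if d_e == 0 else 1.0 - d_o / d_e

# Hypothetical sparse crowd data: singly-annotated items contribute nothing,
# so alpha is driven entirely by the few overlapping items.
crowd = {
    "pair_01": ["attack", "support"],
    "pair_02": ["support", "support"],
    "pair_03": ["attack"],                       # no overlap, ignored
    "pair_04": ["neither", "attack", "attack"],
}
print(krippendorff_alpha_nominal(crowd))

Because singly-annotated items are discarded, dropping or adding even one doubly-annotated item changes the coincidence counts and can shift alpha noticeably, which is the kind of fluctuation the abstract reports for sparse crowd annotations.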

Text: 2023_52 - Version of Record (restricted to repository staff only)

More information

e-pub ahead of print date: 1 June 2023
Published date: 1 June 2023

Identifiers

Local EPrints ID: 479080
URI: http://eprints.soton.ac.uk/id/eprint/479080
PURE UUID: 29d67e97-8e6d-4b82-b6c4-a20301d4f867
ORCID for Rafael Mestre: orcid.org/0000-0002-2460-4234
ORCID for Matt Ryan: orcid.org/0000-0002-8693-5063
ORCID for Stuart E Middleton: orcid.org/0000-0001-8305-8176
ORCID for Richard Gomer: orcid.org/0000-0001-8866-3738
ORCID for Timothy Norman: orcid.org/0000-0002-6387-4034

Catalogue record

Date deposited: 19 Jul 2023 17:11
Last modified: 18 Mar 2024 03:14

Contributors

Author: Rafael Mestre
Author: Matt Ryan
Author: Stuart E Middleton
Author: Richard Gomer
Author: Masood Gheasi
Author: Jiatong Zhu
Author: Timothy Norman
