Considering assessor agreement in IR evaluation
Maddalena, Eddy, Roitero, Kevin, Demartini, Gianluca and Mizzaro, Stefano (2017) Considering assessor agreement in IR evaluation. In ICTIR 2017: Proceedings of the 2017 ACM SIGIR International Conference on the Theory of Information Retrieval. Association for Computing Machinery, pp. 75-82. (doi:10.1145/3121050.3121060)
Record type: Conference or Workshop Item (Paper)
Abstract
The agreement between relevance assessors is an important but understudied topic in the Information Retrieval literature, mainly because of the limited data available about documents assessed by multiple judges. The issue has gained even more importance recently in light of crowdsourced relevance judgments, where it is customary to gather many relevance labels for each topic-document pair. In a crowdsourcing setting, agreement is often even used as a proxy for quality, although without any systematic verification of the conjecture that higher agreement corresponds to higher quality. In this paper we address this issue and study in particular: the effect of topic on assessor agreement; the relationship between assessor agreement and judgment quality; the effect of agreement on the ranking of systems by their effectiveness; and the definition of an agreement-aware effectiveness metric that does not discard information about multiple judgments for the same document, as typically happens in a crowdsourcing setting.
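The abstract does not name a specific agreement measure, so the following Python sketch is purely illustrative: it computes Fleiss' kappa, a standard chance-corrected agreement statistic, over hypothetical crowd relevance labels for topic-document pairs. The function name, the example data, and the choice of measure are assumptions for illustration, not taken from the paper.

from collections import Counter

def fleiss_kappa(label_matrix):
    """Fleiss' kappa for items each judged by the same number of assessors.
    label_matrix: list of rows, one per topic-document pair; each row is the
    list of relevance labels given by the assessors (illustrative only)."""
    n_items = len(label_matrix)
    n_raters = len(label_matrix[0])
    categories = sorted({lab for labels in label_matrix for lab in labels})

    # n_ij: how many assessors put item i into category j
    counts = [Counter(labels) for labels in label_matrix]

    # Per-item observed agreement P_i, then its mean P_bar
    p_i = [
        (sum(c ** 2 for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(p_i) / n_items

    # Expected agreement P_e from the marginal category proportions
    p_j = [sum(cnt[cat] for cnt in counts) / (n_items * n_raters) for cat in categories]
    p_e = sum(p ** 2 for p in p_j)

    return (p_bar - p_e) / (1 - p_e)

# Hypothetical crowd judgments: rows are topic-document pairs, columns are
# binary relevance labels (0 = non-relevant, 1 = relevant) from three assessors.
judgments = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
]
print(f"Fleiss' kappa: {fleiss_kappa(judgments):.3f}")

Values of kappa near 1 indicate near-perfect agreement and values near 0 indicate agreement no better than chance; whether such a score is a good proxy for judgment quality is exactly the conjecture the paper examines.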
More information
Published date: 1 October 2017
Venue - Dates:
7th ACM SIGIR International Conference on the Theory of Information Retrieval, ICTIR 2017, Amsterdam, Netherlands, 2017-10-01 - 2017-10-04
Keywords:
Agreement, Disagreement, Evaluation, Test collections, TREC
Identifiers
Local EPrints ID: 420463
URI: http://eprints.soton.ac.uk/id/eprint/420463
PURE UUID: 9a8a78c4-d637-46d5-bce1-60cf8612ada7
Catalogue record
Date deposited: 08 May 2018 16:30
Last modified: 17 Mar 2024 12:03
Contributors
Author: Eddy Maddalena
Author: Kevin Roitero
Author: Gianluca Demartini
Author: Stefano Mizzaro