On fine-grained relevance scales

In Information Retrieval evaluation, the classical approach of adopting binary relevance judgments has been replaced by multi-level relevance judgments and by gain-based metrics leveraging such multi-level judgment scales. Recent work has also proposed and evaluated unbounded relevance scales by means of Magnitude Estimation (ME) and compared them with multi-level scales. While ME brings advantages like the ability for assessors to always judge the next document as having higher or lower relevance than any of the documents they have judged so far, it also comes with some drawbacks. For example, it is not a natural approach for human assessors to judge items as they are used to do on the Web (e.g., 5-star rating). In this work, we propose and experimentally evaluate a bounded and fine-grained relevance scale having many of the advantages and dealing with some of the issues of ME. We collect relevance judgments over a 100-level relevance scale (S100) by means of a large-scale crowdsourcing experiment and compare the results with other relevance scales (binary, 4-level, and ME) showing the benefit of fine-grained scales over both coarse-grained and unbounded scales as well as highlighting some new results on ME. Our results show that S100 maintains the flexibility of unbounded scales like ME in providing assessors with ample choice when judging document relevance (i.e., assessors can fit relevance judgments in between of previously given judgments). It also allows assessors to judge on a more familiar scale (e.g., on 10 levels) and to perform efficiently since the very first judging task.

IR Evaluation, Relevance Scales

10.1145/3209978.3210052

675-684

Association for Computing Machinery

Roitero, Kevin

71dbbb60-a1e9-431e-930d-fbd498c6559f

Maddalena, Eddy

397dbaba-4363-4c11-8e52-4a7ba4df4bae

Demartini, Gianluca

2da91fe3-eac2-42d8-8450-b7d74b1d0209

Mizzaro, Stefano

7be30144-afe5-42bb-a861-59a451decc20

2018