The University of Southampton
University of Southampton Institutional Repository

Assessing the quality of sources in Wikidata across languages: a hybrid approach


Amaral, Gabriel, Piscopo, Alessandro, Kaffee, Lucie-aimee frimelle F, Rodrigues, Odinaldo and Simperl, Elena (2021) Assessing the quality of sources in Wikidata across languages: a hybrid approach. ACM Journal of Data and Information Quality, 13 (4), 1-35. (doi:10.1145/3484828).

Record type: Article

Abstract

Wikidata is one of the most important sources of structured data on the web, built by a worldwide community of volunteers. As a secondary source, its contents must be backed by credible references; this is particularly important, as Wikidata explicitly encourages editors to add claims for which there is no broad consensus, as long as they are corroborated by references. Nevertheless, despite this essential link between content and references, Wikidata's ability to systematically assess and assure the quality of its references remains limited. To this end, we carry out a mixed-methods study to determine the relevance, ease of access, and authoritativeness of Wikidata references, at scale and in different languages, using online crowdsourcing, descriptive statistics, and machine learning. Building on our previous work, we run a series of microtask experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages. We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata. The findings help us ascertain the quality of references in Wikidata and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web. We also discuss ongoing editorial practices, which could encourage the use of higher-quality references in a more immediate way. All data and code used in the study are available on GitHub for feedback and further improvement and deployment by the research community.

Text
2109.09405 - Accepted Manuscript

More information

Accepted/In Press date: 1 September 2021
e-pub ahead of print date: 15 October 2021
Additional Information: arXiv:2109.09405

Identifiers

Local EPrints ID: 455606
URI: http://eprints.soton.ac.uk/id/eprint/455606
ISSN: 1936-1955
PURE UUID: a8dae652-ee61-4971-9b4e-74a3c9b4efe5

Catalogue record

Date deposited: 29 Mar 2022 16:31
Last modified: 16 Mar 2024 16:16

Contributors

Author: Gabriel Amaral
Author: Alessandro Piscopo
Author: Lucie-aimee frimelle F Kaffee
Author: Odinaldo Rodrigues
Author: Elena Simperl
