The University of Southampton
University of Southampton Institutional Repository

Revisiting reverts: accurate revert detection in Wikipedia


Flöck, F., Vrandecic, D. and Simperl, E. (2012) Revisiting reverts: accurate revert detection in Wikipedia. Proceedings of the 23rd ACM Conference on Hypertext and Social Media, Milwaukee, United States. 25 - 28 Jun 2012. pp. 3-12. (doi:10.1145/2309996.2310000).

Record type: Conference or Workshop Item (Paper)

Abstract

Wikipedia is commonly used as a proving ground for research in collaborative systems. This is likely due to its popularity and scale, but also to the fact that large amounts of data about its formation and evolution are freely available to inform and validate theories and models of online collaboration. As part of the development of such approaches, revert detection is often performed as an important pre-processing step in tasks as diverse as the extraction of implicit networks of editors, the analysis of edit or editor features and the removal of noise when analyzing the emergence of the content of an article. The current state of the art in revert detection is based on a rather naive approach, which identifies revision duplicates based on MD5 hash values. This is an efficient, but not very precise technique that forms the basis for the majority of research based on revert relations in Wikipedia. In this paper we prove that this method has a number of important drawbacks: it only detects a limited number of reverts, while simultaneously misclassifying too many edits as reverts, and not distinguishing between complete and partial reverts. This is very likely to hamper the accurate interpretation of the findings of revert-related research. We introduce an improved algorithm for the detection of reverts based on word tokens added or deleted to address these drawbacks. We report on the results of a user study and other tests demonstrating the considerable gains in accuracy and coverage by our method, and argue for a positive trade-off, in certain research scenarios, between these improvements and our algorithm's increased runtime.
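For orientation, the MD5-based baseline that the abstract criticises can be sketched roughly as follows. This is an illustrative sketch only, not code from the paper: the function names `md5_of` and `detect_identity_reverts`, the input format (a chronological list of revision texts) and the shape of the returned tuples are all assumptions made for this example. It does not implement the authors' improved word-token algorithm, which this record does not describe in detail.

```python
import hashlib


def md5_of(text: str) -> str:
    """Return the hex MD5 digest of a revision's full text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def detect_identity_reverts(revisions):
    """Find revisions whose text duplicates an earlier revision.

    revisions: list of revision texts in chronological order (assumed format).
    Returns a list of (reverting_index, restored_index, reverted_indices).
    """
    last_seen = {}  # digest -> index of the most recent revision with that text
    reverts = []
    for i, text in enumerate(revisions):
        digest = md5_of(text)
        j = last_seen.get(digest)
        # A duplicate of a non-adjacent earlier revision is treated as a
        # revert of everything in between; adjacent duplicates are skipped.
        if j is not None and j < i - 1:
            reverts.append((i, j, list(range(j + 1, i))))
        last_seen[digest] = i
    return reverts


# Hypothetical three-revision history: the third restores the first.
history = ["the cat sat", "the cat sat on SPAM", "the cat sat"]
print(detect_identity_reverts(history))  # [(2, 0, [1])]
```

Because only byte-identical texts share a hash, a revision that undoes part of an earlier edit while changing anything else goes undetected in this sketch, which corresponds to the partial-revert limitation named in the abstract.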

This record has no associated files available for download.

More information

Published date: June 2012
Venue - Dates: Proceedings of the 23rd ACM Conference on Hypertext and Social Media, Milwaukee, United States, 2012-06-25 - 2012-06-28
Organisations: Web & Internet Science

Identifiers

Local EPrints ID: 351610
URI: http://eprints.soton.ac.uk/id/eprint/351610
PURE UUID: a9d84369-2da9-48dc-9b8c-c34ab0c3e432
ORCID for E. Simperl: orcid.org/0000-0003-1722-947X

Catalogue record

Date deposited: 29 Apr 2013 13:44
Last modified: 14 Mar 2024 13:41

Contributors

Author: F. Flöck
Author: D. Vrandecic
Author: E. Simperl
