University of Southampton Institutional Repository

Language Understanding in the Wild: Combining Crowdsourcing and Machine Learning

Simpson, Edwin, Venanzi, Matteo, Reece, Steven, Kohli, Pushmeet, Guiver, John, Roberts, Stephen and Jennings, Nicholas R. (2015) Language Understanding in the Wild: Combining Crowdsourcing and Machine Learning. 24th International World Wide Web Conference (WWW 2015). pp. 992-1002. (doi:10.1145/2736277.2741689).

Record type: Conference or Workshop Item (Paper)

Abstract

Social media has led to the democratisation of opinion sharing. A wealth of information about public opinions, current events, and authors' insights into specific topics can be gained by understanding the text written by users. However, there is wide variation in the language used by different authors in different contexts on the web. This diversity in language makes interpretation an extremely challenging task. Crowdsourcing presents an opportunity to interpret the sentiment or topic of free text. However, the subjectivity and bias of human interpreters raise challenges in inferring the semantics expressed by the text. To overcome this problem, we present a novel Bayesian approach to language understanding that relies on aggregated crowdsourced judgements. Our model encodes the relationships between labels and text features in documents, such as tweets, web articles, and blog posts, accounting for the varying reliability of human labellers. It allows inference of annotations that scales to arbitrarily large pools of documents. Our evaluation shows that, by efficiently exploiting language models learnt from aggregated crowdsourced labels, we can provide up to 25% improved classifications when only a small portion (less than 4%) of documents has been labelled. Compared to six state-of-the-art methods, we reduce by up to 67% the number of crowd responses required to achieve comparable accuracy. Our method was a joint winner of the CrowdFlower - CrowdScale 2013 Shared Task challenge at the Conference on Human Computation and Crowdsourcing (HCOMP 2013).
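
The sketch below is a minimal illustration of the idea described in the abstract, not the paper's actual model or inference scheme: it combines per-worker confusion matrices (to account for varying labeller reliability) with a naive Bayes bag-of-words likelihood (a simple stand-in for the language model) in an EM loop, whereas the paper performs Bayesian inference over a richer joint model. All function and variable names here are illustrative assumptions.

```python
# Simplified EM analogue of aggregating crowd labels with text features.
# Not the authors' implementation; names and parameters are illustrative.
import numpy as np

def aggregate_crowd_and_text(crowd_labels, word_counts, n_classes,
                             n_iters=50, smoothing=1.0):
    """crowd_labels: iterable of (doc_id, worker_id, label) with integer ids.
    word_counts: (n_docs, vocab_size) array of token counts per document.
    Returns an (n_docs, n_classes) array of posterior class probabilities."""
    crowd_labels = list(crowd_labels)
    n_docs, _ = word_counts.shape
    workers = sorted({w for _, w, _ in crowd_labels})
    w_idx = {w: i for i, w in enumerate(workers)}

    # Initialise soft labels from raw vote counts (majority-vote style).
    q = np.ones((n_docs, n_classes))
    for d, w, l in crowd_labels:
        q[d, l] += 1.0
    q /= q.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # M-step: re-estimate the class prior, each worker's confusion
        # matrix, and per-class word distributions from the soft labels.
        prior = q.sum(axis=0) + smoothing
        prior /= prior.sum()

        confusion = np.full((len(workers), n_classes, n_classes), smoothing)
        for d, w, l in crowd_labels:
            confusion[w_idx[w], :, l] += q[d]
        confusion /= confusion.sum(axis=2, keepdims=True)

        word_dist = q.T @ word_counts + smoothing      # (n_classes, vocab)
        word_dist /= word_dist.sum(axis=1, keepdims=True)

        # E-step: combine prior, crowd responses and text likelihood
        # in log space, then renormalise the per-document posteriors.
        log_q = np.tile(np.log(prior), (n_docs, 1))
        log_q += word_counts @ np.log(word_dist).T
        for d, w, l in crowd_labels:
            log_q[d] += np.log(confusion[w_idx[w], :, l])
        log_q -= log_q.max(axis=1, keepdims=True)
        q = np.exp(log_q)
        q /= q.sum(axis=1, keepdims=True)

    return q
```

Because the word distributions are learnt jointly with the label posteriors, documents with few or no crowd responses can still be classified from their text alone, which reflects the abstract's point about improving accuracy when only a small fraction of documents has been labelled.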

Text
WWW15-BCCWords.pdf - Other (2MB)

More information

Accepted/In Press date: January 2015
Published date: 2015
Venue - Dates: 24th International World Wide Web Conference (WWW 2015), 2015-01-01
Organisations: Agents, Interactions & Complexity

Identifiers

Local EPrints ID: 373949
URI: http://eprints.soton.ac.uk/id/eprint/373949
PURE UUID: 22e9ab48-9734-4aac-ba20-2a2d410201e8

Catalogue record

Date deposited: 30 Jan 2015 14:07
Last modified: 14 Mar 2024 18:59


Contributors

Author: Edwin Simpson
Author: Matteo Venanzi
Author: Steven Reece
Author: Pushmeet Kohli
Author: John Guiver
Author: Stephen Roberts
Author: Nicholas R. Jennings

