University of Southampton Institutional Repository

Language Understanding in the Wild: Combining Crowdsourcing and Machine Learning

Simpson, Edwin, Venanzi, Matteo, Reece, Steven, Kohli, Pushmeet, Guiver, John, Roberts, Stephen and Jennings, Nicholas R. (2015) Language Understanding in the Wild: Combining Crowdsourcing and Machine Learning. 24th International World Wide Web Conference (WWW 2015). pp. 992-1002. (doi:10.1145/2736277.2741689).

Record type: Conference or Workshop Item (Paper)

Abstract

Social media has led to the democratisation of opinion sharing. A wealth of information about public opinions, current events, and authors' insights into specific topics can be gained by understanding the text written by users. However, there is wide variation in the language used by different authors in different contexts on the web. This diversity in language makes interpretation an extremely challenging task. Crowdsourcing presents an opportunity to interpret the sentiment or topic of free text. However, the subjectivity and bias of human interpreters raise challenges in inferring the semantics expressed by the text. To overcome this problem, we present a novel Bayesian approach to language understanding that relies on aggregated crowdsourced judgements. Our model encodes the relationships between labels and text features in documents, such as tweets, web articles, and blog posts, accounting for the varying reliability of human labellers. It allows inference of annotations that scales to arbitrarily large pools of documents. Our evaluation shows that, by efficiently exploiting language models learnt from aggregated crowdsourced labels, we can provide up to 25% improved classifications when only a small portion (less than 4%) of documents has been labelled. Compared to six state-of-the-art methods, we reduce by up to 67% the number of crowd responses required to achieve comparable accuracy. Our method was a joint winner of the CrowdFlower - CrowdScale 2013 Shared Task challenge at the Conference on Human Computation and Crowdsourcing (HCOMP 2013).
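
The sketch below is a minimal illustration of the idea described in the abstract, not the paper's actual model or inference scheme: it combines per-worker confusion matrices (to account for varying labeller reliability) with a naive Bayes bag-of-words likelihood (a simple stand-in for the language model) in an EM loop, whereas the paper performs Bayesian inference over a richer joint model. All function and variable names here are illustrative assumptions.

```python
# Simplified EM analogue of aggregating crowd labels with text features.
# Not the authors' implementation; names and parameters are illustrative.
import numpy as np

def aggregate_crowd_and_text(crowd_labels, word_counts, n_classes,
                             n_iters=50, smoothing=1.0):
    """crowd_labels: iterable of (doc_id, worker_id, label) with integer ids.
    word_counts: (n_docs, vocab_size) array of token counts per document.
    Returns an (n_docs, n_classes) array of posterior class probabilities."""
    crowd_labels = list(crowd_labels)
    n_docs, _ = word_counts.shape
    workers = sorted({w for _, w, _ in crowd_labels})
    w_idx = {w: i for i, w in enumerate(workers)}

    # Initialise soft labels from raw vote counts (majority-vote style).
    q = np.ones((n_docs, n_classes))
    for d, w, l in crowd_labels:
        q[d, l] += 1.0
    q /= q.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # M-step: re-estimate the class prior, each worker's confusion
        # matrix, and per-class word distributions from the soft labels.
        prior = q.sum(axis=0) + smoothing
        prior /= prior.sum()

        confusion = np.full((len(workers), n_classes, n_classes), smoothing)
        for d, w, l in crowd_labels:
            confusion[w_idx[w], :, l] += q[d]
        confusion /= confusion.sum(axis=2, keepdims=True)

        word_dist = q.T @ word_counts + smoothing      # (n_classes, vocab)
        word_dist /= word_dist.sum(axis=1, keepdims=True)

        # E-step: combine prior, crowd responses and text likelihood
        # in log space, then renormalise the per-document posteriors.
        log_q = np.tile(np.log(prior), (n_docs, 1))
        log_q += word_counts @ np.log(word_dist).T
        for d, w, l in crowd_labels:
            log_q[d] += np.log(confusion[w_idx[w], :, l])
        log_q -= log_q.max(axis=1, keepdims=True)
        q = np.exp(log_q)
        q /= q.sum(axis=1, keepdims=True)

    return q
```

Because the word distributions are learnt jointly with the label posteriors, documents with few or no crowd responses can still be classified from their text alone, which reflects the abstract's point about improving accuracy when only a small fraction of documents has been labelled.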

Text
WWW15-BCCWords.pdf - Other (2MB)

More information

Accepted/In Press date: January 2015
Published date: 2015
Venue - Dates: 24th International World Wide Web Conference (WWW 2015), 2015-01-01
Organisations: Agents, Interactions & Complexity

Identifiers

Local EPrints ID: 373949
URI: http://eprints.soton.ac.uk/id/eprint/373949
PURE UUID: 22e9ab48-9734-4aac-ba20-2a2d410201e8

Catalogue record

Date deposited: 30 Jan 2015 14:07
Last modified: 14 Mar 2024 18:59


Contributors

Author: Edwin Simpson
Author: Matteo Venanzi
Author: Steven Reece
Author: Pushmeet Kohli
Author: John Guiver
Author: Stephen Roberts
Author: Nicholas R. Jennings

