The University of Southampton
University of Southampton Institutional Repository

Using Twitter data for demographic research

Background: social media data is a promising source of social science data. However, deriving the demographic characteristics of users and dealing with the nonrandom, nonrepresentative populations from which they are drawn represent challenges for social scientists.

Objective: given the growing use of social media data in social science research, this paper asks two questions: 1) to what extent are findings obtained with social media data generalizable to broader populations, and 2) what is the best practice for estimating demographic information from Twitter data?

Methods: our analyses use information gathered from 979,992 geo-located Tweets sent by 22,356 unique users in South East England between 23 June and 4 July 2014. We estimate demographic characteristics of the Twitter users with the crowd-sourcing platform CrowdFlower and the image-recognition software Face++. To evaluate bias in the data, we run a series of log-linear models with offsets and calibrate the nonrepresentative sample of Twitter users with mid-year population estimates for South East England.

Results: CrowdFlower proves to be more accurate than Face++ for the measurement of age, whereas both tools are highly reliable for measuring the sex of Twitter users. The calibration exercise allows bias correction in the age-, sex-, and location-specific population counts obtained from the Twitter population by augmenting Twitter data with mid-year population estimates.

Contribution: the paper proposes best practices for estimating Twitter users’ basic demographic characteristics and a calibration method to address the selection bias in the Twitter population, allowing researchers to generalize findings based on Twitter to the general population.
Pages: 1477–1514
Yildiz, Dilek
Munson, Joanna, Elizabeth
Vitali, Agnese
Tinati, Ramine
Holland, Jennifer

Yildiz, Dilek, Munson, Joanna, Elizabeth, Vitali, Agnese, Tinati, Ramine and Holland, Jennifer (2017) Using Twitter data for demographic research. Demographic Research, 37, 1477–1514. (doi:10.4054/DemRes.2017.37.46).

Record type: Article

Text
37-46 - Version of Record
Available under License Creative Commons Attribution.

More information

Accepted/In Press date: 1 April 2016
e-pub ahead of print date: 22 November 2017
Published date: 22 November 2017

Identifiers

Local EPrints ID: 415799
URI: https://eprints.soton.ac.uk/id/eprint/415799
PURE UUID: d854def7-16bb-46e7-a5a1-113d138ab294
ORCID for Joanna, Elizabeth Munson: orcid.org/0000-0003-1050-2795
ORCID for Agnese Vitali: orcid.org/0000-0003-0029-9447

Catalogue record

Date deposited: 24 Nov 2017 17:30
Last modified: 08 Oct 2019 00:36

Contributors

Author: Dilek Yildiz
Author: Joanna, Elizabeth Munson
Author: Agnese Vitali
Author: Ramine Tinati
Author: Jennifer Holland
