Predicting the demographics of Twitter users with programmatic weak supervision
354–390
Tonglet, Jonathan
Jehoul, Astrid
Reusens, Manon
Reusens, Michael
Baesens, Bart
Tonglet, Jonathan, Jehoul, Astrid, Reusens, Manon, Reusens, Michael and Baesens, Bart (2024) Predicting the demographics of Twitter users with programmatic weak supervision. International Transactions in Operational Research, 32, 354–390. (doi:10.1007/s11750-024-00666-y).
Abstract
Predicting the demographics of Twitter users has become a problem of great interest in computational social science. However, the limited number of public datasets with ground-truth labels and the high cost of hand-labeling make this task particularly challenging. Recently, programmatic weak supervision has emerged as a new framework for training classifiers on noisy data with minimal human labeling effort. In this paper, demographic prediction is framed for the first time as a programmatic weak supervision problem. A new three-step methodology for gender, age category, and location prediction is provided, which outperforms traditional programmatic weak supervision and is competitive with the state-of-the-art deep learning model. The study is performed in Flanders, a small Dutch-speaking European region, characterized by a limited number of user profiles and tweets. An evaluation conducted on an independent hand-labeled test set shows that the proposed methodology can be generalized to unseen users within the geographic area of interest.
Text
Paper_submission_TOP (2)
- Accepted Manuscript
Restricted to Repository staff only until 22 January 2026.
More information
Accepted/In Press date: 22 January 2024
e-pub ahead of print date: 8 February 2024
Identifiers
Local EPrints ID: 486509
URI: http://eprints.soton.ac.uk/id/eprint/486509
ISSN: 0969-6016
PURE UUID: 8f351bcd-d18d-496b-9aef-61265f6e5efb
Catalogue record
Date deposited: 24 Jan 2024 17:59
Last modified: 29 Sep 2025 17:48
Contributors
Author: Jonathan Tonglet
Author: Astrid Jehoul
Author: Manon Reusens
Author: Michael Reusens
Author: Bart Baesens