The University of Southampton
University of Southampton Institutional Repository

Topics of statistical analysis with social media data


Patone, Martina (2020) Topics of statistical analysis with social media data. University of Southampton, Doctoral Thesis, 171pp.

Record type: Thesis (Doctoral)

Abstract

This thesis investigates the use of social media data in social research from a statistical perspective. A broad review is given of how these data have been used by researchers from different disciplines, and the extent and means of the investigations carried out with these data are assessed. Special attention is given to the common obstacles faced when using social media data for statistical analysis, to the graph representation of these data that is generally available to the researcher, and to the use of that representation for statistical inference.

Most of the literature about the use of social media data for statistical analysis is concerned with the fact that these data represent a non-random sample from the population of interest. We have instead highlighted another fundamental challenge presented by these data, one that is rarely taken explicitly into consideration: the object of sampling and the unit of interest might be distinct. To tackle this problem, we have shown how two different approaches to statistical inference can be distinguished in the literature. Under each approach, we have discussed the target of inference and made explicit the limitations in relation to the statistical methods used. Our exposition offers a framework for dealing with unruly data sources.

However, the problems of non-random sampling and of various unavoidable non-sampling errors do not admit a universally valid statistical approach. One can cope with them if need be, but one cannot really hope to solve them. Meanwhile, the graph structure inherent in social media data (and other forms of big data) seems to us a more rewarding area of research.

We have investigated how to use the structure of the graph for estimation. The Horvitz-Thompson (HT) estimator operates by weighting each sample motif by the inverse of its inclusion probability. Generalising the work of Birnbaum and Sirken (1965), we have demonstrated that infinitely many types of incidence weights can be constructed for unbiased estimation. We define the Incidence Weighting Estimator (IWE) as a large class of linear, design-based unbiased estimators based on the edges of the Bipartite Incidence Graph (BIG), of which the HT estimator is a special case. This class of estimators has no equivalent in traditional list sampling.
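
As a rough illustration of the two ideas in the paragraph above, the following Python sketch builds a small, entirely hypothetical BIG and checks design-unbiasedness by enumerating every sample. It assumes simple random sampling without replacement of n = 2 out of N = 4 sampling units, and that a motif is observed whenever at least one of its incident sampling units is sampled. The second estimator is the equal-share multiplicity estimator in the spirit of Birnbaum and Sirken (1965); it is only one possible choice of incidence weights and is not meant to reproduce the general IWE of the thesis. All names (BETA, Y, ht_estimate, multiplicity_estimate) are invented for this example.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

# Hypothetical toy BIG: each motif maps to the set of sampling units
# incident to it (an edge means "sampling this unit reveals the motif").
BETA = {"a": {0}, "b": {0, 1}, "c": {1, 2, 3}}
Y = {"a": 3, "b": 5, "c": 10}   # made-up motif values; true total is 18
N, n = 4, 2                     # assumed design: SRS without replacement


def pi_unit():
    """Inclusion probability of a single sampling unit under SRS."""
    return Fraction(n, N)


def pi_motif(k):
    """Probability that at least one unit incident to motif k is sampled."""
    m = len(BETA[k])
    return 1 - Fraction(comb(N - m, n), comb(N, n))


def ht_estimate(sample):
    """HT: weight each observed motif by its inverse inclusion probability."""
    observed = [k for k, units in BETA.items() if units & set(sample)]
    return sum(Fraction(Y[k]) / pi_motif(k) for k in observed)


def multiplicity_estimate(sample):
    """Equal-share incidence weights: y_k is split evenly over its edges."""
    total = Fraction(0)
    for i in sample:
        for k, units in BETA.items():
            if i in units:
                total += Fraction(Y[k], len(units)) / pi_unit()
    return total


def expectation(estimator):
    """Exact design expectation: average over all equally likely samples."""
    samples = list(combinations(range(N), n))
    return sum(estimator(s) for s in samples) / len(samples)


if __name__ == "__main__":
    print(expectation(ht_estimate), expectation(multiplicity_estimate))
```

Both estimators give different values on a given sample, yet their expectations over the design both equal the true total, which is the sense in which different incidence weights can each yield an unbiased estimator.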

More ways of using the incidence structure of the BIG for estimation have been explored, and in doing so we enter completely new territory. We have investigated how to use the incidence structure of the BIG to estimate a total based on the sampling units and, once such an estimator has been obtained, we have discussed whether and how it can be used together with the IWE to improve inference. We have also seen that it is possible to use the reverse incidence weights in combination with the incidence weights. The weights obtained in this way can be used to construct an unbiased estimator in both directions, although the idea seems somewhat impractical at the moment.

The final chapter aims to offer a flavour of what can be done under the BIG framework and to inspire future research in this direction. The thesis is organised as four papers: the first discusses current statistical analysis using social media data, while the other three deal with graph sampling and estimation.

Text
Thesis Final MP - Version of Record
Available under License University of Southampton Thesis Licence.

More information

Published date: January 2020

Identifiers

Local EPrints ID: 443415
URI: http://eprints.soton.ac.uk/id/eprint/443415
PURE UUID: 7d0c2764-93e1-45aa-af73-8bf852569054
ORCID for Li-Chun Zhang: orcid.org/0000-0002-3944-9484

Catalogue record

Date deposited: 24 Aug 2020 16:35
Last modified: 17 Mar 2024 03:30

Contributors

Author: Martina Patone
Thesis advisor: Li-Chun Zhang
