Topics of statistical analysis with social media data
This thesis investigates the use of social media data in social research from a statistical perspective. A broad review is given of how these data have been used by researchers from different disciplines, and the extent and means of the investigations carried out with these data are assessed. Special attention is given to the common obstacles faced when using social media data for statistical analysis, to the graph representation of these data that is generally available to the researcher, and to the use of that representation for statistical inference.
Most of the literature on the use of social media data for statistical analysis is concerned with the fact that these data represent a non-random sample from the population of interest. We have instead highlighted another fundamental challenge presented by these data, one that is rarely taken explicitly into consideration: the object of sampling and the unit of interest might be distinct. To tackle this problem, we have shown how two different approaches to statistical inference can be distinguished in the literature. Under each approach, we have discussed the target of inference and made explicit the limitations of the approach in relation to the statistical methods used. Our exposition offers a framework for dealing with unruly data sources.
However, the problems of non-random sampling and of various unavoidable non-sampling errors do not admit a universally valid statistical approach. One can cope with them if need be, but one cannot really hope to solve them. Meanwhile, the graph structure inherent in social media data (and in other forms of big data) seems to us a more rewarding area of research.
We have investigated how to use the structure of the graph for estimation. The Horvitz-Thompson (HT) estimator operates by weighting each sample motif by the inverse of its inclusion probability. Generalising the work of Birnbaum and Sirken (1965), we have demonstrated that infinitely many types of incidence weights can be constructed for unbiased estimation. We define the Incidence Weighting Estimator (IWE) as a large class of linear, design-based unbiased estimators based on the edges of the Bipartite Incidence Graph (BIG), of which the HT estimator is a special case. This class of estimators has no equivalent in traditional list sampling.
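As a rough illustration of the two ideas named above, the following sketch compares the HT estimator with a multiplicity-type estimator in the spirit of Birnbaum and Sirken (1965), where each motif's value is shared equally among the sampling units that lead to it. Everything in it is an assumption made for illustration and is not taken from the thesis: the toy graph `EDGES`, the motif values `Y`, the simple random sampling design without replacement, and all function names. It checks design-unbiasedness by enumerating every possible sample.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

# Hypothetical bipartite incidence graph: sampling unit -> motifs it leads to.
EDGES = {"a": ["k1", "k2"], "b": ["k2"], "c": ["k2", "k3"], "d": ["k3"]}
# Hypothetical values attached to the motifs; the target is their total.
Y = {"k1": 5, "k2": 3, "k3": 8}
N, n = len(EDGES), 2  # assumed design: simple random sampling without replacement

# For each motif, the sampling units that can lead to its observation.
ANCESTORS = {k: [i for i, ks in EDGES.items() if k in ks] for k in Y}


def motif_inclusion_prob(k):
    """P(at least one ancestor of motif k is sampled) under the assumed design."""
    m = len(ANCESTORS[k])
    return 1 - Fraction(comb(N - m, n), comb(N, n))


def ht_estimate(sample):
    """HT: weight each observed motif by the inverse of its inclusion probability."""
    observed = {k for i in sample for k in EDGES[i]}
    return sum(Fraction(Y[k]) / motif_inclusion_prob(k) for k in observed)


def multiplicity_estimate(sample):
    """Edge-based estimator with equal-share weights 1/|ancestors of k| on each edge."""
    pi_unit = Fraction(n, N)  # inclusion probability of every sampling unit
    return sum(
        Fraction(Y[k], len(ANCESTORS[k])) / pi_unit
        for i in sample
        for k in EDGES[i]
    )


def design_expectation(estimator):
    """Average the estimator over all equally likely samples of size n."""
    samples = list(combinations(EDGES, n))
    return sum(estimator(s) for s in samples) / len(samples)


if __name__ == "__main__":
    print("true total:", sum(Y.values()))
    print("E[HT]:", design_expectation(ht_estimate))
    print("E[multiplicity]:", design_expectation(multiplicity_estimate))
```

Under these assumptions both expectations equal the true total exactly, although the two estimators generally differ sample by sample. The equal-share weights are only one choice: any fixed weights that sum to one over the ancestors of each motif would keep the edge-based estimator unbiased in this toy setting.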
More ways of using the incidence structure of the BIG for estimation have been explored, and in doing so we enter completely new territory. We have investigated how to use the incidence structure of the BIG to estimate a total based on the sampling units and, having obtained such an estimator, we have discussed whether and how it can be used together with the IWE to improve the inference. We have also seen that it is possible to use the reverse incidence weights in combination with the incidence weights. The weights obtained in this way can be used to construct an unbiased estimator in both directions, although the idea seems somewhat impractical at the moment.
The final chapter aims to offer a flavour of what can be done under the BIG framework and to inspire future research in this direction. The thesis is organised into four papers: the first discusses the statistical analyses currently carried out with social media data, while the other three deal with graph sampling and estimation.
University of Southampton
Patone, Martina
January 2020
Zhang, Li-Chun
Patone, Martina
(2020)
Topics of statistical analysis with social media data.
University of Southampton, Doctoral Thesis, 171pp.
Record type:
Thesis
(Doctoral)
More information
Published date: January 2020
Identifiers
Local EPrints ID: 443415
URI: http://eprints.soton.ac.uk/id/eprint/443415
PURE UUID: 7d0c2764-93e1-45aa-af73-8bf852569054
Catalogue record
Date deposited: 24 Aug 2020 16:35
Last modified: 17 Mar 2024 03:30
Contributors
Author:
Martina Patone