On proxy variables and categorical data fusion
The problem of inference about the joint distribution of two categorical variables based on knowledge or observations of their marginal distributions, to be referred to as categorical data fusion in this paper, is relevant in statistical matching, ecological inference, market research, and several other related fields. This article organizes the use of proxy variables, to be distinguished from other auxiliary variables, both in terms of their effects on the uncertainty of fusion and the techniques of fusion. A measure of the gains of efficiency is provided, which incorporates both the identification uncertainty associated with data fusion and the sampling uncertainty that arises when the theoretical bounds of the uncertainty space are unknown and need to be estimated. Several existing techniques for generating fusion distributions (or datasets) are described and some new ones proposed. Analysis of real-life data demonstrates empirically that proxy variables can make data fusion more precise and the constructed fusion distribution more plausible.
identification problem, sampling uncertainty, uncertainty analysis, fusion distribution, fusion data, proxy variable, relative efficiency
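As a rough illustration of the identification uncertainty the abstract refers to, the sketch below computes the classical Fréchet bounds on each cell of a joint distribution when only the two marginals are known. It illustrates the general idea only and is not the efficiency measure or any fusion technique from the article; the function names and the example marginals are hypothetical.

```python
def frechet_bounds(p, q):
    """Fréchet bounds for joint cell probabilities given marginals p and q.

    For each cell (i, j): max(0, p[i] + q[j] - 1) <= P(i, j) <= min(p[i], q[j]).
    """
    lower = [[max(0.0, pi + qj - 1.0) for qj in q] for pi in p]
    upper = [[min(pi, qj) for qj in q] for pi in p]
    return lower, upper


def independence_fusion(p, q):
    """One admissible joint distribution: the product of the marginals."""
    return [[pi * qj for qj in q] for pi in p]


# Hypothetical marginal distributions of two binary variables.
p = [0.7, 0.3]
q = [0.6, 0.4]

lower, upper = frechet_bounds(p, q)
indep = independence_fusion(p, q)

for i in range(len(p)):
    for j in range(len(q)):
        # Width of the interval of values the marginals alone cannot rule out.
        width = upper[i][j] - lower[i][j]
        print(f"cell ({i}, {j}): [{lower[i][j]:.2f}, {upper[i][j]:.2f}], width {width:.2f}")
```

Every joint distribution consistent with the marginals, including the independence solution, lies within these cell-wise intervals. Conditioning on an informative auxiliary variable would, in general, be expected to narrow them.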
783-807
Zhang, Li-Chun
a5d48518-7f71-4ed9-bdcb-6585c2da3649
16 December 2015
Zhang, Li-Chun (2015) On proxy variables and categorical data fusion. Journal of Official Statistics, 31 (4), 783-807. (doi:10.1515/jos-2015-0045).
Text
jos-2015-0045-published.pdf
- Version of Record
Available under License Other.
More information
Accepted/In Press date: 1 September 2015
Published date: 16 December 2015
Organisations:
Social Statistics & Demography
Identifiers
Local EPrints ID: 391010
URI: http://eprints.soton.ac.uk/id/eprint/391010
ISSN: 0282-423X
PURE UUID: 20e560f7-8d1b-4c3e-8563-53debf839695
Catalogue record
Date deposited: 06 Apr 2016 13:41
Last modified: 15 Mar 2024 03:45