Assessing identification risk in survey microdata using log-linear models
This article considers the assessment of the risk of identification of respondents in survey microdata, in the context of applications at the United Kingdom (UK) Office for National Statistics (ONS). The threat comes from the matching of categorical “key” variables between microdata records and external data sources and from the use of log-linear models to facilitate matching. While the potential use of such statistical models is well established in the literature, little consideration has been given to model specification or to the sensitivity of risk assessment to this specification. In numerical work not reported here, we have found that standard techniques for selecting log-linear models, such as chi-squared goodness-of-fit tests, provide little guidance regarding the accuracy of risk estimation for the very sparse tables generated by typical applications at ONS, for example, tables with millions of cells formed by cross-classifying six key variables, with sample sizes of 10,000 or 100,000. In this article we develop new criteria for assessing the specification of a log-linear model in relation to the accuracy of risk estimates. We find that, within a class of “reasonable” models, risk estimates tend to decrease as the complexity of the model increases. We develop criteria that detect “underfitting” (associated with overestimation of the risk). The criteria may also reveal “overfitting” (associated with underestimation), although not so clearly, so we suggest employing a forward model selection approach. Our criteria turn out to be related to established methods of testing for overdispersion in Poisson log-linear models. We show how our approach may be used for both file-level and record-level measures of risk. We evaluate the proposed procedures using samples drawn from the 2001 UK Census, where the true risks can be determined, and show that a forward selection approach leads to good risk estimates.
There are several “good” models between which our approach provides little discrimination. The risk estimates are found to be stable across these models, implying a form of robustness. We also apply our approach to a large survey dataset. There is no indication that increasing the sample size necessarily leads to the selection of a more complex model. The risk estimates for this application display more variation but suggest a suitable upper bound.
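The idea of model-based risk estimation summarised above can be illustrated with a minimal sketch. It is not the authors' code, and it rests on assumptions that this record does not state: population cell counts are taken to be Poisson, the sample is drawn with a known equal sampling fraction `pi`, and the log-linear model is the simplest one available, the independence (main effects only) model, whose fitted values have a closed form. The function names `independence_fit` and `risk_measures` are hypothetical. Under these assumptions, for a record that is unique in the sample, the estimated probability of being unique in the population is exp(-m), and the expected reciprocal of the population cell count is (1 - exp(-m))/m, where m is the fitted sample mean for the cell multiplied by (1 - pi)/pi. Summing these over sample uniques gives two file-level measures.

```python
import math
from collections import Counter


def independence_fit(sample):
    """Return a function giving fitted sample cell means under the
    independence log-linear model (closed form: n * product of marginal
    proportions)."""
    n = len(sample)
    n_vars = len(sample[0])
    # One marginal frequency table per key variable.
    margins = [Counter(rec[j] for rec in sample) for j in range(n_vars)]

    def mu(cell):
        value = float(n)
        for j, category in enumerate(cell):
            value *= margins[j][category] / n
        return value

    return mu


def risk_measures(sample, pi):
    """Estimate two file-level risk measures from sample uniques.

    sample: list of tuples of categorical key-variable values
    pi:     assumed known sampling fraction, 0 < pi <= 1
    Returns (tau1, tau2): the expected number of sample uniques that are
    population unique, and the expected number of correct matches for
    sample uniques.
    """
    counts = Counter(sample)
    mu = independence_fit(sample)
    tau1 = 0.0
    tau2 = 0.0
    for cell, f in counts.items():
        if f != 1:
            continue  # only sample-unique cells contribute
        lam = mu(cell) / pi      # estimated population cell mean
        m = lam * (1.0 - pi)     # mean of the unsampled part of the cell
        tau1 += math.exp(-m)
        # (1 - exp(-m)) / m tends to 1 as m tends to 0 (full enumeration).
        tau2 += -math.expm1(-m) / m if m > 0 else 1.0
    return tau1, tau2


if __name__ == "__main__":
    # Toy data: two key variables, four records, 50% sampling fraction.
    toy = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "x")]
    print(risk_measures(toy, pi=0.5))
```

In this sketch a richer model (for example, one adding two-way interactions) would change only the fitted means; comparing the resulting risk estimates across models of increasing complexity is the kind of sensitivity the abstract describes.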
Skinner, Chris and Shlomo, Natalie (2008) Assessing identification risk in survey microdata using log-linear models. Journal of the American Statistical Association, 103 (483), 989-1001. (doi:10.1198/016214507000001328).
More information
Published date: 1 September 2008
Identifiers
Local EPrints ID: 51966
URI: http://eprints.soton.ac.uk/id/eprint/51966
ISSN: 0162-1459
PURE UUID: 403fe58a-4957-4294-9546-8cdc27d41186
Catalogue record
Date deposited: 13 May 2009
Last modified: 15 Mar 2024 10:19
Contributors
Author: Chris Skinner
Author: Natalie Shlomo