Assessing identification risk in survey microdata using log-linear models
This article considers the assessment of the risk of identification of respondents in survey microdata, in the context of applications at the United Kingdom (UK) Office for National Statistics (ONS). The threat comes from the matching of categorical 'key' variables between microdata records and external data sources and from the use of log-linear models to facilitate matching. While the potential use of such statistical models is well established in the literature, little consideration has been given to model specification or to the sensitivity of risk assessment to this specification. In this article we develop new criteria for assessing the specification of a log-linear model in relation to the accuracy of risk estimates. We find that, within a class of 'reasonable' models, risk estimates tend to decrease as the complexity of the model increases. We develop criteria to detect 'underfitting' (associated with overestimation of the risk). The criteria may also reveal 'overfitting' (associated with underestimation), although not so clearly, so we suggest employing a forward model selection approach. We show how our approach may be used for both file-level and record-level measures of risk. We evaluate the proposed procedures using samples drawn from the 2001 UK Census, where the true risks can be determined. We also apply our approach to a large survey dataset.
confidentiality, disclosure, key variable, matching, model specification
University of Southampton, Southampton Statistical Sciences Research Institute
Skinner, Chris
dec5ef40-49ef-492a-8a1d-eb8c6315b8ce
Shlomo, Natalie
e749febc-b7b9-4017-be48-96d59dd03215
6 October 2006
Skinner, Chris and Shlomo, Natalie (2006) Assessing identification risk in survey microdata using log-linear models (S3RI Methodology Working Papers, M06/14). Southampton, UK: University of Southampton, Southampton Statistical Sciences Research Institute, 36pp.
Record type: Monograph (Working Paper)
Text: 41842-01.pdf (Author's Original)
More information
Published date: 6 October 2006
Identifiers
Local EPrints ID: 41842
URI: http://eprints.soton.ac.uk/id/eprint/41842
PURE UUID: 496fbaa8-5594-489d-b45c-02b83fb2bf37
Catalogue record
Date deposited: 06 Oct 2006
Last modified: 09 Nov 2021 08:29
Contributors
Author: Chris Skinner
Author: Natalie Shlomo