Assessing identification risk in survey microdata using log-linear models

This article considers the assessment of the risk of identification of respondents in survey microdata, in the context of applications at the United Kingdom (UK) Office for National Statistics (ONS). The threat comes from the matching of categorical “key” variables between microdata records and external data sources and from the use of log-linear models to facilitate matching. While the potential use of such statistical models is well established in the literature, little consideration has been given to model specification or to the sensitivity of risk assessment to this specification. In numerical work not reported here, we have found that standard techniques for selecting log-linear models, such as chi-squared goodness-of-fit tests, provide little guidance regarding the accuracy of risk estimation for the very sparse tables generated by typical applications at ONS, for example, tables with millions of cells formed by cross-classifying six key variables, with sample sizes of 10,000 or 100,000. In this article we develop new criteria for assessing the specification of a log-linear model in relation to the accuracy of risk estimates. We find that, within a class of “reasonable” models, risk estimates tend to decrease as the complexity of the model increases. We develop criteria that detect “underfitting” (associated with overestimation of the risk). The criteria may also reveal “overfitting” (associated with underestimation) although not so clearly, so we suggest employing a forward model selection approach. Our criteria turn out to be related to established methods of testing for overdispersion in Poisson log-linear models. We show how our approach may be used for both file-level and record-level measures of risk. We evaluate the proposed procedures using samples drawn from the 2001 UK Census where the true risks can be determined and show that a forward selection approach leads to good risk estimates. There are several “good” models between which our approach provides little discrimination. The risk estimates are found to be stable across these models, implying a form of robustness. We also apply our approach to a large survey dataset. There is no indication that increasing the sample size necessarily leads to the selection of a more complex model. The risk estimates for this application display more variation but suggest a suitable upper bound.
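As a rough illustration of how a Poisson log-linear model can yield file-level risk estimates, the sketch below fits the simplest candidate, a main-effects (independence) model, to a two-way table of sample counts and sums two commonly used risk quantities over the sample-unique cells. The abstract does not give these formulas, so everything here is an assumption made for illustration: the function names `independence_fit` and `risk_estimates`, the equal-probability sampling fraction `sampling_fraction`, and the Poisson assumption that the unsampled population count in a cell has mean `mu * (1 - pi) / pi`, where `mu` is the fitted sample mean. An independence model is likely to underfit real key variables, so this is a starting point for forward selection rather than a recommended specification.

```python
import numpy as np


def independence_fit(counts):
    """Fitted cell means under the independence log-linear model (two-way table)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    # Closed-form maximum likelihood fit: row total * column total / n.
    return row * col / n


def risk_estimates(counts, sampling_fraction):
    """Return two illustrative file-level risk estimates for sample uniques.

    tau1: expected number of sample uniques that are also population unique.
    tau2: expected number of correct matches for sample uniques.
    Both rest on the assumed Poisson model described in the lead-in.
    """
    counts = np.asarray(counts)
    pi = float(sampling_fraction)
    mu = independence_fit(counts)

    # Assumed mean of the unsampled population count in each cell.
    unsampled_mean = mu * (1.0 - pi) / pi

    # Only sample-unique cells (count == 1) contribute.
    x = unsampled_mean[counts == 1]

    # P(population count = 1 | sample count = 1) = exp(-x).
    p_unique = np.exp(-x)

    # E(1 / population count | sample count = 1) = (1 - exp(-x)) / x,
    # with limiting value 1 when x == 0 (for example, a full census).
    safe_x = np.where(x > 0, x, 1.0)
    p_match = np.where(x > 0, -np.expm1(-x) / safe_x, 1.0)

    return float(p_unique.sum()), float(p_match.sum())
```

For example, `risk_estimates([[1, 0], [0, 1]], 0.5)` treats both occupied cells as sample uniques and returns the two sums for that toy table. A richer model (adding interactions) would replace `independence_fit`, and the pattern described in the abstract is that such estimates tend to decrease as model complexity increases.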

10.1198/016214507000001328

0162-1459

989-1001

Skinner, Chris

dec5ef40-49ef-492a-8a1d-eb8c6315b8ce

Shlomo, Natalie

e749febc-b7b9-4017-be48-96d59dd03215

1 September 2008

Skinner, Chris and Shlomo, Natalie (2008) Assessing identification risk in survey microdata using log-linear models. Journal of the American Statistical Association, 103 (483), 989-1001. (doi:10.1198/016214507000001328).