Multiple imputation for missing data and statistical disclosure control for mixed-mode data using a sequence of generalised linear models

Lee, Min Cherng (2014) Multiple imputation for missing data and statistical disclosure control for mixed-mode data using a sequence of generalised linear models. University of Southampton, Mathematics, Doctoral Thesis, 194pp.

Record type: Thesis (Doctoral)

Abstract

Multiple imputation is a commonly used approach to deal with missing data and to protect the confidentiality of public use data sets. The basic idea is to replace the missing or sensitive values with multiple imputations and then release the multiply imputed data sets to the public. Users can analyze the multiply imputed data sets and obtain valid inferences by using simple combining rules, which take the uncertainty due to the presence of missing values and synthetic values into account. It is crucial that imputations are drawn from the posterior predictive distribution to preserve relationships present in the data and allow valid conclusions to be made from any analysis. In data sets with different types of variables, e.g. some categorical and some continuous variables, multivariate imputation by chained equations (MICE) (Van Buuren, 2011) is a commonly used multiple imputation method. However, imputations from such an approach are not necessarily drawn from a proper posterior predictive distribution. We propose a method, called the factored regression model (FRM), to multiply impute missing values in such data sets by modelling the joint distribution of the variables in the data through a sequence of generalised linear models.
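The abstract does not spell out the "simple combining rules" for the missing-data case. As a minimal sketch, assuming they are the standard Rubin rules for multiply imputed data, the point estimates and variances from the m completed data sets could be pooled as follows. The function name `combine_rubin` and its arguments are hypothetical and are not taken from the thesis.

```python
from statistics import mean, variance


def combine_rubin(estimates, variances):
    """Pool results from m multiply imputed data sets (Rubin's rules).

    estimates: point estimate of the same quantity from each data set
    variances: estimated variance of that estimate in each data set
    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two data sets and matching lengths")

    q_bar = mean(estimates)   # pooled point estimate
    u_bar = mean(variances)   # average within-imputation variance
    b = variance(estimates)   # between-imputation variance (divisor m - 1)

    # Total variance: within-imputation part plus inflated between part.
    total = u_bar + (1 + 1 / m) * b
    return q_bar, total
```

The between-imputation term is what carries the extra uncertainty due to the missing values. It is zero only when every completed data set gives the same estimate.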

We use data augmentation methods to connect the categorical and continuous variables and this allows us to draw imputations from a proper posterior distribution. We compare the performance of our method with MICE using simulation studies and on a breastfeeding data set. We also extend our modelling strategies to incorporate different informative priors for the FRM to explore robust regression modelling and the sparse relationships between the predictors. We then apply our model to protect the confidentiality of the Current Population Survey (CPS) data by generating multiply imputed, partially synthetic data sets. These data sets comprise a mix of original and synthetic data, where the values chosen for synthesis are based on an approach that considers unique and sensitive units in the survey. Valid inference can then be made using the combining rules described by Reiter (2003). An extension to the modelling strategy is also introduced to deal with the presence of spikes at zero in some of the continuous variables in the CPS data.
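For partially synthetic data, the combining rules of Reiter (2003) differ from the missing-data rules in how the between-data-set variance enters. The sketch below is a hedged illustration of those rules as they are commonly stated, not code from the thesis. The function name `combine_partially_synthetic` is hypothetical.

```python
from statistics import mean, variance


def combine_partially_synthetic(estimates, variances):
    """Pool results from m partially synthetic data sets.

    estimates: point estimate of the same quantity from each data set
    variances: estimated variance of that estimate in each data set
    Returns the pooled estimate, its variance, and the degrees of
    freedom of the reference t distribution.
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two data sets and matching lengths")

    q_bar = mean(estimates)   # pooled point estimate
    u_bar = mean(variances)   # average within-data-set variance
    b = variance(estimates)   # between-data-set variance (divisor m - 1)

    # Variance for partial synthesis: the between term is divided by m.
    t_p = u_bar + b / m

    # Degrees of freedom; infinite when the estimates do not vary.
    df = float("inf") if b == 0 else (m - 1) * (1 + u_bar / (b / m)) ** 2
    return q_bar, t_p, df
```

Interval estimates would then be based on a t distribution with `df` degrees of freedom, centred at the pooled estimate with variance `t_p`.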

More information

Published date: June 2014

Organisations: University of Southampton, Statistics

Identifiers

Local EPrints ID: 366481

URI: http://eprints.soton.ac.uk/id/eprint/366481

PURE UUID: 0d7f1704-8d4b-4386-b01f-b8a3e786aee5

Catalogue record

Date deposited: 15 Oct 2014 11:38

Last modified: 14 Mar 2024 17:09

Contributors

Author: Min Cherng Lee

Thesis advisor: Robin Mitra
