Multiple imputation for missing data and statistical
disclosure control for mixed-mode data using a
sequence of generalised linear models
Multiple imputation is a commonly used approach to deal with missing data and to protect the confidentiality of public use data sets. The basic idea is to replace the missing or sensitive values with multiple imputed values and then release the multiply imputed data sets to the public. Users can analyse the multiply imputed data sets and obtain valid inferences by using simple combining rules, which take into account the uncertainty due to the presence of missing values and synthetic values. It is crucial that imputations are drawn from the posterior predictive distribution to preserve relationships present in the data and allow valid conclusions to be made from any analysis. In data sets with different types of variables, e.g. some categorical and some continuous, multivariate imputation by chained equations (MICE) (Van Buuren (2011)) is a commonly used multiple imputation method. However, imputations from such an approach are not necessarily drawn from a proper posterior predictive distribution. We propose a method, called the factored regression model (FRM), to multiply impute missing values in such data sets by modelling the joint distribution of the variables in the data through a sequence of generalised linear models.
We use data augmentation methods to connect the categorical and continuous variables, which allows us to draw imputations from a proper posterior distribution. We compare the performance of our method with MICE using simulation studies and a breastfeeding data set. We also extend our modelling strategies to incorporate different informative priors for the FRM, to explore robust regression modelling and sparse relationships between the predictors. We then apply our model to protect the confidentiality of the Current Population Survey (CPS) data by generating multiply imputed, partially synthetic data sets. These data sets comprise a mix of original and synthetic data, where the values chosen for synthesis are based on an approach that considers unique and sensitive units in the survey. Valid inference can then be made using the combining rules described by Reiter (2003). An extension to the modelling strategy is also introduced to deal with the presence of spikes at zero in some of the continuous variables in the CPS data.
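As a rough illustration of the combining rules mentioned in the abstract, the sketch below pools a scalar estimate across m imputed data sets. It uses the standard textbook forms: Rubin's total variance for missing-data imputation and the partially synthetic variance of Reiter (2003). The function name `combine_estimates` and its arguments are hypothetical and are not taken from the thesis, which may present the rules differently.

```python
from statistics import mean, variance


def combine_estimates(point_estimates, within_variances, partially_synthetic=False):
    """Pool one scalar estimate across m imputed data sets (illustrative sketch).

    point_estimates  : the estimate computed on each imputed data set
    within_variances : the estimated variance of that estimate on each data set
    Returns (pooled estimate, pooled variance).
    """
    m = len(point_estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need at least two data sets and matching lengths")

    q_bar = mean(point_estimates)    # pooled point estimate
    u_bar = mean(within_variances)   # average within-imputation variance
    b = variance(point_estimates)    # between-imputation variance (m - 1 denominator)

    if partially_synthetic:
        # Partially synthetic data: T = u_bar + b / m
        total = u_bar + b / m
    else:
        # Missing-data imputation: T = u_bar + (1 + 1/m) * b
        total = u_bar + (1 + 1 / m) * b
    return q_bar, total


if __name__ == "__main__":
    estimates = [1.0, 2.0, 3.0]
    variances = [0.5, 0.5, 0.5]
    print(combine_estimates(estimates, variances))
    print(combine_estimates(estimates, variances, partially_synthetic=True))
```

The two branches differ only in how the between-imputation variance enters the total, which is why the same analysis code can serve both the missing-data and the disclosure-control settings.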
Lee, Min Cherng
June 2014
Mitra, Robin
Lee, Min Cherng (2014) Multiple imputation for missing data and statistical disclosure control for mixed-mode data using a sequence of generalised linear models. University of Southampton, Mathematics, Doctoral Thesis, 194pp.
Record type: Thesis (Doctoral)
More information
Published date: June 2014
Organisations: University of Southampton, Statistics
Identifiers
Local EPrints ID: 366481
URI: http://eprints.soton.ac.uk/id/eprint/366481
PURE UUID: 0d7f1704-8d4b-4386-b01f-b8a3e786aee5
Catalogue record
Date deposited: 15 Oct 2014 11:38
Last modified: 14 Mar 2024 17:09
Contributors
Author: Min Cherng Lee