The use of classification tree techniques for missing item imputation

Mesa-Avila, Dulce M (2002) The use of classification tree techniques for missing item imputation. University of Southampton, Doctoral Thesis.

Record type: Thesis (Doctoral)

Abstract

This thesis compares classification-based methods for imputing item non-response in census data. The imputation strategy is divided into two steps. First, the data set is classified using a tree-based technique; second, the missing values are imputed using several established imputation methods.

The "Classification and Regression Tree" (CART) technique used for tree-based modelling is a set of classification rules (recursive binary segmentation) that partition the data set into exhaustive, non-overlapping subsets (terminal nodes) based on the values of a group of explanatory variables. These subsets are expected to be internally more homogeneous with respect to the response variable (the variable for which the tree is generated) than the whole database. Once the classification is made, each imputation method is applied independently within each terminal node. Three common imputation methods for categorical data are used.
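The two-step idea described above (terminal nodes used as imputation cells, then imputation applied independently within each cell) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function `impute_within_cells`, the hand-written `cell_of` rules standing in for a fitted tree, and the toy records are all hypothetical, and the imputation shown is a simple draw from the observed frequency distribution of the target variable within each cell.

```python
import random
from collections import Counter, defaultdict


def impute_within_cells(records, target, cell_of, rng):
    """Fill missing `target` values by sampling from the observed
    frequency distribution of `target` inside each imputation cell."""
    # Count observed (donor) values of the target variable per cell.
    donors = defaultdict(Counter)
    for rec in records:
        if rec[target] is not None:
            donors[cell_of(rec)][rec[target]] += 1

    imputed = []
    for rec in records:
        rec = dict(rec)  # copy so the input records are not modified
        if rec[target] is None:
            counts = donors[cell_of(rec)]
            if counts:  # leave the gap if the cell has no donors
                categories = list(counts)
                weights = [counts[c] for c in categories]
                rec[target] = rng.choices(categories, weights=weights, k=1)[0]
        imputed.append(rec)
    return imputed


# Hypothetical terminal-node rules standing in for a fitted tree.
def cell_of(rec):
    if rec["age"] < 16:
        return "node_child"
    return "node_adult_urban" if rec["urban"] else "node_adult_rural"


# Toy data (assumed for illustration); None marks item non-response.
records = [
    {"age": 10, "urban": True, "status": "student"},
    {"age": 12, "urban": False, "status": None},
    {"age": 40, "urban": True, "status": "employed"},
    {"age": 35, "urban": True, "status": None},
    {"age": 50, "urban": False, "status": "retired"},
]
completed = impute_within_cells(records, "status", cell_of, random.Random(0))
```

In practice the cell assignment would come from the terminal nodes of a fitted tree rather than from hand-written rules; passing it in as a function keeps the imputation step independent of how the cells were formed.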

The combination of classification and imputation makes it possible to assess two aspects: 1) the effect of using this classification technique on the imputation results (including the use of different tree sizes), and 2) the accuracy of the different imputation methods based on this classification technique.

Some general conclusions are obtained from the simulations: 1) using the classification tree to create imputation cells before the imputation is carried out does improve the imputation results, although the size of the tree does not have a major impact on the results; 2) most of the imputation procedures used in the simulation produce unbiased estimates of the total and of the variance, and they also achieve very high coverage and low relative Mean Square Error; 3) in general, the best performing method is the Frequency Distribution method (even when compared with Sequential Hot Deck imputation).
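The abstract does not state how bias, relative Mean Square Error and coverage were computed, so the following Python sketch uses common textbook definitions as an assumption: relative bias as the mean error divided by the true total, relative MSE as the mean squared error divided by the squared true total, and coverage as the share of simulation replicates whose normal-theory interval contains the true total. The function name `simulation_summary` and the example numbers are hypothetical.

```python
import math


def simulation_summary(estimates, variances, true_total, z=1.96):
    """Summarise simulation replicates of an estimated total.

    estimates  -- estimated total from each replicate
    variances  -- estimated variance of the total from each replicate
    true_total -- known population total
    z          -- normal quantile for the interval (1.96 for about 95%)
    """
    n = len(estimates)
    mean_est = sum(estimates) / n
    rel_bias = (mean_est - true_total) / true_total
    mse = sum((e - true_total) ** 2 for e in estimates) / n
    rel_mse = mse / true_total ** 2  # assumed normalisation
    # A replicate "covers" when the true total lies within z standard errors.
    covered = sum(
        1
        for e, v in zip(estimates, variances)
        if abs(e - true_total) <= z * math.sqrt(v)
    )
    return {
        "relative_bias": rel_bias,
        "relative_mse": rel_mse,
        "coverage": covered / n,
    }


# Made-up replicate results, for illustration only.
summary = simulation_summary(
    estimates=[98.0, 102.0, 100.0, 105.0],
    variances=[9.0, 9.0, 9.0, 4.0],
    true_total=100.0,
)
```

Under these definitions, an unbiased procedure shows a relative bias near zero, and a nominal 95% interval should give coverage close to 0.95 across many replicates.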

Text

842317.pdf - Version of Record

Available under License University of Southampton Thesis Licence.
