The use of classification tree techniques for missing item imputation
This thesis compares classification-based methods for imputing item non-response in census data. The imputation strategy is divided into two steps: first, the data set is classified using a tree-based technique; second, the missing values are imputed using established imputation methods.
The "Classification and Regression Tree" (CART) technique used for tree-based modelling is essentially a set of classification rules (recursive binary segmentation) that partition the data set into exhaustive and non-overlapping subsets (terminal nodes) based on the values of a group of explanatory variables. These subsets are expected to be internally more homogeneous with respect to the response variable (the variable for which the tree is generated) than the whole database. Once the classification is made, each imputation method is applied independently within each terminal node. Three common imputation methods for categorical data are used.
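As a rough illustration of the two-step idea described above, the following sketch chooses one binary split on a numeric explanatory variable by minimising the weighted Gini impurity, then fills missing values by drawing a donor from the same terminal node. This is not the thesis's implementation: the function names, the toy records, the single-split "tree", and the random donor draw are all assumptions made for illustration, and the three imputation methods actually used in the thesis are not reproduced here.

```python
import random
from collections import Counter, defaultdict


def gini(labels):
    """Gini impurity of a list of category labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_binary_split(rows, target, predictor):
    """Return (threshold, weighted_impurity) for the best split on a numeric
    predictor; threshold is None if no split improves on the unsplit node."""
    values = sorted({r[predictor] for r in rows})
    best = (None, gini([r[target] for r in rows]))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate threshold between adjacent values
        left = [r[target] for r in rows if r[predictor] <= t]
        right = [r[target] for r in rows if r[predictor] > t]
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if w < best[1]:
            best = (t, w)
    return best


def impute_within_nodes(rows, target, node_of, rng):
    """Fill missing target values (None) with a donor value drawn at random
    from respondents in the same terminal node (hypothetical stand-in method)."""
    donors = defaultdict(list)
    for r in rows:
        if r[target] is not None:
            donors[node_of(r)].append(r[target])
    completed = []
    for r in rows:
        r = dict(r)  # copy so the input records are left unchanged
        if r[target] is None and donors[node_of(r)]:
            r[target] = rng.choice(donors[node_of(r)])
        completed.append(r)
    return completed


# Toy records (invented for illustration); None marks item non-response.
rows = [
    {"age": 15, "status": "student"},
    {"age": 17, "status": "student"},
    {"age": 19, "status": None},
    {"age": 40, "status": "employed"},
    {"age": 52, "status": "employed"},
    {"age": 45, "status": None},
]

# Step 1: grow the (one-split) tree on respondents only.
observed = [r for r in rows if r["status"] is not None]
threshold, impurity = best_binary_split(observed, "status", "age")

# Step 2: impute independently within each terminal node.
imputed = impute_within_nodes(
    rows, "status", lambda r: r["age"] <= threshold, random.Random(0)
)
```

A full CART procedure would apply the split search recursively to each resulting subset and prune the tree; the single split here is only meant to show how terminal nodes act as imputation cells.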
The combination of classification and imputation makes it possible to assess two aspects: 1) the effect of using this classification technique on the imputation results (including the use of different tree sizes), and 2) the accuracy of the different imputation methods based on this classification technique.
Some general conclusions are obtained from the simulations: 1) using the classification tree to create imputation cells before the imputation is carried out does improve the imputation results, although the size of the tree does not have a major impact on the results; 2) most of the imputation procedures used in the simulation produce unbiased estimates of the total and of the variance and, additionally, have very high coverage values as well as low values of the relative Mean Square Error; 3) in general, the best performing method is the Frequency Distribution method (even when compared with Sequential Hot Deck imputation).
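The evaluation criteria named in these conclusions (bias, coverage, relative Mean Square Error) can be summarised over simulation replicates roughly as follows. This is a minimal sketch under stated assumptions: the function name, the normal-approximation interval with z = 1.96, the definition of relative MSE as MSE divided by the squared true total, and the numbers in the usage example are all illustrative and are not taken from the thesis.

```python
import math


def evaluate_replicates(estimates, variances, true_total, z=1.96):
    """Summarise replicate estimates of a total against its known true value.

    estimates  -- point estimate of the total from each simulated replicate
    variances  -- variance estimate attached to each point estimate
    true_total -- the known population total used in the simulation
    """
    n = len(estimates)
    mean_estimate = sum(estimates) / n
    relative_bias = (mean_estimate - true_total) / true_total
    mse = sum((e - true_total) ** 2 for e in estimates) / n
    relative_mse = mse / true_total ** 2  # assumed definition
    # Coverage: share of replicates whose interval contains the true total.
    covered = sum(
        1
        for e, v in zip(estimates, variances)
        if abs(e - true_total) <= z * math.sqrt(v)
    )
    return {
        "relative_bias": relative_bias,
        "relative_mse": relative_mse,
        "coverage": covered / n,
    }


# Invented replicate results, for illustration only.
summary = evaluate_replicates(
    estimates=[98.0, 102.0, 100.0, 104.0],
    variances=[4.0, 4.0, 4.0, 4.0],
    true_total=100.0,
)
```

In a real simulation study the number of replicates would be far larger, and the coverage would be compared against the nominal level implied by the chosen z value.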
University of Southampton
Mesa-Avila, Dulce M
32e5fdd5-513f-42a1-bedd-adc508fdb958
2002
Mesa-Avila, Dulce M (2002) The use of classification tree techniques for missing item imputation. University of Southampton, Doctoral Thesis.
Record type: Thesis (Doctoral)
Text: 842317.pdf (Version of Record)
More information
Published date: 2002
Identifiers
Local EPrints ID: 464648
URI: http://eprints.soton.ac.uk/id/eprint/464648
PURE UUID: 767e3e91-1f8e-4081-a108-b1dbfd0971ce
Catalogue record
Date deposited: 04 Jul 2022 23:53
Last modified: 16 Mar 2024 19:40
Contributors
Author: Dulce M Mesa-Avila