The University of Southampton
University of Southampton Institutional Repository

The use of classification tree techniques for missing item imputation

This thesis compares classification-based methods for imputing item non-response in census data. The imputation strategy is divided into two steps: first, the data set is classified using a tree-based technique; second, the missing values are imputed using some of the known imputation methods.

The "Classification and Regression Tree" (CART) technique used for tree-based modelling is essentially a set of classification rules, obtained by recursive binary segmentation, that partition the data set into exhaustive and non-overlapping subsets (terminal nodes) based on the values of a group of explanatory variables. These subsets are expected to be internally more homogeneous with respect to the response variable (the variable for which the tree is generated) than the data set as a whole. Once the classification is made, each imputation method is applied independently within each terminal node. Three common imputation methods for categorical data are used.
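The two-step idea can be sketched in a few lines of Python. This is an illustration only and not the thesis's implementation. It assumes a single binary split chosen by Gini impurity (one common CART criterion) rather than a full recursive tree. It also assumes that frequency-distribution imputation means drawing a category at random in proportion to the observed categories within the same cell. The field names `age` and `status` and the helper names `gini`, `best_split` and `impute_within_cells` are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of category labels (0.0 for a pure or empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, target, feature):
    """Return (weighted impurity, threshold) for the best binary split on one
    numeric feature, or None if the feature takes fewer than two values."""
    values = sorted({r[feature] for r in rows})
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r[target] for r in rows if r[feature] <= threshold]
        right = [r[target] for r in rows if r[feature] > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or score < best[0]:
            best = (score, threshold)
    return best


def impute_within_cells(rows, target, cell_of, rng):
    """Fill missing target values by drawing from the respondents (donors)
    in the same cell. Assumes every cell contains at least one donor."""
    donors = {}
    for r in rows:
        if r[target] is not None:
            donors.setdefault(cell_of(r), []).append(r[target])
    completed = []
    for r in rows:
        filled = dict(r)  # copy so the input records are left unchanged
        if filled[target] is None:
            # Choosing uniformly from the donor list reproduces the
            # observed category proportions of the cell.
            filled[target] = rng.choice(donors[cell_of(r)])
        completed.append(filled)
    return completed


# Toy records; None marks item non-response on the response variable.
rows = [
    {"age": 15, "status": "student"},
    {"age": 17, "status": "student"},
    {"age": 18, "status": None},
    {"age": 35, "status": "employed"},
    {"age": 42, "status": "employed"},
    {"age": 50, "status": None},
]

# Step 1: build the imputation cells from the respondents only.
respondents = [r for r in rows if r["status"] is not None]
_, threshold = best_split(respondents, "status", "age")

# Step 2: impute independently within each cell (terminal node).
completed = impute_within_cells(
    rows, "status", lambda r: r["age"] <= threshold, random.Random(0)
)
```

In this toy data each cell contains a single observed category, so the imputed values do not depend on the random seed. With mixed cells the draw would vary from run to run, and a cell with no respondents would need a fallback rule that this sketch leaves out.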

Combining classification and imputation makes it possible to assess two aspects: 1) the effect of using this classification technique on the imputation results (including the use of different tree sizes), and 2) the accuracy of the different imputation methods based on this classification technique.

The simulations lead to some general conclusions: 1) using the classification tree to create imputation cells before the imputation is carried out does improve the imputation results, although the size of the tree does not have a major impact on the results; 2) most of the imputation procedures used in the simulation produce unbiased estimates of the total and of the variance, with very high coverage and low relative mean square error; 3) in general, the best-performing method is the Frequency Distribution method (even when compared with Sequential Hot Deck imputation).

University of Southampton
Mesa-Avila, Dulce M
32e5fdd5-513f-42a1-bedd-adc508fdb958

Mesa-Avila, Dulce M (2002) The use of classification tree techniques for missing item imputation. University of Southampton, Doctoral Thesis.

Record type: Thesis (Doctoral)

Text
842317.pdf - Version of Record
Available under License University of Southampton Thesis Licence.
Download (22MB)

More information

Published date: 2002

Identifiers

Local EPrints ID: 464648
URI: http://eprints.soton.ac.uk/id/eprint/464648
PURE UUID: 767e3e91-1f8e-4081-a108-b1dbfd0971ce

Catalogue record

Date deposited: 04 Jul 2022 23:53
Last modified: 16 Mar 2024 19:40

Contributors

Author: Dulce M Mesa-Avila


