Maximum entropy classification for record linkage
Lee, Danhyang, Zhang, Li-Chun and Kim, Jae-Kwang
(2021)
Maximum entropy classification for record linkage.
Survey Methodology.
(In Press)
Abstract
Record linkage joins records that reside in separate files and are believed to relate to the same entity. In this paper we approach record linkage as a classification problem and adapt the maximum entropy classification method from machine learning to it, in both the supervised and unsupervised settings. The set of links is chosen according to the associated uncertainty. On the one hand, our framework overcomes some persistent theoretical flaws of the classical approach pioneered by Fellegi and Sunter (1969); on the other hand, the proposed algorithm is fully automatic, unlike the classical approach, which generally requires clerical review to resolve the undecided cases.
Text: main_revision_R2_V2 (Accepted Manuscript)
More information
Accepted/In Press date: 12 November 2021
Keywords:
Probabilistic linkage, Density ratio, False link, Missing match, Survey sampling
Identifiers
Local EPrints ID: 452261
URI: http://eprints.soton.ac.uk/id/eprint/452261
ISSN: 0714-0045
PURE UUID: c33bfc86-c929-4e34-9393-505b96a7cff4
Catalogue record
Date deposited: 02 Dec 2021 17:33
Last modified: 17 Mar 2024 03:30
Contributors
Author:
Danhyang Lee
Author:
Li-Chun Zhang
Author:
Jae-Kwang Kim