University of Southampton Institutional Repository

Maximum entropy classification for record linkage

Lee, Danhyang, Zhang, Li-Chun and Kim, Jae-Kwang (2021) Maximum entropy classification for record linkage. Survey Methodology. (In Press)

Record type: Article

Abstract

By record linkage one joins records residing in separate files which are believed to be related to the same entity. In this paper we approach record linkage as a classification problem, and adapt the maximum entropy classification method in machine learning to record linkage, both in the supervised and unsupervised settings of machine learning. The set of links will be chosen according to the associated uncertainty. On the one hand, our framework overcomes some persistent theoretical flaws of the classical approach pioneered by Fellegi and Sunter (1969); on the other hand, the proposed algorithm is fully automatic, unlike the classical approach that generally requires clerical review to resolve the undecided cases.
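For context on the "undecided cases" mentioned above, the classical Fellegi-Sunter rule can be sketched as follows. This is only an illustration of that classical three-way decision, not the maximum entropy method proposed in the paper. The field names, the m- and u-probabilities, and the two thresholds are hypothetical values chosen for the example.

```python
import math

# Hypothetical m- and u-probabilities per comparison field (illustrative only):
# m = P(field agrees | pair is a true match), u = P(field agrees | pair is a non-match).
M_U = {
    "surname": (0.95, 0.01),
    "birth_year": (0.90, 0.05),
    "postcode": (0.85, 0.02),
}


def match_weight(agreement):
    """Sum the log-likelihood ratios over all fields for one record pair.

    `agreement` maps each field name to True (values agree) or False.
    """
    total = 0.0
    for field, agrees in agreement.items():
        m, u = M_U[field]
        if agrees:
            total += math.log(m / u)
        else:
            total += math.log((1 - m) / (1 - u))
    return total


def classify(weight, lower=-2.0, upper=5.0):
    """Three-way decision; the thresholds here are assumed, not estimated."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    # Pairs between the two thresholds are the undecided cases
    # that the classical approach sends to clerical review.
    return "clerical review"


if __name__ == "__main__":
    pairs = {
        "all fields agree": {"surname": True, "birth_year": True, "postcode": True},
        "only surname agrees": {"surname": True, "birth_year": False, "postcode": False},
        "no field agrees": {"surname": False, "birth_year": False, "postcode": False},
    }
    for label, agreement in pairs.items():
        w = match_weight(agreement)
        print(f"{label}: weight={w:.2f}, decision={classify(w)}")
```

In this sketch the pair in which only the surname agrees falls between the two thresholds, which is the kind of case that the classical approach leaves to clerical review and that the abstract says the proposed algorithm resolves automatically.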

Text
main_revision_R2_V2 - Accepted Manuscript

More information

Accepted/In Press date: 12 November 2021
Keywords: Probabilistic linkage, Density ratio, False link, Missing match, Survey sampling

Identifiers

Local EPrints ID: 452261
URI: http://eprints.soton.ac.uk/id/eprint/452261
ISSN: 0714-0045
PURE UUID: c33bfc86-c929-4e34-9393-505b96a7cff4
ORCID for Li-Chun Zhang: orcid.org/0000-0002-3944-9484

Catalogue record

Date deposited: 02 Dec 2021 17:33
Last modified: 17 Mar 2024 03:30

Contributors

Author: Danhyang Lee
Author: Li-Chun Zhang
Author: Jae-Kwang Kim
