The University of Southampton
University of Southampton Institutional Repository

Linkage-data linear regression

Zhang, Li-Chun and Tuoto, Tiziana (2020) Linkage-data linear regression. Journal of the Royal Statistical Society: Series A (Statistics in Society). (doi:10.1111/rssa.12630).

Record type: Article

Abstract

Data linkage is increasingly being used to combine data from different sources with the aim of identifying and bringing together records from separate files, which correspond to the same entities. Usually, data linkage is not a trivial procedure and linkage errors, false and missed links, are unavoidable. In these cases, standard statistical techniques may produce misleading inference. In this paper, we propose a method for secondary linear regression analysis, where the linked data have to be prepared by someone else, and neither the match-key variables nor the unlinked records are available to the analyst. We also develop a diagnostic test for the assumption of non-informative linkage errors, which is required for all existing secondary analysis adjustment methods. Our approach provides important advantages: it relies on the realistic assumption that the probabilities of correct linkage vary across the records but it does not assume that one is able to estimate the probability of correct linkage for each individual record. Moreover, it accommodates in a simple manner the general situation where the files are of different sizes and none of them is a subset of another. The proposed methodology of adjustment and testing is studied by simulation and applied to real data.
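
The abstract's point that standard techniques may produce misleading inference under linkage errors can be illustrated with a small simulation. The sketch below is not the adjustment method or the diagnostic test proposed in the paper; it only shows, under assumed settings, how false links pull a naive least-squares slope towards zero. The function name `simulate_naive_ols`, the true coefficients (intercept 1, slope 2), the sample size, and the false-link rate are all hypothetical choices made for illustration, and the false links are assumed to be non-informative (a random shuffle of the response among the affected records).

```python
import numpy as np


def simulate_naive_ols(n=5000, false_link_rate=0.2, seed=0):
    """Compare OLS slopes on correctly linked and partly mislinked data.

    All settings here are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical "true" population: y = 1 + 2x + noise.
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

    # Simulate false links: pick a subset of records and shuffle their
    # y-values among themselves, independently of x and y.
    n_false = int(false_link_rate * n)
    wrong = rng.choice(n, size=n_false, replace=False)
    y_linked = y.copy()
    y_linked[wrong] = y[rng.permutation(wrong)]

    # np.polyfit with degree 1 returns [slope, intercept].
    slope_true = np.polyfit(x, y, 1)[0]
    slope_linked = np.polyfit(x, y_linked, 1)[0]
    return slope_true, slope_linked


if __name__ == "__main__":
    slope_true, slope_linked = simulate_naive_ols()
    print(f"slope with perfect linkage: {slope_true:.3f}")
    print(f"slope with false links:     {slope_linked:.3f}")
```

With these assumed settings, the naive slope on the mislinked data should come out noticeably below the slope obtained under perfect linkage, roughly in proportion to the share of correctly linked records.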

Text
Linkage data linear regression final - Accepted Manuscript
Download (559kB)

More information

Accepted/In Press date: 28 September 2020
e-pub ahead of print date: 11 November 2020
Additional Information: Publisher Copyright: © 2020 Royal Statistical Society
Keywords: data integration, diagnostic test, linkage error, method of least squares, record linkage

Identifiers

Local EPrints ID: 442296
URI: http://eprints.soton.ac.uk/id/eprint/442296
ISSN: 0964-1998
PURE UUID: a340fea7-8d1c-4375-a7be-e274794d0911
ORCID for Li-Chun Zhang: orcid.org/0000-0002-3944-9484

Catalogue record

Date deposited: 13 Jul 2020 16:30
Last modified: 17 Mar 2024 05:44

Contributors

Author: Li-Chun Zhang
Author: Tiziana Tuoto
