Linkage-data linear regression
Data linkage is increasingly used to combine data from different sources, with the aim of identifying and bringing together records from separate files that correspond to the same entities. Data linkage is usually not a trivial procedure, and linkage errors (false links and missed links) are unavoidable. In such cases, standard statistical techniques may produce misleading inference. In this paper, we propose a method for secondary linear regression analysis, where the linked data are prepared by someone else and neither the match-key variables nor the unlinked records are available to the analyst. We also develop a diagnostic test for the assumption of non-informative linkage errors, which is required by all existing adjustment methods for secondary analysis. Our approach has two important advantages. First, it relies on the realistic assumption that the probabilities of correct linkage vary across records, but it does not assume that one can estimate the probability of correct linkage for each individual record. Second, it accommodates in a simple manner the general situation where the files are of different sizes and none of them is a subset of another. The proposed methodology of adjustment and testing is studied by simulation and applied to real data.
data integration, diagnostic test, linkage error, method of least squares, record linkage
Zhang, Li-Chun
a5d48518-7f71-4ed9-bdcb-6585c2da3649
Tuoto, Tiziana
35bc017d-1c9a-42a0-8ff2-9f5b425fdcb2
Zhang, Li-Chun and Tuoto, Tiziana (2020) Linkage-data linear regression. Journal of the Royal Statistical Society: Series A (Statistics in Society). (doi:10.1111/rssa.12630).
Text: Linkage data linear regression final (Accepted Manuscript)
More information
Accepted/In Press date: 28 September 2020
e-pub ahead of print date: 11 November 2020
Additional Information:
Publisher Copyright:
© 2020 Royal Statistical Society
Identifiers
Local EPrints ID: 442296
URI: http://eprints.soton.ac.uk/id/eprint/442296
ISSN: 0964-1998
PURE UUID: a340fea7-8d1c-4375-a7be-e274794d0911
Catalogue record
Date deposited: 13 Jul 2020 16:30
Last modified: 17 Mar 2024 05:44
Contributors
Author:
Tiziana Tuoto