The University of Southampton
University of Southampton Institutional Repository

Consequences of ignoring clustering in linear regression

Background: clustering of observations is a common phenomenon in epidemiological and clinical research. Previous studies have highlighted the importance of using multilevel analysis to account for such clustering, but in practice, methods ignoring clustering are often employed. We used simulated data to explore the circumstances in which failure to account for clustering in linear regression could lead to importantly erroneous conclusions.

Methods: we simulated data following the random-intercept model specification under different scenarios of clustering of a continuous outcome and a single continuous or binary explanatory variable. We fitted random-intercept (RI) and ordinary least squares (OLS) models and compared effect estimates with the “true” value that had been used in simulation. We also assessed the relative precision of effect estimates, and explored the extent to which coverage by 95% confidence intervals and Type I error rates were appropriate.
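For orientation, a random-intercept specification of the kind referred to above is conventionally written as follows. The symbols are standard textbook notation and are assumed here, since the abstract itself does not define any: $y_{ij}$ is the outcome for observation $i$ in cluster $j$, $x_{ij}$ the explanatory variable, $u_j$ the cluster-level random intercept, and $e_{ij}$ the individual-level error.

```latex
y_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + e_{ij},
\qquad u_j \sim N(0, \sigma_u^2), \quad e_{ij} \sim N(0, \sigma_e^2)

\mathrm{ICC} = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2}
```

Under this notation, an OLS fit corresponds to dropping $u_j$ from the model, and the intraclass correlation coefficient (ICC) measures the share of the residual variance that lies between clusters.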

Results: we found that effect estimates from both types of regression model were on average unbiased. However, deviations from the “true” value were greater when the outcome variable was more clustered. For a continuous explanatory variable, they tended also to be greater for the OLS than the RI model, and when the explanatory variable was less clustered. The precision of effect estimates from the OLS model was overestimated when the explanatory variable varied more between than within clusters, and was somewhat underestimated when the explanatory variable was less clustered. The cluster-unadjusted model gave poor coverage rates by 95% confidence intervals and high Type I error rates when the explanatory variable was continuous. With a binary explanatory variable, coverage rates by 95% confidence intervals and Type I error rates deviated from nominal values when the outcome variable was more clustered, but the direction of the deviation varied according to the overall prevalence of the explanatory variable, and the extent to which it was clustered.

Conclusions: in this study we identified circumstances in which application of an OLS regression model to clustered data is more likely to mislead statistical inference. The potential for error is greatest when the explanatory variable is continuous, and the outcome variable more clustered (intraclass correlation coefficient is ≥0.01).
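The kind of simulation described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the unit total variances, the number and size of clusters, the value of the "true" slope, and the normal-approximation 95% interval are all hypothetical choices made here. Only the cluster-unadjusted (OLS) fit is shown; fitting the random-intercept model would need a mixed-model library.

```python
import math
import random


def simulate_clustered(n_clusters, cluster_size, icc_x, icc_y, beta, rng):
    """Draw (x, y) pairs from a random-intercept model.

    Assumption: x and the residual part of y each have total variance 1,
    so icc_x and icc_y are the between-cluster shares of those variances.
    """
    xs, ys = [], []
    for _ in range(n_clusters):
        x_cluster = rng.gauss(0.0, math.sqrt(icc_x))  # cluster-level part of x
        u_cluster = rng.gauss(0.0, math.sqrt(icc_y))  # random intercept u_j
        for _ in range(cluster_size):
            x = x_cluster + rng.gauss(0.0, math.sqrt(1.0 - icc_x))
            y = beta * x + u_cluster + rng.gauss(0.0, math.sqrt(1.0 - icc_y))
            xs.append(x)
            ys.append(y)
    return xs, ys


def ols_slope(xs, ys):
    """Simple linear regression ignoring clustering: slope and naive SE."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se


def ols_coverage(icc_x, icc_y, beta=0.5, n_clusters=30, cluster_size=20,
                 n_reps=300, seed=1):
    """Share of replicates whose naive 95% CI contains the true slope."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_reps):
        xs, ys = simulate_clustered(n_clusters, cluster_size,
                                    icc_x, icc_y, beta, rng)
        slope, se = ols_slope(xs, ys)
        if abs(slope - beta) <= 1.96 * se:  # normal approximation
            hits += 1
    return hits / n_reps


if __name__ == "__main__":
    print("no clustering:      ", ols_coverage(icc_x=0.0, icc_y=0.0))
    print("x and y clustered:  ", ols_coverage(icc_x=0.9, icc_y=0.5))
```

In this sketch, coverage should sit close to the nominal 95% when neither variable is clustered and fall well below it when both the outcome and a continuous explanatory variable are strongly clustered, which is in line with the pattern the abstract reports for the cluster-unadjusted model.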

Ntani, Georgia, Inskip, Hazel, Osmond, Clive and Coggon, David (2021) Consequences of ignoring clustering in linear regression. BMC Medical Research Methodology, 21 (1), [139]. (doi:10.1186/s12874-021-01333-7).

Record type: Article

Text
Consequences of ignoring clustering - clean - accepted - Accepted Manuscript
Available under License Creative Commons Attribution.

More information

Accepted/In Press date: 14 June 2021
e-pub ahead of print date: 7 July 2021
Published date: 7 July 2021
Funding information: During completion of this work, GN was supported by the Colt Foundation (PhD scholarship) and a grant award from Versus Arthritis (formerly Arthritis Research UK) (22090). The funding bodies were not involved in the study design, data analysis, interpretation of results or in writing the manuscript.
Publisher copyright: © 2021, The Author(s).
Keywords: Bias, Clustering, Comparison, Consequences, Linear regression, Random intercept model, Simulation

Identifiers

Local EPrints ID: 449978
URI: http://eprints.soton.ac.uk/id/eprint/449978
ISSN: 1471-2288
PURE UUID: 324e70f3-c9a7-44ef-825f-cc93abd6d1ce
ORCID for Hazel Inskip: orcid.org/0000-0001-8897-1749
ORCID for Clive Osmond: orcid.org/0000-0002-9054-4655
ORCID for David Coggon: orcid.org/0000-0003-1930-3987

Catalogue record

Date deposited: 01 Jul 2021 16:30
Last modified: 17 Mar 2024 06:39

Contributors

Author: Georgia Ntani
Author: Hazel Inskip
Author: Clive Osmond
Author: David Coggon
