The University of Southampton
University of Southampton Institutional Repository

An investigation of penalization and data augmentation to improve convergence of generalized estimating equations for clustered binary outcomes



Geroldinger, Angelika, Blagus, Rok, Ogden, Helen and Heinze, Georg (2022) An investigation of penalization and data augmentation to improve convergence of generalized estimating equations for clustered binary outcomes. BMC Medical Research Methodology, 22 (1), [168]. (doi:10.1186/s12874-022-01641-6).

Record type: Article

Abstract

Background: In binary logistic regression, data are ‘separable’ if there exists a linear combination of explanatory variables that perfectly predicts the observed outcome; in that case, some of the maximum likelihood coefficient estimates do not exist. A popular way to obtain finite estimates even with separable data is Firth’s logistic regression (FL), which was originally proposed to reduce the bias in coefficient estimates. The question of convergence becomes more involved when clustered data, as frequently encountered in clinical research (e.g. data collected in several study centers, or individuals contributing multiple observations), are analyzed using marginal logistic regression models fitted by generalized estimating equations (GEE). From our experience, we suspect that separable data are a sufficient but not a necessary condition for non-convergence of GEE. Thus, we expect that generalizations of approaches that can handle separable uncorrelated data may reduce but not fully remove the non-convergence issues of GEE.

Methods: We investigate one recently proposed and two new extensions of FL to GEE. With ‘penalized GEE’, the GEE are treated as score equations, i.e. as derivatives of a log-likelihood set to zero, which are then modified as in FL. We introduce two approaches motivated by the equivalence of FL and maximum likelihood estimation with iteratively augmented data. Specifically, we consider fully iterated and single-step versions of this ‘augmented GEE’ approach. We compare the three approaches with respect to convergence behavior, practical applicability and performance, using simulated data and a real data example.

Results: Our simulations indicate that all three extensions of FL to GEE substantially improve convergence compared to ordinary GEE, while showing similar or even better performance in terms of accuracy of coefficient estimates and predictions. Penalized GEE often slightly outperforms the augmented GEE approaches, but this comes at the cost of a higher implementation burden.

Conclusions: When fitting marginal logistic regression models using GEE on sparse data, we recommend applying penalized GEE if a suitable software implementation is available, and single-step augmented GEE otherwise.
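
The equivalence between FL and maximum likelihood estimation on augmented data, which the abstract cites as motivation for ‘augmented GEE’, can be illustrated for the ordinary (independent-data) logistic model. The sketch below is not the authors' implementation and does not cover the GEE extensions; the function names and the toy data set are hypothetical, chosen only for illustration. It assumes the standard form of Firth's modified score, X'(y - pi + h(1/2 - pi)), with h the diagonal of the hat matrix, and checks that the same score arises from a weighted data set in which each observation appears once with weight 1 and twice more, with responses y and 1 - y, each with weight h/2.

```python
import numpy as np

def probs_and_hat(X, beta):
    """Fitted probabilities, hat-matrix diagonal and inverse Fisher information."""
    pi = 1.0 / (1.0 + np.exp(-X @ beta))
    w = pi * (1.0 - pi)
    info_inv = np.linalg.inv(X.T @ (w[:, None] * X))   # (X' W X)^{-1}
    h = w * np.einsum("ij,jk,ik->i", X, info_inv, X)   # h_i = w_i x_i' (X'WX)^{-1} x_i
    return pi, h, info_inv

def firth_modified_score(X, y, beta):
    """Firth-type modified score for logistic regression (assumed standard form)."""
    pi, h, _ = probs_and_hat(X, beta)
    return X.T @ (y - pi + h * (0.5 - pi))

def augmented_data_score(X, y, beta):
    """Weighted ML score on the augmented data, with h evaluated at the current beta."""
    pi, h, _ = probs_and_hat(X, beta)
    X_aug = np.vstack([X, X, X])
    y_aug = np.concatenate([y, y, 1.0 - y])
    wts = np.concatenate([np.ones_like(y), h / 2.0, h / 2.0])
    return X_aug.T @ (wts * (y_aug - np.tile(pi, 3)))

def fit_firth(X, y, max_iter=100, tol=1e-8):
    """Fisher-scoring iteration on the modified score (no step-halving safeguard)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        _, _, info_inv = probs_and_hat(X, beta)
        step = info_inv @ firth_modified_score(X, y, beta)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Hypothetical toy data: x perfectly separates the outcome, so ordinary
# maximum likelihood estimates of the slope do not exist.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])

beta_hat = fit_firth(X, y)
print("Firth-type estimates (intercept, slope):", beta_hat)
```

Because the weights h/2 depend on the current coefficients, the augmented data set has to be recomputed as the fit proceeds; this is the sense in which the augmentation is iterative.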

Text
s12874-022-01641-6 - Version of Record
Available under License Creative Commons Attribution.

More information

Accepted/In Press date: 23 May 2022
Published date: December 2022
Additional Information: Funding Information: This work has partly been funded by the Austrian Science Fund (FWF), award I2276-N33. Publisher Copyright: © 2022, The Author(s).
Keywords: Clustered data, Firth's logistic regression, Generalized estimating equations, Logistic regression, Non-convergence, Separation

Identifiers

Local EPrints ID: 458078
URI: http://eprints.soton.ac.uk/id/eprint/458078
ISSN: 1471-2288
PURE UUID: 94978fc5-1b10-4320-b043-83df9c0fd56b
ORCID for Helen Ogden: orcid.org/0000-0001-7204-9776

Catalogue record

Date deposited: 28 Jun 2022 16:37
Last modified: 17 Mar 2024 03:33

Contributors

Author: Angelika Geroldinger
Author: Rok Blagus
Author: Helen Ogden
Author: Georg Heinze
