Collinearity: a review of methods to deal with it and a simulation study evaluating their performance
027-046
Dormann, Carsten F.
Elith, Jane
Bacher, Sven
Buchmann, Carsten
Carl, Gudrun
Carré, Gabriel
Marquéz, Jaime R. García
Gruber, Bernd
Lafourcade, Bruno
Leitão, Pedro J.
Münkemüller, Tamara
McClean, Colin
Osborne, Patrick E.
Reineking, Björn
Schröder, Boris
Skidmore, Andrew K.
Zurell, Damaris
Lautenbach, Sven
January 2013
Dormann, Carsten F., Elith, Jane, Bacher, Sven, Buchmann, Carsten, Carl, Gudrun, Carré, Gabriel, Marquéz, Jaime R. García, Gruber, Bernd, Lafourcade, Bruno, Leitão, Pedro J., Münkemüller, Tamara, McClean, Colin, Osborne, Patrick E., Reineking, Björn, Schröder, Boris, Skidmore, Andrew K., Zurell, Damaris and Lautenbach, Sven
(2013)
Collinearity: a review of methods to deal with it and a simulation study evaluating their performance.
Ecography, 36 (1), 27-46.
(doi:10.1111/j.1600-0587.2012.07348.x).
Abstract
Collinearity refers to the non-independence of predictor variables, usually in a regression-type analysis. It is a common feature of any descriptive ecological data set and can be a problem for parameter estimation because it inflates the variance of regression parameters and hence potentially leads to the wrong identification of relevant predictors in a statistical model. Collinearity is a severe problem when a model is trained on data from one region or time and used to predict to another with a different or unknown structure of collinearity. To demonstrate the reach of the problem of collinearity in ecology, we show how relationships among predictors differ between biomes, change over spatial scales and through time. Across disciplines, different approaches to addressing collinearity problems have been developed, ranging from clustering of predictors and threshold-based pre-selection, through latent variable methods, to shrinkage and regularisation. Using simulated data with five predictor-response relationships of increasing complexity and eight levels of collinearity, we compared ways to address collinearity with standard multiple regression and machine-learning approaches. We assessed the performance of each approach by testing its impact on prediction to new data. In the extreme, we tested whether the methods were able to identify the true underlying relationship in a training dataset with strong collinearity by evaluating their performance on a test dataset without any collinearity. We found that methods specifically designed for collinearity, such as latent variable methods and tree-based models, did not outperform the traditional GLM and threshold-based pre-selection. Our results highlight the value of GLM in combination with penalised methods (particularly ridge) and threshold-based pre-selection when omitted variables are considered in the final interpretation. However, all approaches tested yielded degraded predictions under change in collinearity structure, and the ‘folklore’ threshold of |r| > 0.7 for correlation coefficients between predictor variables was an appropriate indicator for when collinearity begins to severely distort model estimation and subsequent prediction. The use of ecological understanding of the system in pre-analysis variable selection and the choice of the least sensitive statistical approaches reduce the problems of collinearity, but cannot ultimately solve them.
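As an illustration of the |r| > 0.7 rule and threshold-based pre-selection mentioned in the abstract, the following is a minimal sketch, not the authors' procedure. The function names `correlated_pairs` and `preselect`, the simulated predictors, and the greedy rule of keeping whichever predictor comes first are all assumptions made for this example; in practice the choice of which correlated variable to retain would more plausibly rest on ecological understanding of the system.

```python
import numpy as np


def correlated_pairs(X, names, threshold=0.7):
    """List predictor pairs whose absolute Pearson correlation exceeds threshold."""
    r = np.corrcoef(X, rowvar=False)  # columns of X are predictors
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(r[i, j]) > threshold:
                pairs.append((names[i], names[j], float(r[i, j])))
    return pairs


def preselect(X, names, threshold=0.7):
    """Greedy pre-selection (an assumed rule): keep a predictor only if its
    |r| with every predictor already kept is at or below the threshold."""
    r = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(len(names)):
        if all(r[j, k] <= threshold for k in kept):
            kept.append(j)
    return [names[k] for k in kept]


# Hypothetical simulated predictors: x2 is nearly a copy of x1, x3 is independent.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
names = ["x1", "x2", "x3"]

flagged = correlated_pairs(X, names)  # pairs above the |r| > 0.7 threshold
selected = preselect(X, names)        # predictors retained for modelling

print(flagged)
print(selected)
```

Here only the pair (x1, x2) is flagged and x2 is dropped. As the abstract cautions, the omitted variable should still be considered when interpreting the final model, because an effect attributed to x1 cannot be separated from one due to x2.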
More information
e-pub ahead of print date: 18 May 2012
Published date: January 2013
Organisations:
Civil Maritime & Env. Eng & Sci Unit
Identifiers
Local EPrints ID: 350382
URI: http://eprints.soton.ac.uk/id/eprint/350382
ISSN: 0906-7590
PURE UUID: 1af578c7-6333-4acd-9bcf-acb113803669
Catalogue record
Date deposited: 22 Mar 2013 12:20
Last modified: 15 Mar 2024 03:21
Contributors
Author:
Carsten F. Dormann
Author:
Jane Elith
Author:
Sven Bacher
Author:
Carsten Buchmann
Author:
Gudrun Carl
Author:
Gabriel Carré
Author:
Jaime R. García Marquéz
Author:
Bernd Gruber
Author:
Bruno Lafourcade
Author:
Pedro J. Leitão
Author:
Tamara Münkemüller
Author:
Colin McClean
Author:
Patrick E. Osborne
Author:
Björn Reineking
Author:
Boris Schröder
Author:
Andrew K. Skidmore
Author:
Damaris Zurell
Author:
Sven Lautenbach