Feature selection approaches for predictive modelling of groundwater nitrate pollution: An evaluation of filters, embedded and wrapper methods
Recognising the various sources of nitrate pollution and understanding system dynamics are fundamental to tackling groundwater quality problems. A comprehensive GIS database of twenty parameters describing hydrogeological and hydrological features and driving forces was used as input for predictive models of nitrate pollution. Additionally, key variables extracted from remotely sensed Normalised Difference Vegetation Index (NDVI) time series were included in the database to provide indications of agroecosystem dynamics. Many approaches can be used to evaluate feature importance related to groundwater pollution caused by nitrates. Filter, wrapper and embedded methods were used to rank feature importance according to the probability of occurrence of nitrates above a threshold value in groundwater. Machine learning algorithms (MLA) such as Classification and Regression Trees (CART), Random Forest (RF) and Support Vector Machines (SVM) were used as wrappers with four different sequential search approaches: sequential backward selection (SBS), sequential forward selection (SFS), sequential forward floating selection (SFFS) and sequential backward floating selection (SBFS). Feature importance obtained from RF and CART was used as an embedded approach. RF with SFFS had the best performance (mean misclassification error, mmce = 0.12; AUC = 0.92) and good interpretability, selecting three features related to groundwater polluted areas: i) a rating of industries and facilities according to their production capacity and total nitrogen emissions to water within a 3 km buffer, ii) a rating of livestock farms by manure production within a 5 km buffer, and iii) cumulated NDVI for the post-maximum month, used as a proxy of vegetation productivity and crop yield.
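The sequential forward floating selection (SFFS) search named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `sffs`, the feature names and the toy scoring function are all hypothetical, chosen only to show the search logic. In a wrapper setting the score would typically be a cross-validated performance measure (for example AUC) of a classifier such as RF trained on the candidate subset; here a simple additive score stands in for it.

```python
from typing import Callable, Dict, FrozenSet, Sequence, Tuple

Subset = FrozenSet[str]


def sffs(
    features: Sequence[str],
    score: Callable[[Subset], float],
    k: int,
) -> Dict[int, Tuple[float, Subset]]:
    """Sequential forward floating selection up to subset size k.

    Returns, for each subset size visited, the best (score, subset) found.
    Higher scores are assumed to be better.
    """
    if not 1 <= k <= len(features):
        raise ValueError("k must be between 1 and the number of features")

    selected: Subset = frozenset()
    best: Dict[int, Tuple[float, Subset]] = {}

    while len(selected) < k:
        # Forward step: add the single feature that gives the highest score.
        candidates = [selected | {f} for f in features if f not in selected]
        selected = max(candidates, key=score)
        size, s = len(selected), score(selected)
        if size not in best or s > best[size][0]:
            best[size] = (s, selected)
        else:
            # A better subset of this size was found earlier; resume from it.
            selected = best[size][1]

        # Floating step: drop features while doing so beats the best
        # subset previously recorded for the smaller size.
        while len(selected) > 2:
            reduced = max((selected - {f} for f in sorted(selected)), key=score)
            rs = score(reduced)
            if rs > best[len(reduced)][0]:
                best[len(reduced)] = (rs, reduced)
                selected = reduced
            else:
                break

    return best


# Toy illustration with hypothetical feature names and integer weights.
# The score is an assumption: summed weights minus one point per feature.
weights = {"industry_3km": 50, "livestock_5km": 30, "ndvi_post_max": 20, "slope": 5}


def toy_score(subset: Subset) -> int:
    return sum(weights[f] for f in subset) - len(subset)


result = sffs(list(weights), toy_score, k=3)
print(sorted(result[3][1]), result[3][0])
```

Plain SFS corresponds to the forward step alone; the inner loop is what makes the search "floating", since it lets a feature chosen early be discarded once a better smaller subset becomes reachable.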
Embedded methods, Feature selection, Groundwater, Machine learning algorithms, Nitrates, Wrapper methods
661-672
Rodriguez-Galiano, V. F.
Luque-Espinar, J. A.
Chica-Olmo, M.
Mendes, M. P.
15 May 2018
Rodriguez-Galiano, V. F., Luque-Espinar, J. A., Chica-Olmo, M. and Mendes, M. P. (2018) Feature selection approaches for predictive modelling of groundwater nitrate pollution: An evaluation of filters, embedded and wrapper methods. Science of the Total Environment, 624, 661-672. (doi:10.1016/j.scitotenv.2017.12.152).
Text: Rodriguez-Galiano et al 2018 - Accepted Manuscript
More information
Accepted/In Press date: 13 December 2017
e-pub ahead of print date: 27 December 2017
Published date: 15 May 2018
Identifiers
Local EPrints ID: 417339
URI: http://eprints.soton.ac.uk/id/eprint/417339
ISSN: 0048-9697
PURE UUID: ab8f374b-cb48-4132-85b9-aa39015627b7
Catalogue record
Date deposited: 30 Jan 2018 17:30
Last modified: 16 Mar 2024 06:10
Contributors
Author: V. F. Rodriguez-Galiano
Author: J. A. Luque-Espinar
Author: M. Chica-Olmo
Author: M. P. Mendes