The University of Southampton
University of Southampton Institutional Repository

Robust subspace methods for outlier detection in genomic data circumvents the curse of dimensionality


Shetta, Omar and Niranjan, Mahesan (2020) Robust subspace methods for outlier detection in genomic data circumvents the curse of dimensionality. Royal Society Open Science, 7 (2), [190714]. (doi:10.1098/rsos.190714).

Record type: Article

Abstract

The application of machine learning to inference problems in biology is dominated by the supervised learning problems of regression and classification, and the unsupervised learning problems of clustering and variants of low-dimensional projections for visualization. A class of problems that has not gained much attention is the detection of outliers in datasets. Outliers can arise from gross experimental, reporting or labelling errors, or they can be small parts of a dataset that are functionally distinct from the majority of a population. Outliers are often identified by considering the probability density of normal data and comparing data likelihoods against some threshold. This classical approach suffers from the curse of dimensionality, a serious problem for omics data, which are often very high-dimensional. We develop an outlier detection method based on structured low-rank approximation. The objective function includes a regularizer based on neighbourhood information captured in the graph Laplacian. Results on publicly available genomic data show that our method robustly detects outliers, whereas a density-based method fails even at moderate dimensions. Moreover, we show that our method gives better clustering and visualization performance on the recovered low-dimensional projection than popular dimensionality reduction techniques.
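
To make the idea in the abstract concrete, the following is a minimal Python sketch of outlier scoring with a graph-Laplacian-regularized low-rank approximation. It is not the authors' Matlab implementation and does not reproduce the paper's objective function. It is a simplified two-step stand-in, assuming quadratic graph smoothing followed by a truncated SVD, and all function names, parameter values (`gamma`, `k`, `rank`) and the synthetic data are hypothetical choices made for illustration.

```python
import numpy as np


def knn_graph_laplacian(X, k=5):
    """Unnormalised Laplacian of a symmetrised k-nearest-neighbour graph over the rows of X."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # squared Euclidean distances
    np.fill_diagonal(d2, np.inf)                      # a sample is not its own neighbour
    nbrs = np.argsort(d2, axis=1)[:, :k]              # indices of the k nearest samples
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    W = np.maximum(W, W.T)                            # symmetrise the adjacency matrix
    return np.diag(W.sum(axis=1)) - W


def graph_regularised_low_rank(X, rank, gamma=0.1, k=5):
    """Illustrative stand-in (an assumption, not the paper's solver).

    Step 1 solves min_Z ||X - Z||_F^2 + gamma * tr(Z^T Lap Z) in closed form,
    Z = (I + gamma * Lap)^{-1} X. Step 2 keeps the best rank-`rank`
    approximation of Z via a truncated SVD.
    """
    n = X.shape[0]
    lap = knn_graph_laplacian(X, k)
    Z = np.linalg.solve(np.eye(n) + gamma * lap, X)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]


def outlier_scores(X, rank, gamma=0.1, k=5):
    """Score each sample by the norm of its residual from the low-rank fit."""
    low_rank = graph_regularised_low_rank(X, rank, gamma, k)
    return np.linalg.norm(X - low_rank, axis=1)


def make_toy_data(n=100, p=500, rank=3, n_out=5, seed=0):
    """Synthetic data (hypothetical): low-rank inliers plus a few planted off-subspace rows."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, rank)) @ rng.normal(size=(rank, p))
    X += 0.1 * rng.normal(size=(n, p))
    planted = rng.choice(n, size=n_out, replace=False)
    X[planted] = 3.0 * rng.normal(size=(n_out, p))
    return X, planted


if __name__ == "__main__":
    X, planted = make_toy_data()
    scores = outlier_scores(X, rank=3)
    flagged = np.argsort(scores)[-len(planted):]      # highest-scoring samples
    print("planted:", sorted(planted.tolist()))
    print("flagged:", sorted(flagged.tolist()))
```

In this sketch the data have more features than samples (500 against 100), the regime in which the abstract says density estimation struggles. The residual score never estimates a density: it relies only on the inliers sharing a low-dimensional subspace and on neighbourhood structure among the samples.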

Text
rsos.190714 - Version of Record
Available under License Creative Commons Attribution.

More information

Accepted/In Press date: 12 December 2019
e-pub ahead of print date: 5 February 2020
Published date: 5 February 2020
Additional Information:
Data accessibility: All data used are publicly available. Matlab code is available on GitHub: https://github.com/omarshetta/Manuscript_Royal_Society
Authors’ contributions: O.S. and M.N. jointly designed the study; O.S. carried out the complete simulations; both authors interpreted the results and wrote the manuscript.
Competing interests: We declare we have no competing interests.
Funding: O.S. was supported by the Engineering and Physical Sciences Research Council (EPSRC), and M.N.’s contribution was funded by the EPSRC project: from data to inference (EP/N014189/1).
Acknowledgements: No one contributed to the study that does not meet authorship criteria.
Publisher Copyright: © 2020 The Authors.
Keywords: Dimensionality reduction, Genomics, High-dimensional data, Outlier detection

Identifiers

Local EPrints ID: 437876
URI: http://eprints.soton.ac.uk/id/eprint/437876
ISSN: 2054-5703
PURE UUID: 7d8f7193-309b-4847-8a59-90a9c7c4b4d3
ORCID for Mahesan Niranjan: orcid.org/0000-0001-7021-140X

Catalogue record

Date deposited: 21 Feb 2020 17:31
Last modified: 17 Mar 2024 03:11

Contributors

Author: Omar Shetta
Author: Mahesan Niranjan
