The University of Southampton
University of Southampton Institutional Repository

Outlier detection and subspace learning via structured low rank approximation with applications to omic data


Shetta, Omar Essam (2020) Outlier detection and subspace learning via structured low rank approximation with applications to omic data. University of Southampton, Doctoral Thesis, 163pp.

Record type: Thesis (Doctoral)

Abstract

Dimensionality reduction is crucial when dealing with data of very high dimensionality and a low number of samples. This is the case with genomic data, where sequencing many genes is much easier than gathering many different samples. The main problem with high-dimensional data is that statistical inference and traditional pattern recognition techniques break down or give misleading results. Therefore, we need to reduce the dimensionality of the data before extracting any useful information from it. A widely used dimensionality reduction technique is Principal Component Analysis (PCA). However, it is known from the literature that this method breaks down in the presence of even a small number of outliers in the data. We have reason to believe that outliers are present in genomic data due to shortcomings of the experimental equipment used, sensor malfunctions, and mistakes in the sample gathering process. Moreover, outliers could be samples that are of interest in the problem being investigated and need to be retained for further investigation.
In this work we will investigate low rank approximation methods that are robust to outliers, many of which have already been introduced in the machine learning community and are formulated as convex optimization problems. The main advantage of the convexity of these problems is that they can be solved iteratively and efficiently using first-order optimization algorithms. However, outlier-robust low rank approximation models such as Outlier Pursuit (OP), which is optimal for high-dimensional genomic datasets, assume that the data lie approximately along a low-dimensional linear subspace, which is a strong assumption when dealing with gene expression or any other biological dataset. Inspired by previous work in the computer vision community, we add a graph regularization term to OP, building a graph between the data points to model the local geometric structure of the input data. This algorithm is called Graph regularized Outlier Pursuit (GOP), and it has the advantage of remaining a convex optimization problem. We will show the effectiveness of both techniques in outlier detection and low-dimensional visualization on high-dimensional genomic datasets. Furthermore, we show that GOP and OP give better outlier detection results than traditional density-based methods used for anomaly detection. Moreover, we will show the enhanced visualization capability of GOP when compared to OP, PCA, and t-distributed Stochastic Neighbour Embedding (t-SNE).
Stemming from GOP, this work also proposes a novel method for multi-view clustering based on subspace learning, dubbed Convex Graph regularized Robust Multi-view Subspace Learning (CGRMSL). CGRMSL is robust to outliers and incorporates the non-linearities present in the different views. Moreover, the proposed multi-view method is also based on a convex objective function, which guarantees a globally optimal solution. We will investigate the power of this novel method on cancer multi-omic datasets for applications such as cancer subtype clustering and cancer subtype discovery.
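
As an illustration of the first-order approach mentioned in the abstract, the following is a minimal sketch and not the thesis implementation. It assumes the formulation commonly used in the Outlier Pursuit literature, which this record does not state: a data matrix X (features by samples) is split into a low rank part L and a column-sparse outlier part C by penalising the nuclear norm of L and the sum of column l2 norms of C. The sketch solves a relaxed version with a quadratic fit term by alternating two proximal steps. The function names, the relaxation, and the parameters `lam` and `mu` are hypothetical choices made for this example, and the graph regularization term of GOP is not included.

```python
import numpy as np


def svt(M, tau):
    """Singular value thresholding: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt


def column_shrink(M, tau):
    """Column-wise shrinkage: proximal operator of tau * sum of column l2 norms."""
    norms = np.linalg.norm(M, axis=0)
    # Columns with norm below tau are set to zero; the rest are scaled down.
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return M * scale


def objective(X, L, C, lam, mu):
    """Relaxed objective (an assumption of this sketch):
    ||L||_* + lam * sum_j ||C[:, j]||_2 + ||X - L - C||_F^2 / (2 * mu)."""
    nuc = np.linalg.norm(L, ord="nuc")
    l21 = np.linalg.norm(C, axis=0).sum()
    fit = np.linalg.norm(X - L - C, ord="fro") ** 2 / (2.0 * mu)
    return nuc + lam * l21 + fit


def relaxed_outlier_pursuit(X, lam, mu=1.0, n_iter=50):
    """Alternate exact minimisation over L and over C for the relaxed objective."""
    L = np.zeros_like(X, dtype=float)
    C = np.zeros_like(X, dtype=float)
    history = []
    for _ in range(n_iter):
        L = svt(X - C, mu)                  # update the low rank part
        C = column_shrink(X - L, lam * mu)  # update the column-sparse part
        history.append(objective(X, L, C, lam, mu))
    return L, C, history
```

In a sketch like this, samples whose columns in C keep a large l2 norm would be the candidates for outliers, while L gives the low rank approximation used for visualization. Because each step minimises the relaxed objective exactly over one block, the recorded objective values should not increase from one sweep to the next.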

Text
Omar_Essam_Shetta_PhD_Thesis
Available under License University of Southampton Thesis Licence.
Download (2MB)

More information

Published date: September 2020

Identifiers

Local EPrints ID: 446975
URI: http://eprints.soton.ac.uk/id/eprint/446975
PURE UUID: 145d340d-f48f-4875-89e5-e7a0e22b21db
ORCID for Mahesan Niranjan: orcid.org/0000-0001-7021-140X

Catalogue record

Date deposited: 01 Mar 2021 17:31
Last modified: 17 Mar 2024 03:11

Contributors

Author: Omar Essam Shetta
Thesis advisor: Mahesan Niranjan
