Outlier detection and subspace learning via structured low rank approximation with applications to omic data
Shetta, Omar Essam (2020) Outlier detection and subspace learning via structured low rank approximation with applications to omic data. University of Southampton, Doctoral Thesis, 163pp.
Record type: Thesis (Doctoral)
Abstract
Dimensionality reduction is crucial when dealing with data of very high dimensionality and a low number of samples. This is the case with genomic data, where sequencing many genes is much easier than gathering many different samples. The main problem with high-dimensional data is that statistical inference and traditional pattern recognition techniques break down or give misleading results. Therefore, we need to reduce the dimensionality of the data before extracting any useful information from it. A widely used dimensionality reduction technique is Principal Component Analysis (PCA). However, it is known from the literature that this method breaks down in the presence of even a small number of outliers in the data. We have reason to believe that outliers are present in genomic data due to shortcomings of the experimental equipment used, sensor malfunctions, and mistakes in the sample gathering process. Moreover, outliers could be samples that are of interest in the problem being investigated, and need to be retained for further investigation.
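To make the sensitivity of PCA to outliers concrete, the following minimal Python sketch (illustrative only and not taken from the thesis; the toy data and parameters are invented) shows how a few grossly outlying samples can pull the leading principal component away from the true direction of variation.

import numpy as np

# Toy "gene expression" matrix: 200 features, 30 samples lying near a
# one-dimensional subspace, plus 3 gross outlier samples (all invented).
rng = np.random.default_rng(0)
n_features, n_samples, n_outliers = 200, 30, 3

direction = rng.standard_normal(n_features)
direction /= np.linalg.norm(direction)
clean = np.outer(direction, rng.standard_normal(n_samples)) \
        + 0.01 * rng.standard_normal((n_features, n_samples))
outliers = 10.0 * rng.standard_normal((n_features, n_outliers))
data = np.hstack([clean, outliers])

def leading_pc(X):
    # Leading left singular vector of the column-centred matrix (features x samples).
    Xc = X - X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, 0]

print("alignment with true direction, clean data:    %.3f" % abs(direction @ leading_pc(clean)))
print("alignment with true direction, with outliers: %.3f" % abs(direction @ leading_pc(data)))

The exact numbers depend on the random seed, but with outliers this large the alignment computed on the contaminated data typically drops well below the clean-data value.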
In this work we investigate low rank approximation methods that are robust to outliers, many of which have already been introduced in the machine learning community and are formulated as convex optimization problems. The main advantage of convexity is that these problems can be solved iteratively and efficiently using first-order optimization algorithms. However, outlier-robust low rank approximation models such as Outlier Pursuit (OP), which is well suited to high-dimensional genomic datasets, assume that the data lies approximately along a low-dimensional linear subspace; this is a strong assumption when dealing with gene expression or any other biological dataset. Inspired by previous work in the computer vision community, we exploit the usefulness of adding a graph regularization term to OP, building a graph between the data points to model the local geometric structure of the input data. This algorithm, called Graph regularized Outlier Pursuit (GOP), retains the advantage of being a convex optimization problem. We show the effectiveness of both techniques in outlier detection and low-dimensional visualization on high-dimensional genomic datasets. Furthermore, we show that GOP and OP give better outlier detection results than traditional density-based methods used for anomaly detection. Moreover, we show the enhanced visualization capability of GOP when compared to OP, PCA, and t-distributed Stochastic Neighbour Embedding (t-SNE).
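For reference, the Outlier Pursuit decomposition and a graph-regularized variant of the kind described above are commonly written in the literature as follows (a sketch in standard notation; the exact formulation, weights, and constraint handling used in the thesis may differ):

% Outlier Pursuit: split the data matrix M into a low-rank part L and a
% column-sparse part C that absorbs the outlying samples (columns).
\min_{L,\,C}\ \|L\|_{*} \;+\; \lambda \|C\|_{2,1}
\quad \text{subject to} \quad M = L + C

% Graph-regularized variant: add a term coupling columns of L that are
% neighbours in a sample graph, with graph Laplacian \Phi and weight \gamma.
\min_{L,\,C}\ \|L\|_{*} \;+\; \lambda \|C\|_{2,1} \;+\; \gamma\,\mathrm{tr}\!\left(L\,\Phi\,L^{\top}\right)
\quad \text{subject to} \quad M = L + C

Here \|L\|_{*} is the nuclear norm (sum of singular values), which promotes low rank, and \|C\|_{2,1} is the sum of column-wise \ell_{2} norms, which drives whole columns of C to zero so that only outlying samples are absorbed.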
Stemming from GOP, this work also proposes a novel method for multi-view clustering based on subspace learning, dubbed Convex Graph regularized Robust Multi-view Subspace Learning (CGRMSL). CGRMSL is robust to outliers and incorporates the non-linearities present in the different views. Moreover, the proposed multi-view method is based on a convex objective function, which guarantees a globally optimal solution. We investigate the power of this novel method on cancer multi-omic datasets for applications such as cancer subtype clustering and cancer subtype discovery.
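The CGRMSL objective itself is defined in the thesis; purely for orientation, one generic way multi-view graph regularization is set up in related subspace-learning literature (this is not the CGRMSL formulation, and the per-view terms, constraints, and weights below are assumptions for illustration) is:

% Generic multi-view sketch: one graph Laplacian \Phi_v per view, each
% encouraging the shared low-rank representation L to respect that view's
% local geometry, with per-view weights \gamma_v.
\min_{L,\,C}\ \|L\|_{*} \;+\; \lambda \|C\|_{2,1} \;+\; \sum_{v=1}^{V} \gamma_{v}\,\mathrm{tr}\!\left(L\,\Phi_{v}\,L^{\top}\right)
\quad \text{subject to} \quad M = L + C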
Text: Omar_Essam_Shetta_PhD_Thesis
Text: PTD_thesis_Shetta-SIGNED (Restricted to Repository staff only)
More information
Published date: September 2020
Identifiers
Local EPrints ID: 446975
URI: http://eprints.soton.ac.uk/id/eprint/446975
PURE UUID: 145d340d-f48f-4875-89e5-e7a0e22b21db
Catalogue record
Date deposited: 01 Mar 2021 17:31
Last modified: 17 Mar 2024 03:11
Contributors
Author:
Omar Essam Shetta
Thesis advisor:
Mahesan Niranjan