Outlier detection and subspace learning via structured low rank approximation with applications to omic data
Shetta, Omar Essam (2020) Outlier detection and subspace learning via structured low rank approximation with applications to omic data. University of Southampton, Doctoral Thesis, 163pp.
Record type: Thesis (Doctoral)
Abstract
Dimensionality reduction is crucial when dealing with data of very high dimensionality and a low number of samples. This is the case with genomic data, where sequencing many genes is much easier than gathering many different samples. The main problem with high-dimensional data is that statistical inference and traditional pattern recognition techniques break down or give misleading results. Therefore, we need to reduce the dimensionality of the data before extracting any useful information from it. A widely used dimensionality reduction technique is Principal Component Analysis (PCA). However, it is known from the literature that this method breaks down in the presence of even a small number of outliers in the data. We have reason to believe that outliers are present in genomic data due to shortcomings of the experimental equipment used, sensor malfunctions, and mistakes in the sample gathering process. Moreover, outliers could be samples that are of interest in the problem being investigated, and need to be retained for further investigation.
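To make the sensitivity of PCA to outliers concrete, the following minimal Python sketch (illustrative only and not taken from the thesis; the toy data and parameters are invented) shows how a few grossly outlying samples can pull the leading principal component away from the true direction of variation.

import numpy as np

# Toy "gene expression" matrix: 200 features, 30 samples lying near a
# one-dimensional subspace, plus 3 gross outlier samples (all invented).
rng = np.random.default_rng(0)
n_features, n_samples, n_outliers = 200, 30, 3

direction = rng.standard_normal(n_features)
direction /= np.linalg.norm(direction)
clean = np.outer(direction, rng.standard_normal(n_samples)) \
        + 0.01 * rng.standard_normal((n_features, n_samples))
outliers = 10.0 * rng.standard_normal((n_features, n_outliers))
data = np.hstack([clean, outliers])

def leading_pc(X):
    # Leading left singular vector of the column-centred matrix (features x samples).
    Xc = X - X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, 0]

print("alignment with true direction, clean data:    %.3f" % abs(direction @ leading_pc(clean)))
print("alignment with true direction, with outliers: %.3f" % abs(direction @ leading_pc(data)))

The exact numbers depend on the random seed, but with outliers this large the alignment computed on the contaminated data typically drops well below the clean-data value.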
In this work we investigate low rank approximation methods that are robust to outliers, many of which have already been introduced in the machine learning community and are formulated as convex optimization problems. The main advantage of convexity is that these problems can be solved iteratively and efficiently using first-order optimization algorithms. However, outlier-robust low rank approximation models such as Outlier Pursuit (OP), which is well suited to high-dimensional genomic datasets, assume that the data lies approximately along a low-dimensional linear subspace; this is a strong assumption when dealing with gene expression or any other biological dataset. Inspired by previous work in the computer vision community, we exploit the usefulness of adding a graph regularization term to OP, building a graph between the data points to model the local geometric structure of the input data. This algorithm, called Graph regularized Outlier Pursuit (GOP), retains the advantage of being a convex optimization problem. We show the effectiveness of both techniques in outlier detection and low-dimensional visualization on high-dimensional genomic datasets. Furthermore, we show that GOP and OP give better outlier detection results than traditional density-based methods used for anomaly detection. Moreover, we show the enhanced visualization capability of GOP when compared to OP, PCA, and t-distributed Stochastic Neighbour Embedding (t-SNE).
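For reference, the Outlier Pursuit decomposition and a graph-regularized variant of the kind described above are commonly written in the literature as follows (a sketch in standard notation; the exact formulation, weights, and constraint handling used in the thesis may differ):

% Outlier Pursuit: split the data matrix M into a low-rank part L and a
% column-sparse part C that absorbs the outlying samples (columns).
\min_{L,\,C}\ \|L\|_{*} \;+\; \lambda \|C\|_{2,1}
\quad \text{subject to} \quad M = L + C

% Graph-regularized variant: add a term coupling columns of L that are
% neighbours in a sample graph, with graph Laplacian \Phi and weight \gamma.
\min_{L,\,C}\ \|L\|_{*} \;+\; \lambda \|C\|_{2,1} \;+\; \gamma\,\mathrm{tr}\!\left(L\,\Phi\,L^{\top}\right)
\quad \text{subject to} \quad M = L + C

Here \|L\|_{*} is the nuclear norm (sum of singular values), which promotes low rank, and \|C\|_{2,1} is the sum of column-wise \ell_{2} norms, which drives whole columns of C to zero so that only outlying samples are absorbed.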
Stemming from GOP, this work also proposes a novel method for multi-view clustering based on subspace learning, dubbed Convex Graph regularized Robust Multi-view Subspace Learning (CGRMSL). CGRMSL is robust to outliers and incorporates the non-linearities present in the different views. Moreover, the proposed multi-view method is based on a convex objective function, which guarantees a globally optimal solution. We investigate the power of this novel method on cancer multi-omic datasets for applications such as cancer subtype clustering and cancer subtype discovery.
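The CGRMSL objective itself is defined in the thesis; purely for orientation, one generic way multi-view graph regularization is set up in related subspace-learning literature (this is not the CGRMSL formulation, and the per-view terms, constraints, and weights below are assumptions for illustration) is:

% Generic multi-view sketch: one graph Laplacian \Phi_v per view, each
% encouraging the shared low-rank representation L to respect that view's
% local geometry, with per-view weights \gamma_v.
\min_{L,\,C}\ \|L\|_{*} \;+\; \lambda \|C\|_{2,1} \;+\; \sum_{v=1}^{V} \gamma_{v}\,\mathrm{tr}\!\left(L\,\Phi_{v}\,L^{\top}\right)
\quad \text{subject to} \quad M = L + C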
Text: Omar_Essam_Shetta_PhD_Thesis
Text: PTD_thesis_Shetta-SIGNED (Restricted to Repository staff only)
More information
Published date: September 2020
Identifiers
Local EPrints ID: 446975
URI: http://eprints.soton.ac.uk/id/eprint/446975
PURE UUID: 145d340d-f48f-4875-89e5-e7a0e22b21db
Catalogue record
Date deposited: 01 Mar 2021 17:31
Last modified: 17 Mar 2024 03:11
Contributors
Author:
Omar Essam Shetta
Thesis advisor:
Mahesan Niranjan