Datasets in support of the Southampton doctoral thesis 'Applying large scale metanalysis of transcriptomic data to uncover hyper-responsive genes and prediction via machine learning'

Coke, Brandon (2025) Datasets in support of the Southampton doctoral thesis 'Applying large scale metanalysis of transcriptomic data to uncover hyper-responsive genes and prediction via machine learning'. University of Southampton doi:10.5258/SOTON/D3221 [Dataset]

Record type: Dataset

Abstract

The SQLite databases contain the outputs from the large scale analysis of pre-existing RNA-seq and microarray datasets performed in chapter 2. Both SQLite databases contain the outputs of limma- a package used to perform differential expressed gene analysis on the datasets from Gene Expression Omnibus (GEO)- https://www.ncbi.nlm.nih.gov/geo/. The Schema for both databases are as follows- the data table contains the outputs and statistics from limma. The meta table contains metadata about the number of treated and control samples, the type of experiment conducted and the tissue used. These datasets where used to derive the priors used in chapters 3 to 5 based on the proportion of datasets wherein a given gene is identified as differentially expressed- i.e. p-value below 0.05. Die to the size of the file, this is only available on request, please use https://library.soton.ac.uk/datarequest The machine_learning_input.csv file is a comma delaminated file containing the genomic and transcript based features used to predict a gene's prior in the machine learning models. For more information see the readme file. The RNK files are tab delimited files. The .RNK files' first column is the gene whils the second is the rank from 1 to 0. These files were used to assess the enrichment of desired DEGs across 22 perturbation studies in chapter 2 using GSEA- https://www.gsea-msigdb.org/gsea/index.jsp. 1 represents a gene with the lowest rank- highest priority. Whilst 0 represents the lowest priority for a given gene. The .RDS images are the R images used for the novel GEOreflect approach for ranking DEGs in bulk transcriptomic data developed in chapter 3. They are also needed to run the RShiny application used to showcase the method. The code for which can be found at GitHub (https://github.com/brandoncoke/GEOreflect) as well ain in the GEOreflect_bulk_DEG_analysis.tar. The .RDS files require R and the readRDS() function to load into the environment and contains the percentile matrices used to calculate a platform p-value rank. Within the GEOreflect_bulk_DEG_analysis.tar file is an R script GEOreflect_functions.R which when sourced after loading one of the .RDS images into the R environment enables the user to perform the GEOreflect method on bulk RNA-seq transcriptomic datasets by loading the percentile_matrix_p_value_RNAseq.RDS image. Alternatively when analysing GPL570 microarray datasets the percentile_matrix.RDS file needs to be loaded into the R environment and the appropiate R function then needs to be applied the DEG list. To run the RShiny application ensure both .RDS files are in the directory with the app.R file i.e. after using git clone https://github.com/brandoncoke/GEOreflect move both .RDS files into the GEOreflect directory with the cloned repository. The csv files with the scRNA-seq appended. These files contain the normalised mutual index, adjusted rand index and Silhouette coefficeint obtained when using 6 single cell RNA-sequencing techniques- GEOreflect, Seurat's vst method, CellBRF, genebasis and CellBRF with the 3 sigma rule imposed. This analysis was carried out in chapter 3. These .csvs use their GEO identifier in the file name or for Zheng et al's data from genomics 10X. The name assigned to it via the DuoClustering2018 R package. The machine_learning_input.csv file is a comma delaminated file containing the genomic and transcript based features used to predict a gene's prior in the machine learning models. The inputs from this file were used to develop the machine learning models used in chapter 5. First row- gene is the HNGC identifier for the genes whilst the min_to_be_sig column represents a gene's CDF value at 0.05 for their p-value distribution obtained from the RNA-seq datasets i.e. the target y for the regressor model. The sd column is unused- and was only relevant when calculating the priors using GPL570 microarray data were there can be redundant probes resulting in multiple priors for the same gene. This column would represent the standard deviation.

Text

thesis_readme_D3221.txt - Text

Available under License Creative Commons Attribution.

Download (10kB)

Text

machine_learning_inputs.csv - Dataset

Download (5MB)