READ ME File For 'Dataset title'

Dataset DOI: 10.5258/SOTON/D3533

ReadMe Author: Brandon Coke, University of Southampton https://orcid.org/0000-0002-0847-6885

This dataset supports the thesis entitled 
AWARDED BY: University of Southampton
DATE OF AWARD: 2025


Date of data collection: 2025
Information about geographic location of data collection: 

Licence:
CC BY

Related projects/Funders:
BBRC



--------------------
DATA & FILE OVERVIEW
--------------------

This dataset contains:

[The SQLite databases contain the outputs from the large-scale analysis of pre-existing RNA-seq and microarray datasets performed in chapter 2. Both SQLite databases contain the outputs of limma, a package used to perform differential expression analysis, run on datasets from the Gene Expression Omnibus (GEO): https://www.ncbi.nlm.nih.gov/geo/. The schema for both databases is as follows: the data table contains the outputs and statistics from limma, and the meta table contains metadata about the number of treated and control samples, the type of experiment conducted and the tissue used. These datasets were used to derive the priors used in chapters 3 to 5, based on the proportion of datasets in which a given gene is identified as differentially expressed, i.e. has a p-value below 0.05.
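As a starting point for exploring either database, the sketch below lists every table with its column names and previews a few rows. It is a minimal example using Python's standard sqlite3 module, not part of the thesis code. The function names describe_tables and preview are hypothetical, and no column names are assumed because this README does not list them; only the table names data and meta come from the description above.

```python
import sqlite3


def describe_tables(db_path):
    """Return a dict mapping each table name to its list of column names."""
    con = sqlite3.connect(db_path)
    try:
        tables = [
            row[0]
            for row in con.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for table in tables:
            quoted = table.replace('"', '""')
            # PRAGMA table_info yields one row per column; index 1 is the name.
            columns = con.execute(f'PRAGMA table_info("{quoted}")').fetchall()
            schema[table] = [column[1] for column in columns]
        return schema
    finally:
        con.close()


def preview(db_path, table, limit=5):
    """Return up to `limit` rows from one table as a list of tuples."""
    con = sqlite3.connect(db_path)
    try:
        quoted = table.replace('"', '""')
        return con.execute(f'SELECT * FROM "{quoted}" LIMIT ?', (limit,)).fetchall()
    finally:
        con.close()
```

Calling describe_tables() on one of the .sqlite3 files should report the data and meta tables described above; preview() can then be used to check which columns hold the limma statistics before writing any queries.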

The .RNK files are tab-delimited. The first column is the gene whilst the second is the rank, from 1 to 0. A value of 1 represents the top-ranked (highest-priority) gene, whilst 0 represents the lowest priority. These files were used to assess the enrichment of desired DEGs across 22 perturbation studies in chapter 2 using GSEA: https://www.gsea-msigdb.org/gsea/index.jsp.
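The two-column layout described above can be read without the GSEA software. The following is a minimal sketch in Python, assuming only what is stated here: a tab-delimited file with the gene in the first column and a value between 1 and 0 in the second. Whether the files carry a header row or comment lines is not stated, so the sketch skips any line whose second field is not numeric. The names read_rnk and scores_in_unit_interval are hypothetical.

```python
import csv


def read_rnk(path):
    """Parse a two-column, tab-delimited .rnk file into (gene, score) pairs.

    Returns the pairs sorted from highest score (1, highest priority)
    to lowest score (0, lowest priority).
    """
    ranked = []
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Skip blank lines, short rows and comment lines.
            if len(row) < 2 or row[0].startswith("#"):
                continue
            try:
                score = float(row[1])
            except ValueError:
                # Assumption: a non-numeric second field is a header row.
                continue
            ranked.append((row[0], score))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


def scores_in_unit_interval(ranked):
    """Check that every score lies between 0 and 1 inclusive."""
    return all(0.0 <= score <= 1.0 for _, score in ranked)
```

A quick sanity check after download is to confirm that scores_in_unit_interval(read_rnk(path)) is true for each file.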

The .RDS images are the R images used for the novel GEOreflect approach for ranking DEGs in bulk transcriptomic data developed in chapter 3. They are also needed to run the RShiny application used to showcase the method, the code for which can be found on GitHub (https://github.com/brandoncoke/GEOreflect) as well as in GEOreflect_bulk_DEG_analysis.tar. The .RDS files require R and the readRDS() function to load into the environment, and contain the percentile matrices used to calculate a platform p-value rank. The GEOreflect_bulk_DEG_analysis.tar file includes an R script, GEOreflect_functions.R, which, when sourced after loading one of the .RDS images into the R environment, enables the user to perform the GEOreflect method. For bulk RNA-seq transcriptomic datasets, load the percentile_matrix_p_value_RNAseq.RDS image. For GPL570 microarray datasets, load the percentile_matrix.RDS file instead and apply the appropriate R function to the DEG list. To run the RShiny application, ensure both .RDS files are in the directory containing the app.R file, i.e. after running git clone https://github.com/brandoncoke/GEOreflect, move both .RDS files into the GEOreflect directory of the cloned repository.

The csv files with scRNA-seq appended contain the normalised mutual index, adjusted rand index and Silhouette coefficient obtained when using 6 single-cell RNA-sequencing techniques, including GEOreflect, Seurat's vst method, CellBRF, genebasis and CellBRF with the 3 sigma rule imposed. This analysis was carried out in chapter 3. These .csv files use their GEO identifier in the file name or, for Zheng et al.'s data from 10x Genomics, the name assigned to it via the DuoClustering2018 R package.

The machine_learning_input.csv file is a comma-delimited file containing the genomic and transcript-based features used to predict a gene's prior in the machine learning models. The inputs from this file were used to develop the machine learning models used in chapter 5. The first column (gene) is the HGNC identifier for the gene, whilst the min_to_be_sig column represents a gene's CDF value at 0.05 for its p-value distribution obtained from the RNA-seq datasets, i.e. the target y for the regressor model. The sd column is unused here; it was only relevant when calculating the priors using GPL570 microarray data, where redundant probes can result in multiple priors for the same gene, in which case the column would hold the standard deviation.]
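To illustrate how machine_learning_input.csv might be split into model inputs and target, here is a minimal Python sketch using only the standard library. The column names gene, min_to_be_sig and sd come from the description above; every other column is treated as a feature. Treating empty cells and the string NA as missing values is an assumption, as is keeping non-numeric cells as text. The function name load_ml_inputs is hypothetical and this is not the code used in chapter 5.

```python
import csv

# Assumption: empty cells and the string "NA" both mark missing values.
MISSING = {"", "NA"}


def _parse(value):
    """Convert a CSV cell to float where possible, None if missing."""
    if value is None or value.strip() in MISSING:
        return None
    try:
        return float(value)
    except ValueError:
        return value  # keep non-numeric cells as text


def load_ml_inputs(path, id_col="gene", target_col="min_to_be_sig",
                   skip_cols=("sd",)):
    """Return (genes, feature_names, feature_rows, targets) from the CSV."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        excluded = {id_col, target_col, *skip_cols}
        feature_names = [c for c in reader.fieldnames if c not in excluded]
        genes, feature_rows, targets = [], [], []
        for row in reader:
            genes.append(row[id_col])
            feature_rows.append([_parse(row[c]) for c in feature_names])
            targets.append(_parse(row[target_col]))
    return genes, feature_names, feature_rows, targets
```

The sd column is dropped by default because, as noted above, it is unused for the RNA-seq derived priors.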


--------------------------
METHODOLOGICAL INFORMATION
--------------------------

Description of methods used for collection/generation of data:
.RDS images for GEOreflect and RShiny application: Contains the .RDS files required to perform the GEOreflect re-ranking of bulk transcriptomic data. These files require R and the readRDS() function to load into the environment. Alongside them is the GEOreflect_bulk_DEG_analysis.tar archive, the code from which can also be found at https://github.com/brandoncoke/GEOreflect. Once decompressed it contains the GEOreflect_functions.R script, which holds the code to perform the GEOreflect method, and the code to run an interactive RShiny application. Both require R. Instructions for use can be found at https://github.com/brandoncoke/GEOreflect.

The .RDS images require R and the readRDS() function to load into the environment. Once loaded, each contains a percentile matrix, i.e. the p-value at each percentile for a given gene. This is used to assign the second rank (platform rank) to the gene when using the GEOreflect method for re-ranking. The percentile_matrix_p_value_RNAseq.RDS file is the percentile matrix used when applying the GEOreflect method to RNA-seq datasets. The percentile_matrix.RDS file is the percentile matrix used when applying the GEOreflect method to GPL570 datasets. The GEOreflect_bulk_DEG_analysis.tar archive contains the code for the RShiny application, with instructions to run it at https://github.com/brandoncoke/GEOreflect. To run, this application requires the .RDS images found in this dataset, R, and the dependencies listed in the Git repository https://github.com/brandoncoke/GEOreflect.
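To give a feel for what a percentile matrix lookup involves, the sketch below finds where an observed p-value falls among one gene's percentile values. It is an illustration only, written in Python, and is not the GEOreflect implementation; the authoritative code is GEOreflect_functions.R. The function name percentile_of, the assumption that a gene's row holds non-decreasing p-values at evenly spaced percentiles, and the example numbers are all hypothetical.

```python
from bisect import bisect_right


def percentile_of(p_value, gene_percentiles):
    """Return the fraction of a gene's percentile entries that are <= p_value.

    gene_percentiles is assumed to be a non-decreasing sequence holding the
    p-value observed at each evenly spaced percentile for one gene.
    """
    if not gene_percentiles:
        raise ValueError("gene_percentiles must not be empty")
    # bisect_right counts how many entries are less than or equal to p_value.
    return bisect_right(gene_percentiles, p_value) / len(gene_percentiles)


# Made-up row for a single gene, purely for illustration.
example_row = [0.001, 0.01, 0.05, 0.2, 0.5, 0.9]
position = percentile_of(0.03, example_row)
```

Under these assumptions, a small returned fraction means the observed p-value is unusually low compared with that gene's values across previous datasets.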


.RNK files used for GSEA: The RNK files (tab-delimited files with the gene as the first column and rank from 1 to 0 in the second) were used to assess the enrichment of the desired DEGs (differentially expressed genes) when ranking using limma's p-value, GEOlimma's adjusted B-value and GEOreflect via Gene Set Enrichment Analysis (GSEA), available at https://www.gsea-msigdb.org/gsea/index.jsp. A value of 1 represents the top-ranked (highest-priority) gene.

The .rnk files are used by the GSEA software (https://www.gsea-msigdb.org/gsea/index.jsp) to measure the enrichment of gene sets in a ranked list. The software uses the tab-delimited .rnk files to calculate the GSEA enrichment score and the nominal p-value, which measures the significance of associating a gene set with the ranked gene list. The file names refer to the GEO id of the dataset and the method used to rank the differentially expressed genes.

Machine learning inputs .csv files: The genomic and transcript-based features used to predict a gene's prior in the machine learning models. The first column (gene) is the HGNC identifier for the gene, whilst the min_to_be_sig column represents a gene's CDF value at 0.05 for its p-value distribution obtained from the RNA-seq datasets, i.e. the target y for the regressor model. The sd column represents the standard deviation for an assigned probe. It is relevant for the GPL570-derived priors, where redundant probes targeting the same gene in different regions result in multiple priors being assigned to the same gene.

This dataset contains: the comma-delimited .csv file. The first column is the HGNC symbol for a given gene. The penultimate column (min_to_be_sig) is the prior assigned to the gene and the value the machine learning models were aiming to predict. The sd column is only relevant for the GPL570-derived priors and not for these, because the GPL570-derived priors can have multiple priors assigned to the same gene due to redundant probes targeting the same gene in different regions.
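The prior described above, a gene's CDF value at 0.05 over its p-value distribution, can be read as the fraction of datasets in which the gene's p-value falls at or below 0.05. The Python sketch below shows that calculation for one gene. It is an illustration under stated assumptions, not the thesis code: the function name empirical_cdf_at is hypothetical, and whether the threshold is inclusive (<=) or strict (<) is an assumption.

```python
def empirical_cdf_at(p_values, threshold=0.05):
    """Return the fraction of p-values at or below the threshold.

    p_values holds one p-value per dataset for a single gene; None entries
    (datasets where the gene was not measured) are ignored.
    """
    observed = [p for p in p_values if p is not None]
    if not observed:
        raise ValueError("no p-values supplied")
    # Assumption: the comparison is inclusive of the threshold itself.
    hits = sum(1 for p in observed if p <= threshold)
    return hits / len(observed)


# Made-up p-values for one gene across four datasets.
example_prior = empirical_cdf_at([0.01, 0.2, 0.04, 0.5])
```

In this made-up example two of the four p-values fall at or below 0.05, so the gene's prior would be 0.5.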

scRNA-seq benchmark outputs: The csv files containing the normalised mutual index, adjusted rand index and Silhouette coefficient obtained when using 5 single-cell RNA-sequencing techniques: GEOreflect, Seurat's vst method, CellBRF, genebasis and CellBRF with the 3 sigma rule imposed. These datasets use their GEO identifier in the file name or, for Zheng et al.'s data from 10x Genomics, the name assigned to it via the DuoClustering2018 R package (https://www.bioconductor.org/packages/release/data/experiment/html/DuoClustering2018.html).

These are the comma-delimited .csv files of the outputs obtained after measuring the ability of the 6 feature selection methods to select the most informative genes, judged by how well the selected genes separate the clusters according to three metrics: Silhouette coefficient, adjusted rand index and normalised mutual index.


Methods for processing the data: The .rnk files are used by the GSEA software (https://www.gsea-msigdb.org/gsea/index.jsp) to measure the enrichment of gene sets in a ranked list. The software uses the tab-delimited .rnk files to calculate the GSEA enrichment score and the nominal p-value, which measures the significance of associating a gene set with the ranked gene list. The file names refer to the GEO id of the dataset and the method used to rank the differentially expressed genes.

Software- or Instrument-specific information needed to interpret the data, including software and hardware version numbers:

Standards and calibration information, if appropriate:

Environmental/experimental conditions:

Describe any quality-assurance procedures performed on the data:

People involved with sample collection, processing, analysis and/or submission:


--------------------------
DATA-SPECIFIC INFORMATION 
--------------------------

Number of variables:

Number of cases/rows: For the SQLite databases, a total of 947 datasets from the Affymetrix Human Genome U133 array (GPL570) platform, 200 from the GPL10558 (Illumina HumanHT-12) platform, 78 from the Affymetrix Human Genome U95 Version 2 Array (GPL8300) platform and 232 from the GPL6480 (Agilent Whole Human Genome Microarray) platform were used.

The machine learning input file contains 31 transcriptomic and genomic features for 19,180 genes within the human genome, which were used to build the regressor and classifier models.

Variable list, defining any abbreviations, units of measure, codes or symbols used: None- not applicable
   
Missing data codes: NA

Specialized formats or other abbreviations used: .sqlite3 (SQLite database file)



Date that the file was created: Month, Year