---
title: "Solubility challenge examination of data"
output: html_notebook
---
# Raw Datasets

Investigation into the raw data of the solubility challenge expanded datasets. 

Four data sources, obtained by Adam Atkins:

- Delaney – 1144 compounds from the Syngenta paper http://pubs.acs.org/doi/abs/10.1021/ci034243x
- Huuskonen – 1312 compounds in three separate sets from http://pubs.acs.org/doi/pdf/10.1021/ci9901338, obtained from the AQUASOL database and SRC's PHYSPROP database.
- PubChem – 57859 compounds obtained through the PubChem API from https://pubchem.ncbi.nlm.nih.gov/bioassay/1996 (Aqueous solubility from MLSMR stock solutions), produced by the Burnham Centre for Chemical Genomics.
- Solubility_challenge – 132 compounds (141 compound forms) obtained from the Pfizer Institute for Pharmaceutical Materials Science & Unilever Centre for Molecular Informatics, Department of Chemistry, University of Cambridge: http://pubs.acs.org/doi/pdf/10.1021/ci800058v , http://www-jmg.ch.cam.ac.uk/data/solubility/


Total number of compounds in the combined dataset: 60456 (counting the 141 Solubility Challenge compound forms).

Not all of these compounds were taken through to the descriptor stage. Some were rejected by Dragon due to incorrectly depicted structures. Of those that did have descriptors calculated, some do not have solubility measurements.

- Compounds output by Dragon: 58562
- Compounds rejected by Dragon: 1894
- Compounds without solubility measurements (from total compounds): not yet calculated
- Compounds without solubility measurements (from output compounds): 12

Import the solubility values. These are the compounds output by Dragon, with only their MW descriptor and the solubility value (converted to LogS for all compounds). Some values are currently unrounded and carry many decimal places as an artefact of the conversion. They should be rounded, as the measurements are not known to that many decimal places.

```{r}
# Import the solubility values (Dragon output: MW descriptor plus solubility as LogS)
solubility_output_MW_only <- read.csv("C:\\Users\\nk1g09\\Dropbox\\Solubility_challenge\\R_project\\solubility_output_MW_only.csv")

# Select the columns of interest into a new data frame by subsetting,
# then rename LogS.M. to LogS
solubility_output_LogS_MW <- solubility_output_MW_only[, c("Compound_Identifier", "LogS.M.", "MW", "Source")]
names(solubility_output_LogS_MW)[2] <- "LogS"
# 58562 compounds
```
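
As a minimal sketch of the rounding suggested above, base R's `round()` could be applied to the LogS column after import. The vector `logS_example` and the choice of two decimal places are assumptions for illustration only; the appropriate precision depends on the source measurements.

```{r}
# Hypothetical unrounded LogS values (not taken from the dataset)
logS_example <- c(-3.8239087409, -1.2300000001, NA)

# Round to two decimal places; NA values are passed through unchanged
logS_rounded <- round(logS_example, 2)
logS_rounded
```

On the real data the same call would be applied to `solubility_output_LogS_MW$LogS`.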

# Distribution of data values
```{r}

#histogram of the LogS values
hist(solubility_output_MW_only$LogS.M.)

#histogram of the MW values
hist(solubility_output_MW_only$MW)
```

What is the peak at approximately -3.5? Why would so many compounds fall within that bin, with such a sharp drop in the neighbouring bin?

By default `density()` does not accept missing values, so a density plot can only be produced from complete cases. The rows without LogS values therefore need to be removed, which leaves 58550 compounds in the dataset.
```{r}
# Density plots
# na.omit() drops every row that has an NA in any column, not only in LogS
solubility_output_LogS_MW_complete <- na.omit(solubility_output_LogS_MW)
# 58550 compounds

d <- density(solubility_output_LogS_MW_complete$LogS)
plot(d)
# Kernel density plot for the data. Large peak at approx -4
# TODO: add some descriptions to this plot
```
Kernel density plot of the LogS variable for the complete cases only. There is a large peak at approximately -4. What could be causing this? It comes from the PubChem dataset, as that is by far the largest component of the data. How does the distribution differ for the other sets?

PubChem subset - 55971 compounds
```{r}
PubChem <- subset(solubility_output_LogS_MW_complete, Source == "PubChem")
dPubChem <- density(PubChem$LogS)
plot(dPubChem)
```
Delaney subset - 1144 compounds
```{r}
Delaney <- subset(solubility_output_LogS_MW_complete, Source == "Delaney")
dDelaney <- density(Delaney$LogS)
plot(dDelaney)
```
Huuskonen subset - 1305 compounds
```{r}
Huuskonen <- subset(solubility_output_LogS_MW_complete, Source == "Huuskonen")
dHuuskonen <- density(Huuskonen$LogS)
plot(dHuuskonen)
```
Solubility Challenge subset - 130 compounds
```{r}
Solubility_Challenge <- subset(solubility_output_LogS_MW_complete, Source == "Solubility Challenge")
dSolubility_Challenge <- density(Solubility_Challenge$LogS)
plot(dSolubility_Challenge)
```

Compare the distributions of the various sets. 
```{r}
library(sm)
attach(solubility_output_LogS_MW_complete)

#create value labels
Source.f <- factor(Source, levels=c("Delaney", "Huuskonen", "PubChem", "Solubility Challenge"), labels = c("Delaney", "Huuskonen", "PubChem", "Solubility Challenge"))

#plot densities
sm.density.compare(LogS, Source, xlab="LogS")

# add legend
colfill<-c(2:(2+length(levels(Source.f)))) 
legend("topright", levels(Source.f), fill=colfill)

detach(solubility_output_LogS_MW_complete)
```
The Delaney and Huuskonen distributions appear very similar.
PubChem has a lower peak than the other datasets, but a much sharper one. Is this simply caused by the dataset being significantly larger than the others?
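
One way to probe that question is to recall that a kernel density estimate is normalised to unit area, so the number of compounds does not by itself raise or lower a curve. The sketch below uses two synthetic samples (hypothetical names `x_small` and `x_large`, evenly spaced normal quantiles rather than real data) to illustrate this. It does not account for any bandwidth choices made inside `sm.density.compare`.

```{r}
# Two synthetic samples with the same shape but very different sizes
x_small <- qnorm(ppoints(100))
x_large <- qnorm(ppoints(50000))

d_small <- density(x_small)
d_large <- density(x_large)

# Approximate the area under each curve with a Riemann sum
area <- function(d) sum(d$y) * diff(d$x[1:2])
area_small <- area(d_small)
area_large <- area(d_large)

# Peak heights are similar despite the 500-fold difference in sample size
peak_small <- max(d_small$y)
peak_large <- max(d_large$y)
c(area_small, area_large, peak_small, peak_large)
```

If this holds for the real data, the sharpness of the PubChem curve would reflect how tightly its values cluster rather than how many compounds it contains.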

# Solubility values
33 compounds have a LogS value > 1.
All 33 are contained within the Delaney/Huuskonen datasets.
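
A check of this kind can be sketched with `subset()` and `table()`. The data frame `toy_solubility` below is hypothetical and only illustrates the pattern; on the real data the same expressions would be applied to `solubility_output_LogS_MW_complete`.

```{r}
# Hypothetical toy data (not the real dataset)
toy_solubility <- data.frame(
  LogS = c(1.2, -3.8, 1.6, -4.1, 0.5),
  Source = c("Delaney", "PubChem", "Huuskonen", "PubChem", "Delaney"),
  stringsAsFactors = FALSE
)

# Keep only the compounds with LogS > 1 and count them per source
high_logS <- subset(toy_solubility, LogS > 1)
n_high <- nrow(high_logS)
counts_by_source <- table(high_logS$Source)
counts_by_source
```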

# PubChem missing column
Re-examining the raw data file obtained from the API suggests that an important column was missing from the API/script file downloaded for the PubChem assay.
The assay actually reports qualified results: each value carries one of three qualifiers (>, < or =) relative to the value in the neighbouring column.
Where the qualifier is =, the value can simply be used. Where the qualifier is < or >, the value is only a bound rather than a measurement and cannot simply be used. These compounds should instead be excluded.

The comments on the assay suggest that there is an experimental threshold above which the method cannot obtain accurate readings, possibly 75% of the 200 µM compound load. Converting 150 µM to LogS (log10 of the molar solubility) gives a value of -3.824. This seems to correspond to the position of the peak in the LogS distribution and may be the cause of it.
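
The conversion can be reproduced as a short worked calculation. The variable names are illustrative, and the 75% figure is the tentative threshold mentioned above rather than a confirmed assay parameter.

```{r}
# Compound load and assumed threshold fraction
load_uM <- 200
threshold_fraction <- 0.75

# Threshold concentration in mol/L (1 uM = 1e-6 M)
threshold_M <- load_uM * threshold_fraction * 1e-6

# LogS is log10 of the molar solubility
logS_threshold <- log10(threshold_M)
round(logS_threshold, 3)  # -3.824
```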

To identify the compounds qualified with < or > rather than =, the assay dataset has been re-obtained with the additional column. However, this download does not contain the structural information columns, so the two datasets must be merged.

```{r}
# Read in the re-downloaded assay file; this contains all 57859 compound entries
PubChem_qualifiers <- read.csv("C:\\Users\\nk1g09\\Dropbox\\Solubility_challenge\\Expanded_datasets\\data_from_adam\\PubChem_AID_1996\\AID_1996_qualifiers.csv")

# Merge the new download with the PubChem output (solubility as LogS and MW).
# Both datasets share the Compound_Identifier column ("PU_" + the PubChem CID).
PubChem_qualified_outputs <- merge(PubChem, PubChem_qualifiers, by = "Compound_Identifier")

# Split the merged data by qualifier
PubChem_GT <- subset(PubChem_qualified_outputs, Qualifier.for.solubility == "GT")
PubChem_LT <- subset(PubChem_qualified_outputs, Qualifier.for.solubility == "LT")
PubChem_Equals <- subset(PubChem_qualified_outputs, Qualifier.for.solubility == "Equals")

```

The datasets have been merged and split into three subsets by qualifier: > (16695 compounds), < (3036 compounds) and = (36420 compounds).
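
`merge()` performs an inner join by default: identifiers missing from either file are dropped, and identifiers repeated in one file produce repeated rows. The three subset sizes sum to 56151, slightly more than the 55971 PubChem compounds that went into the merge, so a check for repeated identifiers may be worthwhile. The sketch below uses two hypothetical toy data frames to show both effects and how they could be detected.

```{r}
# Hypothetical toy inputs (not the real files)
toy_values <- data.frame(
  Compound_Identifier = c("PU_1", "PU_2", "PU_3"),
  LogS = c(-3.8, -4.2, -5.0),
  stringsAsFactors = FALSE
)
toy_qualifiers <- data.frame(
  Compound_Identifier = c("PU_1", "PU_2", "PU_2", "PU_4"),
  Qualifier.for.solubility = c("GT", "Equals", "LT", "Equals"),
  stringsAsFactors = FALSE
)

# Inner join: PU_3 is dropped (no qualifier), PU_2 appears twice
toy_merged <- merge(toy_values, toy_qualifiers, by = "Compound_Identifier")

# Identifiers that occur more than once in the qualifier file
dup_ids <- toy_qualifiers$Compound_Identifier[duplicated(toy_qualifiers$Compound_Identifier)]

# Identifiers with values but no qualifier
unmatched <- setdiff(toy_values$Compound_Identifier, toy_qualifiers$Compound_Identifier)

nrow(toy_merged)
dup_ids
unmatched
```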

```{r}
hist(PubChem_GT$LogS)
hist(PubChem_LT$LogS)
hist(PubChem_Equals$LogS)
```

There still appears to be a peak at approximately -3.8 in the PubChem_Equals subset, although it is smaller than the previous peak.

```{r}
library(sm)
attach(PubChem_qualified_outputs)

#create value labels
Qualifier.for.solubility.f <- factor(Qualifier.for.solubility, levels=c("LT", "Equals", "GT"), labels = c("LT", "Equals", "GT"))

#plot densities
sm.density.compare(LogS, Qualifier.for.solubility, xlab="LogS")

# add legend
colfill <- c(2:(2+length(levels(Qualifier.for.solubility.f))))
legend("topright", levels(Qualifier.for.solubility.f), fill=colfill)

detach(PubChem_qualified_outputs)
```

This comparison plot appears to have the subsets incorrectly labelled. Additionally, the densities relative to each other appear incorrect, and the curve for the equals group (marked as the GT group here) looks different from its individual plot.
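
A possible cause of the mislabelling, stated here as an assumption rather than a verified diagnosis: when a grouping column is converted to a factor without explicit levels, R orders the levels alphabetically (Equals, GT, LT), which differs from the LT, Equals, GT order used for the legend. The sketch below uses base R only and a hypothetical data frame `toy_qualified` to show one way of keeping the curves and the legend in the same order, by deriving both from a single factor.

```{r}
# Hypothetical toy data (not the assay data)
toy_qualified <- data.frame(
  LogS = c(seq(-6, -5, length.out = 20),
           seq(-5, -4, length.out = 20),
           seq(-3.9, -3.7, length.out = 20)),
  Qualifier = rep(c("LT", "Equals", "GT"), each = 20),
  stringsAsFactors = FALSE
)

# Default factor levels are alphabetical
default_order <- levels(factor(toy_qualified$Qualifier))

# Explicit levels fix the order used everywhere below
toy_qualified$Qualifier.f <- factor(toy_qualified$Qualifier,
                                    levels = c("LT", "Equals", "GT"))
group_cols <- seq_along(levels(toy_qualified$Qualifier.f)) + 1

# One density per group; split() keeps the factor's level order
dens <- lapply(split(toy_qualified$LogS, toy_qualified$Qualifier.f), density)

# Draw all curves on shared axes, then a legend built from the same names
y_max <- max(sapply(dens, function(d) max(d$y)))
plot(0, 0, type = "n", xlim = range(toy_qualified$LogS) + c(-0.5, 0.5),
     ylim = c(0, y_max), xlab = "LogS", ylab = "Density")
for (i in seq_along(dens)) lines(dens[[i]], col = group_cols[i])
legend("topleft", names(dens), col = group_cols, lty = 1)
```

Because the colours and labels both come from `names(dens)`, they cannot drift out of step with the curves.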

Plotting the densities for the three subsets separately gives the correct positioning for the curves. However, the density for the GT subset peaks at approximately 1500, whereas the LT subset only reaches approximately 2.5 and the equals subset approximately 1.5. Look into how the density is scaled, as the equals subset contains over twice as many compounds as the GT subset.

```{r}
dPubChem_GT <-density(PubChem_GT$LogS)
plot(dPubChem_GT)

dPubChem_LT <-density(PubChem_LT$LogS)
plot(dPubChem_LT)

dPubChem_Equals <-density(PubChem_Equals$LogS)
plot(dPubChem_Equals)
```
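
One possible explanation for the very tall GT peak, offered as a hypothesis rather than a confirmed diagnosis: `density()` returns a curve normalised to unit area, and its default bandwidth (`bw.nrd0`) shrinks when the values have very little spread. If nearly all GT values sit at the same threshold, the bandwidth becomes tiny and the peak correspondingly tall, regardless of how many compounds the subset contains. The sketch below uses synthetic vectors (`x_spread` and `x_tied`, both hypothetical) to illustrate the effect.

```{r}
# Spread-out synthetic values: evenly spaced normal quantiles
x_spread <- qnorm(ppoints(2000), mean = -4.5, sd = 0.5)

# Nearly tied synthetic values: almost everything at one threshold value
x_tied <- c(rep(-3.824, 1990), seq(-3.83, -3.82, length.out = 10))

d_spread <- density(x_spread)
d_tied <- density(x_tied)

# Default bandwidths chosen by density() for each vector
c(spread = d_spread$bw, tied = d_tied$bw)

# Peak heights: the nearly tied vector gives a far taller, narrower spike
c(spread = max(d_spread$y), tied = max(d_tied$y))
```

Checking `dPubChem_GT$bw` against the bandwidths of the other two subsets would show whether this applies to the real data.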

