To generate 3D co-ordinates for the Gale group compounds. Regenerate DRAGON descriptors (2D and 3D)
Use descriptors to carry out PCA (Principal component analysis) and possibly PLS (partial least squares) - which is a supervised version of PCA. Examine if the PCA output can be clustered in any form - how do these clusters/ distribution change between the 2D and 3D descriptors?
131 compounds in file, does contain some duplicates
This operation (invoked using –gen3d at the commandline) generates 3D structures for 0D or 2D structures using the following series of steps:
Tried single molecule (101002_anie201200729-1.mol) - used Gen3D in OpenBabel. 3D co-ordinates were generated and saved to a new mol file. Also can see a comparison of the 2D and 3D structures.
When trying to process multiple molecules at one go - try sdf output format. (append input filename to title, generate 3D co-ordinates)
Appears to correctly generate multi structures in a single sdf file output - however the output appears to number 9 - 1 rather than 1 - 9 as OpenBabel pulls them in in reverse order.
All original PAG compounds selected (131 compounds) and 3D structures generated - output sdf format (3D_Structures_PAG_Originalset.sdf) - appears to give slight differences in the structures each time it is run - how would this affect the descriptors and PCA analysis etc? Second generated file - 3D_Structures_PAG_Originalset2.sdf is the same compounds.
Some of these are duplicates, would it be beneficial to remove all the duplicates before the generation of new descriptors?
Output: 3D_Structures_PAG_Originalset_output1
No of descriptors: 4870? (all 2D & 3D - excluding Charge descriptors)
Exclude descriptors with constant values
Exclude descriptors with at least one missing value
3100 descriptors exported
Output: 3D_Structures_PAG_Originalset_output2
No of descriptors: 4870? (all 2D & 3D - excluding Charge descriptors)
Exclude descriptors with constant values
Exclude descriptors with constant and near-constant values
Exclude descriptors with stanard deviation < 0.0001
Exclude descriptors with at least one missing value
2996 descriptors exported
Output: 3D_Structures_PAG_Originalset2_output1
No of descriptors: 4870 (all 2D & 3D - excluding Charge descriptors)
Exclude descriptors with constant values
Exclude descriptors with at least one missing value
3100 descriptors exported
Output: 3D_Structures_PAG_Originalset2_output2
No of descriptors: 4870 (all 2D & 3D - excluding Charge descriptors)
Exclude descriptors with constant values
Exclude descriptors with constant and near-constant values
Exclude descriptors with stanard deviation < 0.0001
Exclude descriptors with at least one missing value
2996 descriptors exported
Same number of descriptors for the 1st and 2nd sets of molecules.
Output: 2D_Structures_PAG_Originalset_output1
No of descriptors: 4870 (all 2D & 3D - excluding Charge descriptors)
Exclude descriptors with constant values
Exclude descriptors with at least one missing value
2013 descriptors exported
Output: 2D_Structures_PAG_Originalset_output2
No of descriptors: 4870 (all 2D & 3D - excluding Charge descriptors)
Exclude descriptors with constant values
Exclude descriptors with constant and near-constant values
Exclude descriptors with stanard deviation < 0.0001
Exclude descriptors with at least one missing value
1914 descriptors exported ***
PCA is pricinipal components analysis - this is a dimensionality reduction technique that is widely used in data analysis.
https://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/
Some general advice - from https://tgmstat.wordpress.com/2013/11/21/introduction-to-principal-component-analysis-pca/
PCA is not scale invariant, so it is highly recommended to standardize all the {p} variables before applying PCA. Singular Value Decomposition (SVD) is more numerically stable than eigendecomposition and is usually used in practice. How many principal components to retain will depend on the specific application. Plotting {(1-R^2)} versus the number of components can be useful to visualize the number of principal components that retain most of the variability contained in the original data. Two or three principal components can be used for visualization purposes.
PCA is an unsupervised technique - the response variable is not used to determine the direction of the component. Principal components are orthogonal.
These packages may be required to be loaded in order for the system to function correctly.
r
r library() #otherwise ggplot2 errors - required reinstallation
package 㤼㸱lazyeval㤼㸲 was built under R version 3.4.4
r
r library(2)
package 㤼㸱ggplot2㤼㸲 was built under R version 3.4.4
r
r library() #to work with ggplot2? library() #to work with ggplot2? library() #this is a package that is removed from CRAN - had to be downloaded from the archive - requires tidyr and gridExtra to be installed as well library()
Loading required package: magrittr
package 㤼㸱magrittr㤼㸲 was built under R version 3.4.4
r
r library()
Attaching package: 㤼㸱dplyr㤼㸲
The following object is masked from 㤼㸱package:gridExtra㤼㸲:
combine
The following objects are masked from 㤼㸱package:stats㤼㸲:
filter, lag
The following objects are masked from 㤼㸱package:base㤼㸲:
intersect, setdiff, setequal, union
r
r library()
package 㤼㸱yaml㤼㸲 was built under R version 3.4.4
DRAGON output files were converted to .csv file format. The number field (first column) was removed.
`3D_Structures_PAG_Originalset_output2`<- read.table("C:/Users/nk1g09/Dropbox/PAG_group/PCA/DRAGON/3D_Structures_PAG_Originalset_output2.csv", header=TRUE, sep=",", row.names="NAME")
# this brings in the csv file containing descriptors for the 3D structures as a dataframe
`2D_Structures_PAG_Originalset_output2`<- read.table("C:/Users/nk1g09/Dropbox/PAG_group/PCA/DRAGON/2D_Structures_PAG_Originalset_output2.csv", header=TRUE, sep=",", row.names="NAME")
# this brings in the csv file containing descriptors for the 2D structures as a dataframe
To colour the PCA plots by EC50 or compounds groupings it is necessary to merge the DRAGON descriptors with those compound identifiers which are not generated through DRAGON. This includes the experimental descriptors and identifiers such as compound names and molecular formulas.
Compound_Identifiers_classification<-read.table("C:/Users/nk1g09/Dropbox/PAG_group/PCA/PAG_compounds_chemical_values_duplicates.csv", header=TRUE, sep=",", row.names="NAME")
#brings in the compound identifers and experimental descriptors.
joined_3D_PAG_Originalset_output2 <- merge(Compound_Identifiers_classification, `3D_Structures_PAG_Originalset_output2`, by='row.names')
#merges the two datasets by row-names (which are the compound filenames), this gives a 131 by 3007 dataframe
joined_2D_PAG_Originalset_output2 <- merge(Compound_Identifiers_classification, `2D_Structures_PAG_Originalset_output2`, by='row.names')
The PCA can now be run on the DRAGON descriptors, and coloured by various different options.
#need to run the PCA on all columns after the 11th column?
'3D_PAG_Original_output2_PCA2'<- prcomp(joined_3D_PAG_Originalset_output2[,12:3007])
plot(`3D_PAG_Original_output2_PCA2`$x[,1:2])
plot(`3D_PAG_Original_output2_PCA2`$x[,1:2], col=joined_3D_PAG_Originalset_output2$EC50_Cl.NO3.)
legend("topright", pch=20, col=unique(joined_3D_PAG_Originalset_output2$EC50_Cl.NO3.), legend=unique(joined_3D_PAG_Originalset_output2$EC50_Cl.NO3.))
#34 compounds are missing EC50 values, but there seems to be even more missing from this plot, doesn't plot correctly as there are too many values? need to group the EC50 values into bands. in some fashion.
plot(`3D_PAG_Original_output2_PCA2`$x[,1:2], col=joined_3D_PAG_Originalset_output2$Compound_Type, pch=20)
legend("topright", pch=20, col=unique(joined_3D_PAG_Originalset_output2$Compound_Type), legend=unique(joined_3D_PAG_Originalset_output2$Compound_Type))
What about if it is taken by Subtype instead of Type. If there are too many groups then the colour function cannot handle it properly and the colours are repeated. If using >8 groups for the colours then repeated colours will appear. Possibly able to use a different package to generate more colours, or could it use different symbols?
plot(`3D_PAG_Original_output2_PCA2`$x[,1:2], col=joined_3D_PAG_Originalset_output2$Compound_subtype, pch=20)
legend("topright", pch=20, col=unique(joined_3D_PAG_Originalset_output2$Compound_subtype), legend=unique(joined_3D_PAG_Originalset_output2$Compound_subtype))
#too many groups for the colours to work properly.
Need to try the PCA with scaling included in it and see how this changes the plot. in addition how the 2D and 3D plots differ.
#using the scale function to scale the data prior to PCA
#need to run the PCA on all columns after the 11th column?
'3D_PAG_Original_output2_PCA3'<- prcomp(scale(joined_3D_PAG_Originalset_output2[,12:3007]))
plot(`3D_PAG_Original_output2_PCA3`$x[,1:2], col=joined_3D_PAG_Originalset_output2$Compound_Type, pch=20)
legend("topright", pch=20, col=unique(joined_3D_PAG_Originalset_output2$Compound_Type), legend=unique(joined_3D_PAG_Originalset_output2$Compound_Type))
How does the scale() function compare to using scale. in the prcomp function?
#using scale. = TRUE to scale the data for PCA
#need to run the PCA on all columns after the 11th column?
'3D_PAG_Original_output2_PCA4'<- prcomp(joined_3D_PAG_Originalset_output2[,12:3007], scale. = TRUE)
plot(`3D_PAG_Original_output2_PCA4`$x[,1:2], col=joined_3D_PAG_Originalset_output2$Compound_Type, pch=20)
legend("topright", pch=20, col=unique(joined_3D_PAG_Originalset_output2$Compound_Type), legend=unique(joined_3D_PAG_Originalset_output2$Compound_Type))
The two outputs seem to generate the same plot for the PCA.
Follow the same structure for generating the 3D PCA plots, but for the 2D instead so they can be compared. This just used the Compound_Type as colouring as neither Ec50 or subtypes currently colour correctly.
'2D_PAG_Original_output2_PCA'<- prcomp(joined_2D_PAG_Originalset_output2[,12:1925], scale. = TRUE)
plot(`2D_PAG_Original_output2_PCA`$x[,1:2])
plot(`2D_PAG_Original_output2_PCA`$x[,1:2], col=joined_2D_PAG_Originalset_output2$Compound_Type, pch=20)
plot(`2D_PAG_Original_output2_PCA`$x[,1:2], col=joined_2D_PAG_Originalset_output2$Compound_Type, pch=20)
legend("topright", pch=20, col=unique(joined_2D_PAG_Originalset_output2$Compound_Type), legend=unique(joined_2D_PAG_Originalset_output2$Compound_Type), cex=0.75)
#legend was too large
The legends on the existing plots are too large/misplaced so they cover up data that is plotted on the graph.
It is possible to move the legend when using the plot() function, using par() to create an area for the legend.
Using ggplot2 may be a better option if the package can be installed easily.
Comparing plots: multiple plots can be displayed in a single area using the par() function or layout() function creating a plot with multiple subplots.
If using ggplots how would it work to produce multiple plots in a single environment? and how does it get the colouration?
#join PCA output and compound types
PC3Di<-data.frame(`3D_PAG_Original_output2_PCA4`$x,Compound_Type=joined_3D_PAG_Originalset_output2$Compound_Type) # scaled PCA for the 3D structures, combined with the compound type column
PC3Dii<-data.frame(`3D_PAG_Original_output2_PCA4`$x,Compound_Type=joined_3D_PAG_Originalset_output2$Compound_Type, EC50=joined_3D_PAG_Originalset_output2$EC50_Cl.NO3.) #scaled PCA for the 3D structures, combined with the compound type and EC50 value columns
PC3Diii<-data.frame(`3D_PAG_Original_output2_PCA4`$x,Compound_Type=joined_3D_PAG_Originalset_output2$Compound_Type, Log.1.EC50.=joined_3D_PAG_Originalset_output2$Log.1.EC50.) #scaled PCA for the 3D structures, combined with the compound type and log EC50 values
#quick plot of the data
qplot(x=PC1, y=PC2, data=PC3Di, color=Compound_Type)
qplot(x=PC1, y=PC2, data=PC3Dii, color=Compound_Type, size=EC50) # do I need to scale the EC50 values?
#if EC50 equals NA then it should be some different marking, otherwise some form of scale of binned sizes?
qplot(x=PC1, y=PC2, data=PC3Diii, color=Compound_Type, size=Log.1.EC50.) #logec50 values
Check the EC50 values - are these sized correctly? LogEC50 values aren’t very clear for differentiation
Could use autoplot through ggfortify (now outdated?) But this would require extra modification to use scaling and colours from the compound types.
colour the plots for the 2D PCA - by compound type and do size by EC50
#join PCA output and compound types
PC2Di<-data.frame(`2D_PAG_Original_output2_PCA`$x,Compound_Type=joined_2D_PAG_Originalset_output2$Compound_Type) # scaled PCA for the 2D structures, combined with the compound type column
PC2Dii<-data.frame(`2D_PAG_Original_output2_PCA`$x,Compound_Type=joined_2D_PAG_Originalset_output2$Compound_Type, EC50=joined_2D_PAG_Originalset_output2$EC50_Cl.NO3.) #scaled PCA for the 2D structures, combined with the compound type and EC50 value columns
PC2Diii<-data.frame(`2D_PAG_Original_output2_PCA`$x,Compound_Type=joined_2D_PAG_Originalset_output2$Compound_Type, Log.1.EC50.=joined_2D_PAG_Originalset_output2$Log.1.EC50.) #scaled PCA for the 2D structures, combined with compound type and logEC50 values
#quick plot of the data
qplot(x=PC1, y=PC2, data=PC2Di, color=Compound_Type)
qplot(x=PC1, y=PC2, data=PC2Dii, color=Compound_Type, size=EC50)
qplot(x=PC1, y=PC2, data=PC2Diii, color=Compound_Type, size=Log.1.EC50.)
qplot(x=PC1, y=PC2, data=PC2Diii, color=Log.1.EC50., pch = Compound_Type) # not very easy to differentiate
Want a comparison of the 2D PCA and the 3D PCA. Attempting to get two plots to display on the same image. This cannot be done using pa or layout for ggplots however this may be feasible through the use of ggpubr package.
plot2D<- ggplot(PC3Di, aes(x=PC1, y=PC2, color=Compound_Type)) +
geom_point() + ggtitle("PCA of Anion transporter compounds - 3D descriptors")
plot3D<- ggplot(PC2Di, aes(x=PC1, y=PC2, color=Compound_Type)) +
geom_point() + ggtitle("PCA of Anion transporter compounds - 2D descriptors")
par(mfrow=c(2,1))
ggplot(PC3Di, aes(x=PC1, y=PC2, color=Compound_Type)) +
geom_point() + ggtitle("PCA of Anion transporter compounds - 3D descriptors")
ggplot(PC2Di, aes(x=PC1, y=PC2, color=Compound_Type)) +
geom_point() + ggtitle("PCA of Anion transporter compounds - 2D descriptors")
#how to get two ggplot plots to plot together on a single plot?
#can put two normal plots together on a single plot, but what about the ggplots.
#plot(PC3Di$PC1, PC3Di$PC2, col=PC3Di$Compound_Type)
#plot(PC2Di$PC1, PC2Di$PC2, col=PC2Di$Compound_Type)
ggpubr function can be used to arrange multiple plots on a single layout and create extra text, annotations etc. Installation of ggpubr installs gridExtra and cowplot which give extra functionality.
#requires ggpubr package
plot3D<- ggplot(PC3Di, aes(x=PC1, y=PC2, color=Compound_Type)) +
geom_point() + ggtitle("PCA of Anion transporter compounds \n 3D descriptors") +theme(plot.title = element_text(size=10, hjust=0.5))
plot2D<- ggplot(PC2Di, aes(x=PC1, y=PC2, color=Compound_Type)) +
geom_point() + ggtitle("PCA of Anion transporter compounds \n 2D descriptors") +theme(plot.title = element_text(size=10, hjust=0.5))
comparison_plot<- ggarrange(plot2D, plot3D,
labels = c("A", "B"), common.legend=TRUE, legend="bottom" , label.x = 0, label.y = 0, hjust=-0.5, vjust=-0.2)
comparison_plot
ggsave("C:/Users/nk1g09/Dropbox/PAG_group/PCA/R_coding/PCA_Comparison.png", width=6, height=4)
Comparing the 3D and 2D scaled PCA does not produce a large difference in the presence of clusters.
Does the clustering change if it is coloured by Log(1/EC50) instead?
plot3D_2<- ggplot(PC3Diii, aes(x=PC1, y=PC2, color=Log.1.EC50.)) +
geom_point() + ggtitle("PCA of Anion transporter compounds \n 3D descriptors") +theme(plot.title = element_text(size=10, hjust=0.5))
#this colouration is the straight LogEC50 (not Log(1/EC50)) - it also appears to be the ln and not base 10 log. need to check if this is correct.
plot2D_2<- ggplot(PC2Diii, aes(x=PC1, y=PC2, color=Log.1.EC50.)) +
geom_point() + ggtitle("PCA of Anion transporter compounds \n 2D descriptors") +theme(plot.title = element_text(size=10, hjust=0.5))
comparison_plot_2<- ggarrange(plot2D_2, plot3D_2,
labels = c("A", "B"), common.legend=TRUE, legend="bottom" , label.x = 0, label.y = 0, hjust=-0.5, vjust=-0.2)
comparison_plot_2
ggsave("C:/Users/nk1g09/Dropbox/PAG_group/PCA/R_coding/PCA_Comparison_2.png", width=6, height=4)
How much variance do the first two PC values account for in the dataset?
Determining the proportion of variance that is contained within the first 2 PCs for the 2D and 3D descriptors.
The 3D scaled PCA is - 3D_PAG_Original_output2_PCA4 the 2D scaled PCA is - 2D_PAG_Original_output2_PCA
summary(`3D_PAG_Original_output2_PCA4`)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16
Standard deviation 37.7003 18.3958 14.70963 10.07133 9.83226 8.69800 8.301 7.79434 6.81213 6.64735 6.4309 5.76033 5.40156 5.29806 4.79419 4.6444
Proportion of Variance 0.4744 0.1129 0.07222 0.03386 0.03227 0.02525 0.023 0.02028 0.01549 0.01475 0.0138 0.01108 0.00974 0.00937 0.00767 0.0072
Cumulative Proportion 0.4744 0.5874 0.65958 0.69343 0.72570 0.75095 0.774 0.79423 0.80972 0.82447 0.8383 0.84935 0.85908 0.86845 0.87612 0.8833
PC17 PC18 PC19 PC20 PC21 PC22 PC23 PC24 PC25 PC26 PC27 PC28 PC29 PC30 PC31 PC32
Standard deviation 4.32512 4.25001 4.03990 3.89068 3.72504 3.65501 3.45202 3.36349 3.27503 3.24721 3.12952 3.02677 2.99266 2.8468 2.69681 2.66057
Proportion of Variance 0.00624 0.00603 0.00545 0.00505 0.00463 0.00446 0.00398 0.00378 0.00358 0.00352 0.00327 0.00306 0.00299 0.0027 0.00243 0.00236
Cumulative Proportion 0.88957 0.89560 0.90104 0.90610 0.91073 0.91519 0.91917 0.92294 0.92652 0.93004 0.93331 0.93637 0.93936 0.9421 0.94449 0.94685
PC33 PC34 PC35 PC36 PC37 PC38 PC39 PC40 PC41 PC42 PC43 PC44 PC45 PC46 PC47 PC48
Standard deviation 2.60026 2.54154 2.3870 2.36021 2.30247 2.25027 2.22005 2.20094 2.14324 2.13068 2.05753 1.9757 1.94799 1.86588 1.84947 1.81973
Proportion of Variance 0.00226 0.00216 0.0019 0.00186 0.00177 0.00169 0.00165 0.00162 0.00153 0.00152 0.00141 0.0013 0.00127 0.00116 0.00114 0.00111
Cumulative Proportion 0.94911 0.95127 0.9532 0.95503 0.95680 0.95849 0.96013 0.96175 0.96328 0.96480 0.96621 0.9675 0.96878 0.96994 0.97108 0.97219
PC49 PC50 PC51 PC52 PC53 PC54 PC55 PC56 PC57 PC58 PC59 PC60 PC61 PC62 PC63 PC64
Standard deviation 1.77329 1.7271 1.70627 1.68954 1.65833 1.62603 1.56789 1.54263 1.51130 1.49830 1.46513 1.4517 1.40944 1.38904 1.35564 1.33189
Proportion of Variance 0.00105 0.0010 0.00097 0.00095 0.00092 0.00088 0.00082 0.00079 0.00076 0.00075 0.00072 0.0007 0.00066 0.00064 0.00061 0.00059
Cumulative Proportion 0.97324 0.9742 0.97520 0.97616 0.97708 0.97796 0.97878 0.97957 0.98034 0.98108 0.98180 0.9825 0.98317 0.98381 0.98442 0.98502
PC65 PC66 PC67 PC68 PC69 PC70 PC71 PC72 PC73 PC74 PC75 PC76 PC77 PC78 PC79 PC80
Standard deviation 1.31853 1.30024 1.27195 1.23138 1.21421 1.20425 1.18943 1.15265 1.14850 1.12618 1.11179 1.0900 1.08026 1.05284 1.04709 1.01764
Proportion of Variance 0.00058 0.00056 0.00054 0.00051 0.00049 0.00048 0.00047 0.00044 0.00044 0.00042 0.00041 0.0004 0.00039 0.00037 0.00037 0.00035
Cumulative Proportion 0.98560 0.98616 0.98670 0.98721 0.98770 0.98818 0.98866 0.98910 0.98954 0.98996 0.99038 0.9908 0.99116 0.99153 0.99190 0.99224
PC81 PC82 PC83 PC84 PC85 PC86 PC87 PC88 PC89 PC90 PC91 PC92 PC93 PC94 PC95 PC96
Standard deviation 1.00689 0.98558 0.97437 0.95802 0.9457 0.92739 0.92070 0.88670 0.87714 0.85761 0.85316 0.83535 0.81963 0.81142 0.80028 0.7743
Proportion of Variance 0.00034 0.00032 0.00032 0.00031 0.0003 0.00029 0.00028 0.00026 0.00026 0.00025 0.00024 0.00023 0.00022 0.00022 0.00021 0.0002
Cumulative Proportion 0.99258 0.99291 0.99322 0.99353 0.9938 0.99411 0.99440 0.99466 0.99492 0.99516 0.99541 0.99564 0.99586 0.99608 0.99630 0.9965
PC97 PC98 PC99 PC100 PC101 PC102 PC103 PC104 PC105 PC106 PC107 PC108 PC109 PC110 PC111 PC112
Standard deviation 0.7649 0.76021 0.75273 0.74718 0.73183 0.72181 0.70196 0.68991 0.67465 0.64963 0.64358 0.62706 0.62222 0.60842 0.59032 0.58577
Proportion of Variance 0.0002 0.00019 0.00019 0.00019 0.00018 0.00017 0.00016 0.00016 0.00015 0.00014 0.00014 0.00013 0.00013 0.00012 0.00012 0.00011
Cumulative Proportion 0.9967 0.99688 0.99707 0.99726 0.99744 0.99761 0.99778 0.99794 0.99809 0.99823 0.99837 0.99850 0.99863 0.99875 0.99887 0.99898
PC113 PC114 PC115 PC116 PC117 PC118 PC119 PC120 PC121 PC122 PC123 PC124 PC125 PC126 PC127 PC128
Standard deviation 0.57533 0.56329 0.5438 0.52257 0.50533 0.48740 0.47769 0.45341 0.44488 0.42188 0.39742 0.38655 0.34996 0.31899 7.295e-14 2.588e-15
Proportion of Variance 0.00011 0.00011 0.0001 0.00009 0.00009 0.00008 0.00008 0.00007 0.00007 0.00006 0.00005 0.00005 0.00004 0.00003 0.000e+00 0.000e+00
Cumulative Proportion 0.99909 0.99920 0.9993 0.99939 0.99947 0.99955 0.99963 0.99970 0.99976 0.99982 0.99988 0.99993 0.99997 1.00000 1.000e+00 1.000e+00
PC129 PC130 PC131
Standard deviation 2.588e-15 2.588e-15 2.588e-15
Proportion of Variance 0.000e+00 0.000e+00 0.000e+00
Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00
var_explained_3D = `3D_PAG_Original_output2_PCA4`$sdev^2 / sum(`3D_PAG_Original_output2_PCA4`$sdev^2)
barplot(100*var_explained_3D, xlab = '', ylab = '% variance explained')
summary(`2D_PAG_Original_output2_PCA`)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16
Standard deviation 31.3992 12.85031 12.35032 8.70017 7.37012 6.75793 6.47591 6.38635 5.57640 5.51096 5.2134 4.50857 4.29606 4.02713 3.85259 3.50995
Proportion of Variance 0.5151 0.08628 0.07969 0.03955 0.02838 0.02386 0.02191 0.02131 0.01625 0.01587 0.0142 0.01062 0.00964 0.00847 0.00775 0.00644
Cumulative Proportion 0.5151 0.60138 0.68107 0.72062 0.74900 0.77286 0.79477 0.81608 0.83233 0.84819 0.8624 0.87302 0.88266 0.89113 0.89889 0.90532
PC17 PC18 PC19 PC20 PC21 PC22 PC23 PC24 PC25 PC26 PC27 PC28 PC29 PC30 PC31 PC32
Standard deviation 3.43901 3.17394 3.14350 3.04224 2.91319 2.87702 2.80798 2.71491 2.57654 2.49217 2.41036 2.39006 2.23801 2.15965 2.11663 2.03740
Proportion of Variance 0.00618 0.00526 0.00516 0.00484 0.00443 0.00432 0.00412 0.00385 0.00347 0.00324 0.00304 0.00298 0.00262 0.00244 0.00234 0.00217
Cumulative Proportion 0.91150 0.91676 0.92193 0.92676 0.93120 0.93552 0.93964 0.94349 0.94696 0.95021 0.95324 0.95623 0.95884 0.96128 0.96362 0.96579
PC33 PC34 PC35 PC36 PC37 PC38 PC39 PC40 PC41 PC42 PC43 PC44 PC45 PC46 PC47 PC48
Standard deviation 2.0064 1.92025 1.85173 1.82552 1.76628 1.70281 1.6396 1.56201 1.48288 1.45666 1.43162 1.37913 1.34851 1.30427 1.27947 1.24312
Proportion of Variance 0.0021 0.00193 0.00179 0.00174 0.00163 0.00151 0.0014 0.00127 0.00115 0.00111 0.00107 0.00099 0.00095 0.00089 0.00086 0.00081
Cumulative Proportion 0.9679 0.96982 0.97161 0.97335 0.97498 0.97650 0.9779 0.97918 0.98032 0.98143 0.98250 0.98350 0.98445 0.98534 0.98619 0.98700
PC49 PC50 PC51 PC52 PC53 PC54 PC55 PC56 PC57 PC58 PC59 PC60 PC61 PC62 PC63 PC64
Standard deviation 1.20682 1.13871 1.09202 1.08727 1.06053 1.02809 1.00146 0.95743 0.92966 0.8798 0.86526 0.84728 0.83606 0.81846 0.79825 0.78372
Proportion of Variance 0.00076 0.00068 0.00062 0.00062 0.00059 0.00055 0.00052 0.00048 0.00045 0.0004 0.00039 0.00038 0.00037 0.00035 0.00033 0.00032
Cumulative Proportion 0.98776 0.98844 0.98906 0.98968 0.99027 0.99082 0.99134 0.99182 0.99227 0.9927 0.99307 0.99344 0.99381 0.99416 0.99449 0.99481
PC65 PC66 PC67 PC68 PC69 PC70 PC71 PC72 PC73 PC74 PC75 PC76 PC77 PC78 PC79 PC80
Standard deviation 0.76540 0.73324 0.70956 0.69792 0.68822 0.67484 0.65748 0.64724 0.62961 0.6237 0.58493 0.56889 0.55998 0.53690 0.52693 0.51915
Proportion of Variance 0.00031 0.00028 0.00026 0.00025 0.00025 0.00024 0.00023 0.00022 0.00021 0.0002 0.00018 0.00017 0.00016 0.00015 0.00015 0.00014
Cumulative Proportion 0.99512 0.99540 0.99566 0.99592 0.99616 0.99640 0.99663 0.99685 0.99705 0.9973 0.99744 0.99760 0.99777 0.99792 0.99806 0.99820
PC81 PC82 PC83 PC84 PC85 PC86 PC87 PC88 PC89 PC90 PC91 PC92 PC93 PC94 PC95 PC96
Standard deviation 0.49638 0.48909 0.48244 0.47718 0.46388 0.46104 0.42638 0.41543 0.40873 0.37717 0.37270 0.35784 0.34475 0.33545 0.32007 0.31214
Proportion of Variance 0.00013 0.00012 0.00012 0.00012 0.00011 0.00011 0.00009 0.00009 0.00009 0.00007 0.00007 0.00007 0.00006 0.00006 0.00005 0.00005
Cumulative Proportion 0.99833 0.99846 0.99858 0.99870 0.99881 0.99892 0.99902 0.99911 0.99920 0.99927 0.99934 0.99941 0.99947 0.99953 0.99958 0.99963
PC97 PC98 PC99 PC100 PC101 PC102 PC103 PC104 PC105 PC106 PC107 PC108 PC109 PC110 PC111 PC112
Standard deviation 0.29791 0.27967 0.27474 0.27118 0.24296 0.23394 0.21947 0.21312 0.20249 0.19064 0.17617 0.15936 0.13919 0.12900 0.08422 0.0007454
Proportion of Variance 0.00005 0.00004 0.00004 0.00004 0.00003 0.00003 0.00003 0.00002 0.00002 0.00002 0.00002 0.00001 0.00001 0.00001 0.00000 0.0000000
Cumulative Proportion 0.99968 0.99972 0.99976 0.99980 0.99983 0.99986 0.99988 0.99991 0.99993 0.99995 0.99996 0.99998 0.99999 1.00000 1.00000 1.0000000
PC113 PC114 PC115 PC116 PC117 PC118 PC119 PC120 PC121 PC122 PC123 PC124 PC125
Standard deviation 7.069e-14 9.439e-15 3.652e-15 3.319e-15 2.267e-15 2.267e-15 2.267e-15 2.267e-15 2.267e-15 2.267e-15 2.267e-15 2.267e-15 2.267e-15
Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
PC126 PC127 PC128 PC129 PC130 PC131
Standard deviation 2.267e-15 2.267e-15 2.267e-15 2.267e-15 2.267e-15 1.768e-15
Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
var_explained_2D = `2D_PAG_Original_output2_PCA`$sdev^2 / sum(`2D_PAG_Original_output2_PCA`$sdev^2)
barplot(100*var_explained_2D, xlab = '', ylab = '% variance explained')
For 3D descriptors - cummulative explained variance of 58.7% for PC1 and PC2, and 65.9% for PC1,2 & 3
For 2D descriptors - cummulative explained variance of 60.1% for PC1 and PC2, and 68.1% for PC1,2 & 3
Can we plot the 3D version, showing PC1, 2 & 3 together? scatterplot3d is a package that would do this - does ggplot2 do 3D plots?
Use PC1,2 & 3 in a model for Log(1/EC50) Use PC2Diii and PC3Diii as these contain the Log(1/ec50) values along with the PCs generated in the PCA.
#2D model
fit_2D_PCA <- lm(formula = Log.1.EC50. ~ PC1 + PC2 + PC3, data=PC2Diii) #fits the model
summary(fit_2D_PCA) # summary statistics
Call:
lm(formula = Log.1.EC50. ~ PC1 + PC2 + PC3, data = PC2Diii)
Residuals:
Min 1Q Median 3Q Max
-1.9409 -0.5209 -0.0368 0.4921 2.7478
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.536152 0.092869 5.773 1.02e-07 ***
PC1 0.011330 0.003131 3.618 0.000482 ***
PC2 -0.005532 0.006986 -0.792 0.430493
PC3 0.010618 0.007472 1.421 0.158623
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9069 on 93 degrees of freedom
(34 observations deleted due to missingness)
Multiple R-squared: 0.1498, Adjusted R-squared: 0.1224
F-statistic: 5.463 on 3 and 93 DF, p-value: 0.001674
This doesn’t produce a good model. R2 of only 0.1498 and R2adj of 0.1224, additionally only the PC1 variable appears to be statistically significant.
Plot a predicted against actual
Log1EC50_noNA<- na.omit(PC2Diii$Log.1.EC50.)
plot(Log1EC50_noNA, fitted(fit_2D_PCA))
abline(lm(formula = Log.1.EC50. ~ PC1 + PC2 + PC3, data=PC2Diii))
only using the first two of 4 regression coefficients
#2D model
fit_3D_PCA <- lm(formula = Log.1.EC50. ~ PC1 + PC2 + PC3, data=PC3Diii) #fits the model
summary(fit_3D_PCA) # summary statistics
Call:
lm(formula = Log.1.EC50. ~ PC1 + PC2 + PC3, data = PC3Diii)
Residuals:
Min 1Q Median 3Q Max
-1.92300 -0.54886 -0.07387 0.51816 2.75523
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.540878 0.092366 5.856 7.1e-08 ***
PC1 0.008644 0.002601 3.324 0.00127 **
PC2 -0.008609 0.004795 -1.795 0.07587 .
PC3 0.010095 0.006373 1.584 0.11656
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9027 on 93 degrees of freedom
(34 observations deleted due to missingness)
Multiple R-squared: 0.1578, Adjusted R-squared: 0.1307
F-statistic: 5.81 on 3 and 93 DF, p-value: 0.001103
This also doesn’t produce a good model. R2 of only 0.1578 and R2adj of 0.1307, additionally only the PC1 variable appears to be statistically significant.
Plot a predicted against actual
#Log1EC50_noNA<- na.omit(PC3Diii$Log.1.EC50.)
plot(Log1EC50_noNA, fitted(fit_3D_PCA))
abline(lm(formula = Log.1.EC50. ~ PC1 + PC2 + PC3, data=PC3Diii))
only using the first two of 4 regression coefficients