README file for 'Data associated with the PhD thesis "The Topological and Geometric Analysis of Organic Crystal systems"'

Dataset DOI: 10.5258/SOTON/D3240

ReadMe Author: J.R. Doyle, University of Southampton

This dataset supports the thesis entitled The Topological and Geometric Analysis of Organic Crystal systems

AWARDED BY: University of Southampton

DATE OF AWARD: 2024

DESCRIPTION OF THE DATA

This dataset contains:

**The files JD_thesis_figures.zip and JD_thesis_emd_csp_data.zip are available on request due to their large size. These may be added to the Pure record in future when a bug in the software is fixed.**

This is the data associated with the PhD thesis "The Topological and Geometric Analysis of Organic Crystal systems". There are seven zip files enclosed. Five pertain to the data used within the research. The other two contain, respectively, the figures used in the PhD thesis and the most important Python scripts used in the research. A brief description of each dataset is provided below.

- JD_thesis_scripts.zip contains the most important Python scripts used in this research. It was not possible to provide code that carries out all the calculations used in my work, as much of this was carried out in Jupyter notebooks. The notebooks themselves are extensive and quite nonlinear in nature, so they would be difficult to follow and are not provided. Instead I provide some of the most important workhorse scripts used in my work, as well as guidance on how to carry out the persistent homology calculations needed to reproduce the figures in my thesis. Within the zip file there is another markdown file that explains what each script does and how the scripts can be used in conjunction with one another, together with small snippets of Python code that use the gudhi library to obtain the desired results.

- JD_thesis_pah_exp_data contains the smaller sets of polyaromatic hydrocarbon data obtained from the CSD. There are two datasets, labelled set_1 and set_2.
Set_1 is a small set of 28 polyaromatic hydrocarbons obtained manually from the CSD, following the structures and packing types described in the work of Desiraju et al. (https://pubs.rsc.org/en/content/articlelanding/1989/c3/c39890000621). The second set, set_2, contains 172 experimental structures which were described by Loveland et al. (https://pubmed.ncbi.nlm.nih.gov/33245232/). The cif files are provided and partitioned by the packing labels given in the two papers.

- JD_thesis_gazces_data contains the dataset pertaining to the nicotinamide:benzoic acid co-crystal system. The structures are labelled according to the predicted funnel on the associated potential energy surface. The funnels are labelled "0" and "1" and the data are separated accordingly.

- JD_thesis_pah_aza_csp_data contains the predicted crystal structures for a set of azapentacenes and polyaromatic hydrocarbons with predicted packing labels. As discussed in the thesis, there are two datasets. The first, called old_method, uses older CSP code and labels from the algorithm of Campbell et al. (https://pubs.rsc.org/en/content/articlelanding/2017/tc/c7tc02553j). The second, called new_method, uses updated CSP code and labels from the Autopack algorithm of Loveland et al. (https://pubmed.ncbi.nlm.nih.gov/33245232/). Again there are sets of cif files partitioned by predicted label (or NA/other if a label was not or could not be found). The associated energies, which were used for prediction and classification tasks in the PhD thesis, are provided in the CSP metadata for each compound, which is given as a csv file.

- JD_thesis_emd_exp_data contains the set of fluorinated benzylideneanilines (fluoroalanines) with packing labels found by Dodd et al. The cif files are labelled according to two packing schemes.
Packing scheme 1 involves geometric inspection by eye (visual packing scheme), while packing scheme 2 involves a geometric argument based on interplanar angles, intercentroid distances and space groups (geometric packing scheme). More information can be found in the PhD thesis of Eleanor Dodd (https://eprints.soton.ac.uk/447443/).

- JD_thesis_figures contains all figures used in the compilation of the PhD thesis, partitioned by the dataset to which they correspond. In addition there is a large set of crystal landscapes of the predicted fluorinated benzylideneaniline structures, obtained using a variety of different dimensionality reduction algorithms. As there were 92 of these landscapes (for each dimensionality reduction routine), only four are shown in the PhD thesis itself and the rest are merely alluded to. They can all be found here. The figures that describe the energy distribution of the crystal structure landscape by packing class can also be found here. There are also figures in which the landscapes are labelled according to density, as opposed to energy, which are not included in the thesis. All of these figures are found in the landscape subdirectory of the emd_structures directory. The subset of these figures used in the PhD thesis is in the examples folder. There is also another readme file that explains the labelling conventions of those figures not included in the thesis. This file is available on request due to its large size but may be added to the record in future.

- JD_thesis_emd_csp_data contains the cif files for each of the 92 crystal structure landscapes found for the fluoroalanines, labelled by the compound for which the crystal structure landscape was generated. For each landscape there is a set of cif files and a csv file containing all the metadata from the CSP process, including the predicted energy and density.
This file is available on request due to its large size but may be added to this record in the future.

Date of data collection: 2020-2024

Licence: CC-BY

Date that the file was created: Sept, 2024

--------------
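
Several of the archives above are described as sets of cif files partitioned by packing label (or by funnel label for the co-crystal data). The following is a minimal sketch of how such labels might be recovered with the Python standard library. It assumes that the partitioning takes the form of one subdirectory per label, which may not match the actual layout inside each zip file. The function name `collect_cif_labels` and the directory and file names in the demo are hypothetical and are not part of the dataset.

```python
from collections import defaultdict
from pathlib import Path


def collect_cif_labels(root):
    """Map each label directory name to the sorted cif file names inside it.

    Assumption (not confirmed by this README): every cif file sits in a
    directory named after its label, e.g. <root>/<label>/<structure>.cif.
    A cif file placed directly in <root> is labelled with the name of
    <root> itself.
    """
    labels = defaultdict(list)
    for cif_path in Path(root).rglob("*.cif"):
        # The immediate parent directory name is taken as the label.
        labels[cif_path.parent.name].append(cif_path.name)
    return {label: sorted(names) for label, names in labels.items()}


if __name__ == "__main__":
    import tempfile

    # Build a small throwaway tree with made-up labels and file names.
    demo_files = [("label_a", "s1.cif"), ("label_a", "s2.cif"), ("label_b", "s3.cif")]
    with tempfile.TemporaryDirectory() as tmp:
        for label, name in demo_files:
            folder = Path(tmp) / label
            folder.mkdir(exist_ok=True)
            (folder / name).write_text("data_placeholder\n")
        print(collect_cif_labels(tmp))
```

The resulting dictionary can then be joined with the per-compound csv metadata, where provided, once the column names in those files have been checked.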