READ ME File for 'Machine Learning Methods for Analysis of Organic Molecular Crystal Structure Prediction Landscapes - associated data'

Dataset DOI: https://doi.org/10.5258/SOTON/D3723

READ ME author: Jennifer Martin, University of Southampton, ORCID ID: 0009-0004-0343-6309


This dataset supports the thesis entitled 'Machine Learning Methods for Analysis of Organic Molecular Crystal Structure Prediction Landscapes'

AWARDED BY: University of Southampton


Date of data collection: 09/2021 - 10/2025


Geographic location of data collection: University of Southampton, Southampton


License: Files are licensed individually. Chapter_3.zip data is licensed under GNU GPL. All other files are licensed under CC-BY.


Related Funders: Funding provided by the Leverhulme Trust via the Leverhulme Research Centre for Functional Materials Design

----------------------
Data and File Overview
----------------------

The dataset is split into one zip file per chapter of the thesis, each containing the key data associated with that chapter. An overview is provided here; see the additional READ ME files and the associated thesis for more details.

Chapter_3.zip - Three Python scripts to reindex structures, calculate kernels and construct the Generalised Convex Hull (GCH). Some content in these scripts is adapted from open-source codes (the original GCH libraries: https://github.com/andreanelli/GCH and the DirectionalConvexHull code in: https://github.com/scikit-learn-contrib/scikit-matter/blob/main/src/skmatter/sample_selection/_base.py), as cited in the scripts.
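
As a rough illustration of the idea behind a convex hull on a descriptor-energy landscape, the sketch below finds the lower hull in two dimensions (one descriptor plus energy). It is not taken from the scripts in Chapter_3.zip: the function names and the toy data are hypothetical, and the scripts themselves should be consulted for the actual GCH construction.

```python
def _cross(o, a, b):
    """z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def lower_hull(points):
    """Return the lower convex hull vertices of (descriptor, energy) points.

    Hypothetical helper: keeps the lowest energy for each descriptor value,
    then applies the lower half of Andrew's monotone chain algorithm.
    """
    lowest = {}
    for x, energy in points:
        if x not in lowest or energy < lowest[x]:
            lowest[x] = energy
    hull = []
    for p in sorted(lowest.items()):
        # Drop the previous vertex while it lies on or above the new edge.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull


# Toy (descriptor, energy) values, invented for illustration only.
toy_points = [(0.0, 1.0), (1.0, 0.0), (2.0, 0.8), (3.0, 0.2), (4.0, 1.5), (2.0, 2.0)]
hull_vertices = lower_hull(toy_points)
```

Points that are hull vertices are those that no mixture of other points can undercut in energy; in this toy set the two points at descriptor value 2.0 are excluded.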

Chapter_4.zip - Key Crystal Structure Prediction (CSP) data on the systems DAP, primidone, CL-20, NTCDA, PTCDA, MePTCDI and MeNTCDI, including final clustered structure sets (zip files), .csv files of energy and density, and .txt files of experimental matches, where available, for the predictions performed as part of the work. Files are sorted by system. For primidone and CL-20, key structure data is available at different levels of theory (in separate directories). Data for testing duplicate removal thresholds (e.g. COMPACK match files and PXRD comparison data) is provided in its own subdirectory.

Chapter_5.zip - Key data on the efficiency of the different GCH constructions in identifying stabilisable structures, including candidate pools, eigenspectral/cumulative variance arrays to explore dimensionality, and shaken candidate pools. Data is available for each explored system, kernel type and cut-off radius. Candidate pools and shaken candidate pools are in one directory, eigenspectra and variance arrays in another, and the final subdirectory (size_testing) gives shaken candidate pool files for T2 using different numbers of iterations.
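
For readers unfamiliar with how an eigenspectrum relates to a cumulative variance array, the following minimal sketch shows the usual calculation. It assumes the eigenvalues are available as a plain list of non-negative numbers; the function names, the example values and the 0.8 threshold are hypothetical and do not describe the file format or settings used in Chapter_5.zip.

```python
def cumulative_variance(eigenvalues):
    """Fraction of total variance captured by the first n components."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    fractions = []
    for value in ordered:
        running += value
        fractions.append(running / total)
    return fractions


def components_needed(eigenvalues, threshold):
    """Smallest number of components whose cumulative variance reaches threshold."""
    for n, fraction in enumerate(cumulative_variance(eigenvalues), start=1):
        if fraction >= threshold:
            return n
    return len(eigenvalues)


# Invented eigenvalues, for illustration only.
example_spectrum = [2.0, 4.0, 1.0, 1.0]
example_fractions = cumulative_variance(example_spectrum)
example_n = components_needed(example_spectrum, threshold=0.8)
```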

Chapter_6.zip - Key data for exploring relationships between ML descriptors and intuitive descriptors, including calculated intuitive descriptor values and relevant data (Intuitive descriptors subdirectory), values for systematic assessment of relationship strength, such as R-squared value arrays and Support Vector Classification (SVC) model errors (systematic_assessment subdirectory), and kPCA projections for the systems/kernels explored in Chapter 6 (kpca_projections_for_interpretation subdirectory).


Chapter_7.zip - Key data for machine learning of energies, including measures of prediction error and the calculated DFT energies of chlorpropamide crystal structures in the extended set, both for initial training and cross-validated data (Large_Training subdirectory), and prediction errors for chlorpropamide, target 31 and target 32 in the small studies (Small_Training subdirectory).


Chapter_8.zip - Key data for templating CSP, including final minimised analogues (zip files of res files in final_minimised_analogues subdirectory), data on matches to known structures (.txt files in polymorph_match_files subdirectory) and the target landscape (.txt files in landscape_match_files subdirectory). Nested subdirectories within each of these provide data on specific template-target pairs. There are also scripts for analogue formation (Scripts subdirectory), and .csv files for the number of analogues formed and target structures recovered, and the number of target structures, in each case (key_results subdirectory).

Supplementary READ ME files within each chapter give additional information, such as explanations of variable and file names.
Throughout the work, the name 'basic' refers to the average kernel, and 'equiv atom' or 'adapted' refers to the adapted kernel. Dimensionalities in file names count the energy as one dimension, e.g. '2 Dim' uses one ML descriptor plus the energy.
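
The dimensionality convention can be expressed as a small helper. This is only a sketch: the function name is hypothetical, and it assumes a label of the form '2 Dim' as quoted above, which may not match the exact spelling used in every file name.

```python
def descriptor_count(dim_label):
    """Number of ML descriptors implied by a label such as '2 Dim'.

    The stated dimensionality includes the energy, so one is subtracted.
    """
    total_dimensions = int(dim_label.split()[0])
    if total_dimensions < 1:
        raise ValueError("dimensionality must include at least the energy")
    return total_dimensions - 1


two_dim_descriptors = descriptor_count("2 Dim")
```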

--------------------------
Methodological Information
--------------------------

Key methodological details are provided in the associated thesis.

Date: October 2025