READ ME File For 'Exploring chemical space for computational materials discovery'

DOI: 10.5258/SOTON/PG/D274

Date that the file was created: May 2026

-------------------
GENERAL INFORMATION
-------------------

ReadMe Author: Jay Johal, University of Southampton [ORCID: 0000-0001-8489-4803]

Date of data collection: 2021-2025

Information about geographic location of data collection: University of Southampton, UK

--------------------------
SHARING/ACCESS INFORMATION
--------------------------

Licenses/restrictions placed on the data, or limitations of reuse: CC-BY

Recommended citation for the data:

This dataset supports the thesis:
AUTHORS: Jay Johal
TITLE: Exploring chemical space for computational materials discovery
THESIS DOI: 10.5258/SOTON/PG/T274

--------------------
DATA & FILE OVERVIEW
--------------------

This dataset contains:

Chapter4.zip
- "Porous_molecule_CSP_porosity_data" folder containing, for each molecule used for benchmarking the porous sub-sampling schemes:
    - A *structures.csv file, containing information on each of the predicted crystal structures. The headers and units are shown below:
      crystal id (id) | spacegroup | density [g/cm^3] | energy [kJ/mol] | minimization_step | quasi-random seed (trial_number) | minimization_time (s)
    - A *.csv file, containing information on each of the predicted crystal structures' porosity metrics.
      The headers and units are shown below:
      id | Largest Included Sphere (Å) | Largest Free Sphere (Å) | Surface Area (m^2/cm^3)
- *_sampling_evaluations.csv files
  Summaries of the effectiveness of different CSP sub-sampling schemes to recover the CSP landscapes of more complete CSP samplings:
    - QR_vs_QRBH_sampling_evaluations.csv
    - Porous_molecule_sampling_evaluations.csv
    - Porous_molecule_leading_edge_sampling_evaluations.csv

(Further data for Chapter 4 and Chapter 5 of the thesis may be found in the associated Dataset: Johal, Jay and Day, Graeme (2025) CSP-EA sampled molecules and crystal structures, with associated mobilities. University of Southampton doi:10.5258/SOTON/D3613)

Chapter6.zip
- "Multi_objective_config_files" folder contains the final config (.ini) files for each CSP-GA search performed in Chapter 6 ("Organic Semiconductors") of the thesis. Contained within each file are all the settings to set up the calculations, as well as the molecules sampled within each generation.
- Summary .csv files for all molecules sampled in the CSP-GAs on the OCELOT chemical space:
    - Ocelot_mobility_avg_data.csv (For the CSP-GA searches only targeting high mobility candidate molecules)
    - Ocelot_multi-objective_data.csv (For the CSP-GA searches targeting high mobility and low electron affinity candidate molecules)
  Each has headers:
  SMILES | electron mobility (cm^2/Vs) | electron affinity (eV)

Chapter7.zip
- Contains the final config (.ini) files for each CSP-GA search performed in Chapter 7 ("Improving GA Efficiency") of the thesis, as part of parameterisation and multi-fidelity testing. Contained within each file are all the settings to set up the calculations, as well as the molecules sampled within each generation.
- Data is organised into 2 sub-folders, with each divided further into each parameter/method tested:
    > Parameterisation (5 repeats were performed for each test):
        - Elitism Method
        - Elitism Value
        - Population Size
        - Duplicate Handling
    > Multi-fidelity Methods (3 repeats were performed for each test):
        - Continuation CSP-GA
        - Course Correction CSP-GA (Every 5 generations)
        - Course Correction CSP-GA (Every 10 generations)
        - Increase Sampling per Generation CSP-GA
        - Increase Sampling per hit CSP-GA
        - Island Champions CSP-GA

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

Description of methods used for collection/generation of data: See accompanying thesis.

Methods for processing the data: Post-processed using in-house Python code.
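
Example of combining the Chapter 4 files: the in-house post-processing code is not part of this dataset, so the following is only a minimal sketch of how the per-molecule structures and porosity tables in Chapter4.zip might be joined on the shared crystal id. It assumes comma-separated files whose id column is literally named "id"; the other column names, the function name join_on_id and the values in the two small tables are made-up placeholders, not data from the dataset.

```python
import csv
import io


def join_on_id(structures_file, porosity_file, key="id"):
    """Merge rows from two CSV files that share a crystal-id column."""
    # Index the porosity rows by crystal id for constant-time lookup.
    porosity = {row[key]: row for row in csv.DictReader(porosity_file)}
    merged = []
    for row in csv.DictReader(structures_file):
        extra = porosity.get(row[key])
        if extra is None:
            continue  # skip structures that have no porosity entry
        merged.append({**row, **extra})
    return merged


# Placeholder tables standing in for the real files (values are invented,
# and the header spellings are assumptions about the CSV layout).
structures_csv = io.StringIO(
    "id,spacegroup,density,energy\n"
    "a1,14,1.10,-120.5\n"
    "a2,2,0.85,-115.0\n"
)
porosity_csv = io.StringIO(
    "id,Largest Included Sphere,Largest Free Sphere,Surface Area\n"
    "a1,3.2,1.1,0.0\n"
    "a2,6.8,4.5,1500.0\n"
)

rows = join_on_id(structures_csv, porosity_csv)

# Example query: pick the lowest-density structure and report its pore size.
lowest_density = min(rows, key=lambda r: float(r["density"]))
print(lowest_density["id"], lowest_density["Largest Free Sphere"])
```

With the real data, the two io.StringIO objects would be replaced by files opened with open(path, newline=""), and the column names adjusted to match the actual headers.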