READ ME File For 'Exploring chemical space for computational materials discovery'

DOI: 10.5258/SOTON/PG/D274

Date that the file was created: May 2026

-------------------
GENERAL INFORMATION
-------------------

ReadMe Author: Jay Johal, University of Southampton [ORCID: 0000-0001-8489-4803]

Date of data collection: 2021-2025

Information about geographic location of data collection: University of Southampton, UK

--------------------------
SHARING/ACCESS INFORMATION
--------------------------

Licenses/restrictions placed on the data, or limitations of reuse: CC-BY

Recommended citation for the data: 

This dataset supports the thesis:
AUTHORS: Jay Johal
TITLE: Exploring chemical space for computational materials discovery
THESIS DOI: 10.5258/SOTON/PG/T274

--------------------
DATA & FILE OVERVIEW
--------------------

This dataset contains:

Chapter4.zip
    - "Porous_molecule_CSP_porosity_data" folder containing, for each molecule used to benchmark the porous sub-sampling schemes:
        - A *structures.csv file, containing information on each of the predicted crystal structures. The headers and units are shown below:
          crystal id (id) | spacegroup | density [g/cm^3] | energy [kJ/mol] | minimization_step | quasi-random seed (trial_number) | minimization_time [s]

        - A *.csv file, containing the porosity metrics of each predicted crystal structure. The headers and units are shown below:
          id | Largest Included Sphere (Å) | Largest Free Sphere (Å) | Surface Area (m^2/cm^3)

    - *_sampling_evaluations.csv files, summarising how effectively different CSP sub-sampling schemes recover the CSP landscapes of more complete CSP samplings:
        - QR_vs_QRBH_sampling_evaluations.csv
        - Porous_molecule_sampling_evaluations.csv
        - Porous_molecule_leading_edge_sampling_evaluations.csv
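
The two per-molecule CSV files described above share a crystal id, so porosity metrics can be attached to each predicted structure by joining on that column. The following is a minimal Python sketch, not the in-house code used for the thesis. It assumes the files are comma-delimited and that the shared column is literally named "id"; the helper names (read_rows, join_on_id) are hypothetical, and the rows shown are synthetic placeholders, not values from the dataset.

```python
import csv
import io


def read_rows(handle):
    """Read an open CSV file handle into a list of dicts keyed by header."""
    return list(csv.DictReader(handle))


def join_on_id(structures, porosity, key="id"):
    """Attach porosity metrics to each structure row that shares the same id.

    Structures without a matching porosity row are skipped.
    """
    by_id = {row[key]: row for row in porosity}
    merged = []
    for row in structures:
        match = by_id.get(row[key])
        if match is not None:
            merged.append({**row, **match})
    return merged


# Synthetic placeholder rows (NOT values from the dataset); the column
# names are assumed from the header lists in this README.
structures_csv = io.StringIO(
    "id,spacegroup,density,energy\n"
    "a,14,1.20,-100.0\n"
    "b,2,1.10,-98.5\n"
)
porosity_csv = io.StringIO(
    "id,Largest Included Sphere,Largest Free Sphere,Surface Area\n"
    "a,4.5,3.0,250.0\n"
)

merged = join_on_id(read_rows(structures_csv), read_rows(porosity_csv))
print(merged)
```

To use real files, replace the io.StringIO objects with open(path, newline="") handles. Values are read as strings and need converting with float() before any numerical analysis.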

(Further data for Chapter 4 and Chapter 5 of the thesis may be found in the associated Dataset: Johal, Jay and Day, Graeme (2025) CSP-EA sampled molecules and crystal structures, with associated mobilities. University of Southampton doi:10.5258/SOTON/D3613)

Chapter6.zip
    - "Multi_objective_config_files" folder containing the final config (.ini) files for each CSP-GA search performed in Chapter 6 ("Organic Semiconductors") of the thesis. Each file contains all the settings needed to set up the calculations, as well as the molecules sampled in each generation.
    - Summary .csv files for all molecules sampled in the CSP-GAs on the OCELOT chemical space:
        - Ocelot_mobility_avg_data.csv (for the CSP-GA searches targeting only high-mobility candidate molecules)
        - Ocelot_multi-objective_data.csv (for the CSP-GA searches targeting candidate molecules with high mobility and low electron affinity)
      Each file has the headers:
      SMILES | electron mobility (cm^2/Vs) | electron affinity (eV)
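
Because the multi-objective searches target high mobility together with low electron affinity, one way to inspect the summary data is to extract the non-dominated (Pareto-optimal) molecules. The sketch below is illustrative only and is not taken from the thesis code. The dictionary keys ("smiles", "mobility", "affinity") are hypothetical names, "low" is taken to mean numerically smaller, and the rows are synthetic placeholders rather than dataset values.

```python
def pareto_front(rows):
    """Return the rows that no other row dominates.

    One row dominates another if it has mobility at least as high AND
    electron affinity at least as low, with a strict improvement in at
    least one of the two.
    """
    front = []
    for cand in rows:
        dominated = any(
            other["mobility"] >= cand["mobility"]
            and other["affinity"] <= cand["affinity"]
            and (
                other["mobility"] > cand["mobility"]
                or other["affinity"] < cand["affinity"]
            )
            for other in rows
        )
        if not dominated:
            front.append(cand)
    return front


# Synthetic placeholder rows (NOT values from the dataset).
rows = [
    {"smiles": "C", "mobility": 1.0, "affinity": 1.0},
    {"smiles": "CC", "mobility": 2.0, "affinity": 2.0},
    {"smiles": "CCC", "mobility": 0.5, "affinity": 1.5},
]

front = pareto_front(rows)
print([row["smiles"] for row in front])
```

The pairwise comparison is quadratic in the number of rows, which is simple to read but may be slow for very large tables.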

Chapter7.zip
    - Contains the final config (.ini) files for each CSP-GA search performed in Chapter 7 ("Improving GA Efficiency") of the thesis, as part of parameterisation and multi-fidelity testing. Each file contains all the settings needed to set up the calculations, as well as the molecules sampled in each generation.
    - The data are organised into two sub-folders, each divided further by the parameter/method tested:
	> Parameterisation (5 repeats were performed for each test):
		- Elitism Method
		- Elitism Value
		- Population Size
		- Duplicate Handling
	> Multi-fidelity Methods (3 repeats were performed for each test):
		- Continuation CSP-GA
		- Course Correction CSP-GA (Every 5 generations)
		- Course Correction CSP-GA (Every 10 generations)
		- Increase Sampling per Generation CSP-GA
		- Increase Sampling per hit CSP-GA
		- Island Champions CSP-GA
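
The config (.ini) files can be inspected programmatically to see which sections and settings they hold. The sketch below uses Python's standard configparser module and assumes the files follow conventional INI syntax, which this README does not confirm. The section and key names in the sample text are hypothetical placeholders, not the actual layout of the CSP-GA config files.

```python
import configparser


def summarise_ini(text):
    """Map each section name in an INI-formatted string to its sorted keys."""
    # interpolation=None keeps '%' characters literal; strict=False
    # tolerates repeated sections or keys instead of raising an error.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(text)
    return {
        section: sorted(parser[section].keys())
        for section in parser.sections()
    }


# Hypothetical sample content (NOT copied from the dataset).
sample = """
[settings]
population_size = 10

[generation_0]
molecules = C, CC
"""

summary = summarise_ini(sample)
print(summary)
```

For a file on disk, read it first, for example with pathlib.Path(path).read_text(), and pass the result to summarise_ini. Note that configparser lower-cases key names by default.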

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

Description of methods used for collection/generation of data: See accompanying thesis.

Methods for processing the data: Post-processed using in-house Python code.