READ ME File For 'CSP-generated crystal structures of 1,000+ rigid organic molecules'

DOI: 10.5258/SOTON/D3094

Date that the file was created: May 2024

-------------------
GENERAL INFORMATION
-------------------

ReadMe Author: Christopher Taylor, University of Southampton [ORCID: 0000-0001-9465-5742]

Date of data collection: 2022-2024

Information about geographic location of data collection: University of Southampton, UK

--------------------------
SHARING/ACCESS INFORMATION
--------------------------

Licenses/restrictions placed on the data, or limitations of reuse: CC-BY

Recommended citation for the data:

This dataset supports the publication:
AUTHORS: Christopher R. Taylor, Patrick W. V. Butler, Graeme M. Day
TITLE: Predictive crystallography at scale: mapping, validating, and learning from 1,000 crystal energy landscapes
JOURNAL: Faraday Discussions
PAPER DOI IF KNOWN:

--------------------
DATA & FILE OVERVIEW
--------------------

This dataset contains:

3 ZIP archives containing the crystal structure prediction (CSP) landscapes of 1007 unique molecules.
- Within each archive, each CSP landscape is named by its 6-letter CCDC refcode stem and has its own directory containing a comma-separated values (.CSV) file of the lattice energies and densities of its structures.
- The first archive (FIT-DMA_Landscapes) contains the landscapes optimised at the FIT+DMA (i.e. empirical force field and electrostatic multipole) level of theory.
- The first archive (FIT-DMA_Landscapes) additionally contains Crystallographic Information File (CIF) files, one in each directory, each consolidating all the crystal structures described on that landscape.
- The second archive (Delta-ML_Landscapes) contains the landscapes that were re-optimised using our neural-network potential (NNP).
- It also contains two text files indicating which landscapes were used for training the NNP model, and which landscapes the NNP model was extrapolated to.
- [The second archive, containing only energetic re-rankings of the original structures, requires no new CIF files.]
- The third archive (MACE_Reopt_Landscapes) contains the subset of CSP landscapes that were re-optimised using the machine-learned MACE model.

2 ZIP archives, each containing a machine-learned (ML) lattice energy model:

NNP_correction:
- Contains the committee neural networks trained to correct FIT+DMA lattice energies to B86bPBE+XDM lattice energies.
- There are individual directories for each iteration of active learning, with iteration 1 being the model trained on only randomly sampled structures.
- Within each iteration directory there is a 'committee' directory containing the parameter files for each member of the committee, with the parameters for each member collected in separate subdirectories labelled NN1 to NN8.
- The parameter files are in the n2p2 format and are called weights.XXX.data, where XXX is the atomic number.
- Additionally, the parent iteration directory contains 'input.nn' and 'scaling.data' files. These are n2p2 input files that define, respectively, the neural network settings and symmetry functions used, and the scaling applied to the symmetry functions.

MACE_total_energy:
- Contains the MACE model trained on PBE+D3 total energy data and used for geometry optimisations of crystal structures.

We also include, as a separate file, a comma-separated values (CSV) file, ranks_of_expt_matches.csv, listing all matches to the experimentally observed crystal structures of these molecules, together with their rankings.

Finally, we include several text (TXT) files: lists of all unique molecules according to their CSD Refcode stem, files listing which molecules are chiral and which are non-chiral, and a file listing those chiral molecules observed to crystallise in Sohncke (enantiopure) space groups in the CSD.
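
As an illustration of the NNP_correction layout described above, the following Python sketch gathers the n2p2 files belonging to one active-learning iteration. It is only a sketch under stated assumptions: the function name collect_committee_files is hypothetical, and the names of the iteration directories are not specified in this ReadMe, so the path passed in must be taken from the unpacked archive.

```python
import re
from pathlib import Path

# Matches the n2p2 parameter files named weights.XXX.data (XXX = atomic number).
WEIGHTS_PATTERN = re.compile(r"^weights\.(\d+)\.data$")


def collect_committee_files(iteration_dir):
    """Return (shared, members) for one active-learning iteration directory.

    shared  maps 'input.nn' and 'scaling.data' to their paths.
    members maps each committee member directory name (e.g. 'NN1') to a
            dict of {atomic number: path to weights file}.
    """
    iteration_dir = Path(iteration_dir)

    # The settings and scaling files sit in the iteration directory itself.
    shared = {name: iteration_dir / name for name in ("input.nn", "scaling.data")}
    missing = [name for name, path in shared.items() if not path.is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {iteration_dir}: {missing}")

    # Each committee member has its own subdirectory under 'committee'.
    members = {}
    for member_dir in sorted((iteration_dir / "committee").glob("NN*")):
        if not member_dir.is_dir():
            continue
        weights = {}
        for path in member_dir.iterdir():
            match = WEIGHTS_PATTERN.match(path.name)
            if match:
                weights[int(match.group(1))] = path
        members[member_dir.name] = weights
    return shared, members
```

Calling collect_committee_files on an unpacked iteration directory returns the two shared files and, for each committee member, its weights files keyed by atomic number. A typical use would be to copy one member's weights alongside input.nn and scaling.data into a working directory before running n2p2.
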
Relationship between files, if important for context:

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

Description of methods used for collection/generation of data: See accompanying publication.

Methods for processing the data: Post-processed using in-house Python code to generate CIF files and CSV files of energy rankings.

Software- or Instrument-specific information needed to interpret the data, including software and hardware version numbers:
- Using the NNP correction model files requires the n2p2 package, available at: https://github.com/CompPhysVienna/n2p2
- The MACE total energy model is in the PyTorch format and can be used via the MACE Python package, available at: https://github.com/ACEsuit/mace
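
The landscape CSV files need no special software and can be read with the Python standard library. The sketch below is not the in-house post-processing code. It is a minimal example that sorts one landscape by lattice energy and adds each structure's rank and its energy relative to the lowest-energy structure. The function name load_ranked_landscape is hypothetical, and because the column headers are not listed in this ReadMe, the name of the energy column is passed in as an argument and should be checked against the CSV header.

```python
import csv


def load_ranked_landscape(csv_file, energy_column):
    """Read a landscape CSV from an open text file and rank it by energy.

    energy_column is the header of the lattice energy column (an assumption
    to be checked against the actual file). Rows are returned as dicts,
    sorted from lowest to highest lattice energy, with two added keys:
    'rank' (1 = lowest energy) and 'relative_energy' (energy minus the
    lowest energy on the landscape, in the same units as the input).
    """
    rows = list(csv.DictReader(csv_file))
    if not rows:
        return []

    # Convert energies from text to numbers before sorting.
    for row in rows:
        row[energy_column] = float(row[energy_column])
    rows.sort(key=lambda row: row[energy_column])

    lowest = rows[0][energy_column]
    for rank, row in enumerate(rows, start=1):
        row["rank"] = rank
        row["relative_energy"] = row[energy_column] - lowest
    return rows
```

To use it, open one landscape CSV in text mode (for example with open(path, newline="")) and pass the file object together with the energy column header. All other columns, such as the densities, are kept as the strings read from the file.
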