READ ME File For dataset in support of my doctoral thesis 'Machine learning-assisted railway simulation modelling'.

Dataset DOI: 10.5258/SOTON/D2707

ReadMe Author: Joanna Knight, University of Southampton, https://orcid.org/0000-0002-2558-2815

This dataset supports the thesis entitled 'Machine learning-assisted railway simulation modelling'
AWARDED BY: University of Southampton
DATE OF AWARD: 2023

DESCRIPTION OF THE DATA

1. Raw data: archived data from Network Rail's open data feed
2. Processed data: the 'train movements' and 'train conflicts' datasets developed in this thesis
3. Models and results from Chapters 6, 7, 8 and 9

The dataset is available 'on request' due to its size (71GB): https://library.soton.ac.uk/datarequest

This dataset contains:

===========
1. Raw data
===========

All the raw data is archived data from Network Rail's open data feed, which is covered by the Open Government Licence (https://www.networkrail.co.uk/who-we-are/transparency-and-ethics/transparency/open-data-feeds/network-rail-infrastructure-limited-data-feeds-licence/)

Schedule_data.zip
The 'full' and 'update' schedule files between 20180406 and 201210931. There is one 'update' file for most days in the time period (a few may be missing due to latency issues), and approximately one 'full' file per week. Each file is further compressed with the .gz extension. Each file is in CIF format, the industry standard: a plain text file in which each record has a fixed length of 80 characters. This is the format used in this work. For further details on how to interpret the CIF files, please refer to Appendix A of the thesis or https://wiki.openraildata.com/index.php?title=SCHEDULE

TRUST_data_2018.zip
TRUST_data_2019.zip
TURST_data_2020.zip
TRUST_data_2021.zip
The TRUST data files, zipped up by year. Each file within may be further compressed with the .tbz2 extension, depending on how the file was downloaded originally.
The TRUST files are in JSON format. For more information on their content, please refer to Appendix A of the thesis or https://wiki.openraildata.com/index.php?title=Train_Movements

TD_data_2018.zip
TD_data_2019.zip
TD_data_2020.zip
TD_data_2021.zip
The TD files are zipped up by year. Each file within is further compressed with the .tbz2 extension. The data are in JSON format. For more information on their content, please refer to Appendix A of the thesis or https://wiki.openraildata.com/index.php?title=C_Class_Messages

SMART_reference_data.zip
The SMART data file is zipped. It is a JSON file and was downloaded from Network Rail's open data feed on 20190819. For more information on its content, please refer to Appendix A of the thesis or https://wiki.openraildata.com/index.php?title=Reference_Data

CORPUS_reference_data.gz
The CORPUS data file is compressed with the .gz extension. It is a JSON file and was downloaded from Network Rail's open data feed on 20190819. For more information on its content, please refer to Appendix A of the thesis or https://wiki.openraildata.com/index.php?title=Reference_Data

Delay_Attribution.zip
This file contains all the delay attribution data CSV files, which were downloaded from Network Rail's website (currently found here: https://www.networkrail.co.uk/who-we-are/transparency-and-ethics/transparency/open-data-feeds/). This zipped file also contains documents from Network Rail's website explaining the data source and a glossary of terms - these should provide all the information required to understand the files.

=================
2. Processed data
=================

These are covered by the CC BY licence and are datasets developed in the work for this thesis.

Train_Movements.zip
This dataset is zipped and contains CSV files. The files were processed from the raw data sources and contain the train movement data used for machine learning models to predict the train travel durations.
There is one folder per movement type, train type and berth combination for berths in the Havant train describer area. The movement types are: restricted and unrestricted. The train types are: freight, ECS and passenger. Each folder contains at least fourteen files of 'folds'. Each fold covers 90 days' worth of movement data, starting from 20180406; the folds are named 'batches' in the thesis (see Section 6.3.2). Note that in the thesis, the batches/folds are numbered from 1, but in the files they are numbered from 0. Many of the berths also have a small 'fold_14' file - this was not used in the work and only exists for data later than 20210916. For more information on the construction and use of these data, please refer to Chapters 5 and 6 of the thesis.

Train_Conflicts.zip
This dataset is zipped and contains CSV files. The files were processed from the raw data sources and contain the train conflicts data used for machine learning models to predict the train travel durations. There is one folder per conflict pair, and there are six conflict pairs modelled in this thesis. Each folder contains at least fourteen files of 'folds'. Each fold covers 90 days' worth of conflict data, starting from 20180406; the folds are named 'batches' in the thesis (see Section 8.3.1). The batches/folds are numbered from 1 both in the thesis and the files. For more information on the construction and use of these data, please refer to Chapters 5 and 8 of the thesis.

=====================
3. Models and Results
=====================

These are covered by the CC BY licence and are models and results produced in the work for this thesis.

results_train_movements.zip
This file is zipped and contains files of results and models relating to train movements.
The data are divided into four folders:
01_model_exploration - results presented in Section 6.4 of the thesis
02_model_selection - results presented in Section 6.5 of the thesis
03_model_final_evaluation - models and results presented in Section 6.6 of the thesis. Note that the machine learning models are saved as pickle files (pkl)
04_simulation_results - the results presented in Chapter 7 of the thesis.

results_train_conflicts.zip
This file is zipped and contains files of results and models relating to train conflicts. The data are divided into four folders:
01_model_exploration - results presented in Section 8.4 of the thesis
02_model_selection - results presented in Section 8.5 of the thesis
03_model_final_evaluation - models and results presented in Section 8.6 of the thesis. Note that the machine learning models are saved as pickle files (pkl)
04_simulation_results - the results presented in Chapter 9 of the thesis.

Date of data collection: 2019-08-01 to 2023-03-31

Licence: Open Government Licence for raw data; CC BY for processed data, models and results.

Related projects/Funders: This research was funded by Network Rail and EPSRC through an Industrial CASE research studentship

Related publication: Thesis entitled 'Machine learning-assisted railway simulation modelling'.

Date that the file was created: August, 2023
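
Note on fold numbering: the fold/batch numbering described for Train_Movements.zip and Train_Conflicts.zip can be illustrated with a short Python sketch. It is not taken from the thesis code. It assumes the folds are contiguous, non-overlapping 90-day windows counted from 20180406, and the function name fold_index and its first_index parameter are hypothetical names introduced only for this illustration.

```python
from datetime import date

# Values taken from this README.
FIRST_DAY = date(2018, 4, 6)   # 20180406, the first day covered by the folds
FOLD_LENGTH_DAYS = 90          # each fold covers 90 days of data


def fold_index(day: date, first_index: int = 0) -> int:
    """Return the fold number whose 90-day window would contain `day`.

    Assumption (not confirmed by the thesis): folds are contiguous,
    non-overlapping windows counted from FIRST_DAY.
    first_index=0 matches the Train_Movements file numbering;
    first_index=1 matches the thesis batch numbers and the
    Train_Conflicts file numbering.
    """
    offset = (day - FIRST_DAY).days
    if offset < 0:
        raise ValueError(f"{day} is before the first fold starts ({FIRST_DAY})")
    return offset // FOLD_LENGTH_DAYS + first_index


print(fold_index(date(2021, 9, 16)))                 # file numbering, movements
print(fold_index(date(2021, 9, 17)))                 # one day later
print(fold_index(date(2021, 9, 16), first_index=1))  # thesis batch numbering
```

Under these assumptions, 20210916 is the last day of movement fold 13 and 20210917 is the first day of fold 14, which agrees with the statement above that 'fold_14' only exists for data later than 20210916.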