READ ME File For 'Models and Methods for Integer Programming under Structure and Uncertainty Dataset'

Dataset DOI: 10.5258/SOTON/PG/D118

ReadMe Author: Montree Jaidee, University of Southampton
ORCID ID: 0009-0009-4298-8356

This dataset supports the thesis entitled 'Models and Methods for Integer Programming under Structure and Uncertainty'
AWARDED BY: University of Southampton
DATE OF AWARD: 2026

-------------------
LICENSE INFORMATION
-------------------

This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

You are free to:
- Share: copy and redistribute the material in any medium or format
- Adapt: remix, transform, and build upon the material for any purpose, even commercially

Under the following terms:
- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

Full license text: https://creativecommons.org/licenses/by/4.0/

Suggested citation:
Jaidee, M. (2026). Models and Methods for Integer Programming under Structure and Uncertainty Dataset [Data set]. University of Southampton. https://doi.org/10.5258/SOTON/PG/D118

--------------------
DATA & FILE OVERVIEW
--------------------

This dataset contains 4 folders, one for each of Chapters 4 to 7 of the thesis: Chapter_4 through Chapter_7.
Chapter_4/ (33 files)

Binary classification benchmark datasets in NumPy binary format (.npy), each split into two disjoint partitions (suffix _A and _B):
adult_A.npy, adult_B.npy
bank_marketing_A.npy, bank_marketing_B.npy
blood_transfusion_A.npy, blood_transfusion_B.npy
breast_cancer_wisconsin_A.npy, breast_cancer_wisconsin_B.npy
covertype_A.npy, covertype_B.npy
credit_default_A.npy, credit_default_B.npy
magic_gamma_A.npy, magic_gamma_B.npy
occupancy_detection_A.npy, occupancy_detection_B.npy
online_shoppers_A.npy, online_shoppers_B.npy
pima_diabetes_A.npy, pima_diabetes_B.npy
rice_A.npy, rice_B.npy
skin_segmentation_A.npy, skin_segmentation_B.npy
spambase_A.npy, spambase_B.npy
vertebral_column_A.npy, vertebral_column_B.npy

Synthetic high-dimensional datasets in pickle format (.pkl):
HighDim_low_noise.pkl - High-dimensional dataset with low noise
HighDim_medium_noise.pkl - High-dimensional dataset with medium noise
HighDim_high_noise.pkl - High-dimensional dataset with high noise

Other instance files:
Vision_Instances.pkl - Instances derived from computer vision data
test_instances.pkl - Test problem instances

Chapter_5/ (7 files across 3 subdirectories)

dictionaries/
distance_dict.json - Pairwise distance dictionary between UK postcode sectors
distance_dict.json.pbz2 - Compressed version of distance_dict.json (bz2 format)
travel_dict.json - Pairwise travel time dictionary between UK postcode sectors
instance_PostcodeSector_travel_dict.json.pbz2 - Compressed travel dictionary for problem instances

instances/
Sector dataset.xlsx - Dataset of UK postcode sectors with associated attributes

map_data/
all_sectors.geojson - GeoJSON boundary file for all UK postcode sectors
Counties_and_Unitary_Authorities_April_2019_Ultra_Generalised_Boundaries_EW_2022_1790725369940793023.geojson - GeoJSON boundary file for counties and unitary authorities in England and Wales (April 2019, ultra-generalised)

Chapter_6/ (4 files)

data_baseline_100.csv - 100 baseline demand scenarios across 12 time periods (t0-t11)
data_dips_100.csv - 100 demand scenarios featuring periodic dips across 12 time periods
data_sparse_100.csv - 100 sparse demand scenarios across 12 time periods
data_costs.csv - Per-period reward and penalty cost parameters for 12 time periods

Chapter_7/ (4 files + 616 scenario files in subdirectory)

EMHIRES_PV_2015.csv - Hourly solar photovoltaic (PV) capacity factors for 2015 from the EMHIRES dataset, covering 36 European countries (8,760 rows)
EMHIRES_wind_2015.csv - Hourly wind capacity factors for 2015 from the EMHIRES dataset, covering 37 European countries (8,760 rows)
df_pv.csv - Processed hourly PV capacity factor data for European countries (8,760 rows)
df_wind.csv - Processed hourly wind capacity factor data for European countries (8,760 rows)
scenarios/ - 616 CSV files of ARMA-generated renewable energy scenarios. Files follow the naming convention {type}_arma_{country}_{set_N}.csv, where {type} is 'pv' or 'wind', {country} is a two-letter country code (28 countries: AT, BE, BG, CH, CZ, DE, DK, EE, EL, ES, FI, FR, HR, HU, IE, IT, LT, LU, LV, NL, NO, PL, PT, RO, SE, SI, SK, UK), and {set_N} is the set index (1-11). Each file contains 24 hourly rows and 100 scenario columns (scenario_0 to scenario_99).

Relationship between files:
- Chapter_4: The _A and _B file pairs for each dataset represent disjoint partitions of the same source dataset. The freeze_sol.npy and freeze_sol2.npy files are solutions associated with freeze_instance.npy.
- Chapter_5: The .json.pbz2 files are bz2-compressed equivalents of the corresponding .json dictionary files. The 'Sector dataset.xlsx' instances file corresponds to the postcode sectors represented in the map_data GeoJSON files.
- Chapter_7: The EMHIRES_*.csv files are the source data from which df_pv.csv and df_wind.csv are derived. The scenario files in the scenarios/ subdirectory are ARMA-generated samples based on the processed EMHIRES data.
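The files above can be read with Python's standard tooling, as a minimal sketch. The file names come from the listing above; the helper function names are illustrative (not part of the dataset), and the exact array shapes and pickle contents are not specified in this README.

```python
# Minimal loading sketch for the Chapter_4 and Chapter_5 files.
# Assumes the dataset is unpacked in the current working directory;
# helper names below are illustrative, not part of the dataset.
import bz2
import json
import pickle

import numpy as np


def load_partition_pair(path_a, path_b):
    """Load one _A/_B disjoint partition pair (Chapter_4 .npy files)."""
    return np.load(path_a), np.load(path_b)


def load_pickle_instances(path):
    """Load a .pkl instance file (e.g. HighDim_low_noise.pkl)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def load_compressed_dict(path):
    """Decompress a .json.pbz2 dictionary (Chapter_5) via the bz2 module."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```

For example, `load_partition_pair("Chapter_4/adult_A.npy", "Chapter_4/adult_B.npy")` returns the two disjoint partitions as NumPy arrays, and `load_compressed_dict("Chapter_5/dictionaries/distance_dict.json.pbz2")` recovers the pairwise distance dictionary without first writing the decompressed .json to disk.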
Additional related data collected that was not included in the current data package:

The benchmark datasets in Chapter_4 were sourced from the UCI Machine Learning Repository (https://archive.ics.uci.edu/) and are publicly available at their original sources. The EMHIRES dataset in Chapter_7 is publicly available from the European Commission's Joint Research Centre (https://setis.ec.europa.eu/EMHIRES-datasets).

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

Description of methods used for collection/generation of data:

Chapter_4: Standard binary classification benchmark datasets were obtained from the UCI Machine Learning Repository and other public sources. Each dataset was pre-processed and split into two disjoint partitions (A and B). High-dimensional synthetic datasets were generated programmatically with varying levels of additive noise. Vision instances were derived from image classification data. Problem instances were constructed for use in integer programming experiments.

Chapter_5: Geographic and travel time data for UK postcode sectors were compiled from publicly available sources. Distance and travel time matrices were computed between postcode sector centroids. Boundary geometries were obtained from the Office for National Statistics (ONS) Open Geography Portal.

Chapter_6: Scenario data were generated synthetically to represent different demand patterns (baseline, dips, and sparse) over a 12-period planning horizon. Cost parameters (rewards and penalties) were generated to reflect realistic cost structures for a stochastic integer programming problem.

Chapter_7: Hourly solar PV and wind capacity factor data for 2015 were obtained from the EMHIRES (European Meteorological derived HIndcast of REnewable Energy Sources) dataset published by the European Commission's Joint Research Centre.
Scenario sets were generated using ARMA (AutoRegressive Moving Average) time-series modelling fitted to the EMHIRES data, producing 100 stochastic scenarios per 24-hour period for each country and energy type.

Methods for processing the data:

Chapter_4: Raw datasets were cleaned, normalised, and split into A/B partitions. Synthetic datasets were generated using controlled random processes with varying noise levels.

Chapter_5: Distance and travel time dictionaries were constructed from geographic coordinate data. Large dictionaries were compressed using the bz2 format for efficient storage.

Chapter_7: Raw EMHIRES CSV files were processed to extract and reformat country-level capacity factors. ARMA models were fitted to historical data and used to generate forward scenario sets.

Software- or Instrument-specific information needed to interpret the data:

- Python 3.x is required to read .npy files (numpy library) and .pkl files (pickle module).
- Python libraries: numpy, pandas, json, pickle, bz2 (all standard or widely available).
- Microsoft Excel or compatible software (e.g., LibreOffice Calc) to open .xlsx files.
- GIS software (e.g., QGIS) or Python geopandas/folium to visualise .geojson files.
- All .csv files can be opened with any spreadsheet software or text editor.
- Compressed .json.pbz2 files can be decompressed using Python's bz2 module.

Describe any quality-assurance procedures performed on the data:

Benchmark datasets in Chapter_4 were verified against their published descriptions from the UCI repository. Scenario files in Chapter_7 were checked to ensure correct country codes, set indices, and consistent dimensionality (24 hours × 100 scenarios per file).

People involved with sample collection, processing, analysis and/or submission:

Montree Jaidee, University of Southampton (data processing and submission)

Date that the file was created: May 2026
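The dimensionality check applied to the Chapter_7 scenario files can be sketched as follows. This is illustrative, not the original QA script: the expected shape (24 hourly rows, columns scenario_0 to scenario_99) and the scenarios/ naming convention come from this README, and the function names are assumptions. If the CSV files also carry an hour-index column, the expected column list would need adjusting.

```python
# Sketch of the Chapter_7 QA check: every scenario CSV should contain
# 24 hourly rows and exactly the columns scenario_0 .. scenario_99.
# Directory layout and file naming follow the file overview above;
# function names are illustrative.
from pathlib import Path

import pandas as pd

EXPECTED_COLUMNS = [f"scenario_{i}" for i in range(100)]


def check_scenario_file(path):
    """Return True if one scenario CSV has the expected 24 x 100 shape."""
    df = pd.read_csv(path)
    return len(df) == 24 and list(df.columns) == EXPECTED_COLUMNS


def check_all(scenario_dir="Chapter_7/scenarios"):
    """Yield (filename, ok) for every {type}_arma_{country}_{set}.csv file."""
    for path in sorted(Path(scenario_dir).glob("*_arma_*.csv")):
        yield path.name, check_scenario_file(path)
```

Country codes and set indices can be verified in the same pass by splitting each file name on underscores and comparing against the 28 country codes and set range (1-11) listed in the file overview.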