READ ME File For 'Models and Methods for Integer Programming under Structure and Uncertainty Dataset'

Dataset DOI: 10.5258/SOTON/PG/D118

ReadMe Author: Montree Jaidee, University of Southampton
ORCID ID: 0009-0009-4298-8356

This dataset supports the thesis entitled 'Models and Methods for Integer Programming under Structure and Uncertainty'
AWARDED BY: University of Southampton
DATE OF AWARD: 2026

-------------------
LICENSE INFORMATION
-------------------

This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

You are free to:
- Share: copy and redistribute the material in any medium or format
- Adapt: remix, transform, and build upon the material for any purpose, even commercially

Under the following terms:
- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

Full license text: https://creativecommons.org/licenses/by/4.0/

Suggested citation:
Jaidee, M. (2026). Models and Methods for Integer Programming under Structure and Uncertainty Dataset [Data set]. University of Southampton. https://doi.org/10.5258/SOTON/PG/D118

--------------------
DATA & FILE OVERVIEW
--------------------

This dataset contains 4 folders, one for each of Chapters 4 to 7 of the thesis: Chapter_4 through Chapter_7.
Chapter_4/ (33 files)

Binary classification benchmark datasets in NumPy binary format (.npy), each split into two disjoint partitions (suffix _A and _B):
adult_A.npy, adult_B.npy
bank_marketing_A.npy, bank_marketing_B.npy
blood_transfusion_A.npy, blood_transfusion_B.npy
breast_cancer_wisconsin_A.npy, breast_cancer_wisconsin_B.npy
covertype_A.npy, covertype_B.npy
credit_default_A.npy, credit_default_B.npy
magic_gamma_A.npy, magic_gamma_B.npy
occupancy_detection_A.npy, occupancy_detection_B.npy
online_shoppers_A.npy, online_shoppers_B.npy
pima_diabetes_A.npy, pima_diabetes_B.npy
rice_A.npy, rice_B.npy
skin_segmentation_A.npy, skin_segmentation_B.npy
spambase_A.npy, spambase_B.npy
vertebral_column_A.npy, vertebral_column_B.npy

Synthetic high-dimensional datasets in pickle format (.pkl):
HighDim_low_noise.pkl - High-dimensional dataset with low noise
HighDim_medium_noise.pkl - High-dimensional dataset with medium noise
HighDim_high_noise.pkl - High-dimensional dataset with high noise

Other instance files:
Vision_Instances.pkl - Instances derived from computer vision data
test_instances.pkl - Test problem instances

Chapter_5/ (7 files across 3 subdirectories)

dictionaries/
distance_dict.json - Pairwise distance dictionary between UK postcode sectors
distance_dict.json.pbz2 - Compressed version of distance_dict.json (bz2 format)
travel_dict.json - Pairwise travel time dictionary between UK postcode sectors
instance_PostcodeSector_travel_dict.json.pbz2 - Compressed travel dictionary for problem instances

instances/
Sector dataset.xlsx - Dataset of UK postcode sectors with associated attributes

map_data/
all_sectors.geojson - GeoJSON boundary file for all UK postcode sectors
Counties_and_Unitary_Authorities_April_2019_Ultra_Generalised_Boundaries_EW_2022_1790725369940793023.geojson - GeoJSON boundary file for counties and unitary authorities in England and Wales (April 2019, ultra-generalised)

Chapter_6/ (4 files)

data_baseline_100.csv - 100 baseline demand scenarios across 12 time periods (t0-t11)
data_dips_100.csv - 100 demand scenarios featuring periodic dips across 12 time periods
data_sparse_100.csv - 100 sparse demand scenarios across 12 time periods
data_costs.csv - Per-period reward and penalty cost parameters for 12 time periods

Chapter_7/ (4 files + 616 scenario files in subdirectory)

EMHIRES_PV_2015.csv - Hourly solar photovoltaic (PV) capacity factors for 2015 from the EMHIRES dataset, covering 36 European countries (8,760 rows)
EMHIRES_wind_2015.csv - Hourly wind capacity factors for 2015 from the EMHIRES dataset, covering 37 European countries (8,760 rows)
df_pv.csv - Processed hourly PV capacity factor data for European countries (8,760 rows)
df_wind.csv - Processed hourly wind capacity factor data for European countries (8,760 rows)
scenarios/ - 616 CSV files of ARMA-generated renewable energy scenarios. Files follow the naming convention {type}_arma_{country}_{set_N}.csv, where {type} is 'pv' or 'wind', {country} is a two-letter country code (28 countries: AT, BE, BG, CH, CZ, DE, DK, EE, EL, ES, FI, FR, HR, HU, IE, IT, LT, LU, LV, NL, NO, PL, PT, RO, SE, SI, SK, UK), and {set_N} is the set index (1-11). Each file contains 24 hourly rows and 100 scenario columns (scenario_0 to scenario_99).

Relationship between files:
- Chapter_4: The _A and _B file pairs for each dataset represent disjoint partitions of the same source dataset. The freeze_sol.npy and freeze_sol2.npy files are solutions associated with freeze_instance.npy.
- Chapter_5: The .json.pbz2 files are bz2-compressed equivalents of the corresponding .json dictionary files. The 'Sector dataset.xlsx' instances file corresponds to the postcode sectors represented in the map_data GeoJSON files.
- Chapter_7: The EMHIRES_*.csv files are the source data from which df_pv.csv and df_wind.csv are derived. The scenario files in the scenarios/ subdirectory are ARMA-generated samples based on the processed EMHIRES data.
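The files above can be read with Python's standard tooling, as a minimal sketch. The file names come from the listing above; the helper function names are illustrative (not part of the dataset), and the exact array shapes and pickle contents are not specified in this README.

```python
# Minimal loading sketch for the Chapter_4 and Chapter_5 files.
# Assumes the dataset is unpacked in the current working directory;
# helper names below are illustrative, not part of the dataset.
import bz2
import json
import pickle

import numpy as np


def load_partition_pair(path_a, path_b):
    """Load one _A/_B disjoint partition pair (Chapter_4 .npy files)."""
    return np.load(path_a), np.load(path_b)


def load_pickle_instances(path):
    """Load a .pkl instance file (e.g. HighDim_low_noise.pkl)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def load_compressed_dict(path):
    """Decompress a .json.pbz2 dictionary (Chapter_5) via the bz2 module."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```

For example, `load_partition_pair("Chapter_4/adult_A.npy", "Chapter_4/adult_B.npy")` returns the two disjoint partitions as NumPy arrays, and `load_compressed_dict("Chapter_5/dictionaries/distance_dict.json.pbz2")` recovers the pairwise distance dictionary without first writing the decompressed .json to disk.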
Additional related data collected that was not included in the current data package:

The benchmark datasets in Chapter_4 were sourced from the UCI Machine Learning Repository (https://archive.ics.uci.edu/) and are publicly available at their original sources. The EMHIRES dataset in Chapter_7 is publicly available from the European Commission's Joint Research Centre (https://setis.ec.europa.eu/EMHIRES-datasets).

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

Description of methods used for collection/generation of data:

Chapter_4: Standard binary classification benchmark datasets were obtained from the UCI Machine Learning Repository and other public sources. Each dataset was pre-processed and split into two disjoint partitions (A and B). High-dimensional synthetic datasets were generated programmatically with varying levels of additive noise. Vision instances were derived from image classification data. Problem instances were constructed for use in integer programming experiments.

Chapter_5: Geographic and travel time data for UK postcode sectors were compiled from publicly available sources. Distance and travel time matrices were computed between postcode sector centroids. Boundary geometries were obtained from the Office for National Statistics (ONS) Open Geography Portal.

Chapter_6: Scenario data were generated synthetically to represent different demand patterns (baseline, dips, and sparse) over a 12-period planning horizon. Cost parameters (rewards and penalties) were generated to reflect realistic cost structures for a stochastic integer programming problem.

Chapter_7: Hourly solar PV and wind capacity factor data for 2015 were obtained from the EMHIRES (European Meteorological derived HIndcast of REnewable Energy Sources) dataset published by the European Commission's Joint Research Centre.
Scenario sets were generated using ARMA (AutoRegressive Moving Average) time-series modelling fitted to the EMHIRES data, producing 100 stochastic scenarios per 24-hour period for each country and energy type.

Methods for processing the data:

Chapter_4: Raw datasets were cleaned, normalised, and split into A/B partitions. Synthetic datasets were generated using controlled random processes with varying noise levels.

Chapter_5: Distance and travel time dictionaries were constructed from geographic coordinate data. Large dictionaries were compressed using the bz2 format for efficient storage.

Chapter_7: Raw EMHIRES CSV files were processed to extract and reformat country-level capacity factors. ARMA models were fitted to historical data and used to generate forward scenario sets.

Software- or Instrument-specific information needed to interpret the data:

- Python 3.x is required to read .npy files (numpy library) and .pkl files (pickle module).
- Python libraries: numpy, pandas, json, pickle, bz2 (all standard or widely available).
- Microsoft Excel or compatible software (e.g., LibreOffice Calc) to open .xlsx files.
- GIS software (e.g., QGIS) or Python geopandas/folium to visualise .geojson files.
- All .csv files can be opened with any spreadsheet software or text editor.
- Compressed .json.pbz2 files can be decompressed using Python's bz2 module.

Describe any quality-assurance procedures performed on the data:

Benchmark datasets in Chapter_4 were verified against their published descriptions from the UCI repository. Scenario files in Chapter_7 were checked to ensure correct country codes, set indices, and consistent dimensionality (24 hours × 100 scenarios per file).

People involved with sample collection, processing, analysis and/or submission:

Montree Jaidee, University of Southampton (data processing and submission)

Date that the file was created: May 2026
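The dimensionality check applied to the Chapter_7 scenario files can be sketched as follows. This is illustrative, not the original QA script: the expected shape (24 hourly rows, columns scenario_0 to scenario_99) and the scenarios/ naming convention come from this README, and the function names are assumptions. If the CSV files also carry an hour-index column, the expected column list would need adjusting.

```python
# Sketch of the Chapter_7 QA check: every scenario CSV should contain
# 24 hourly rows and exactly the columns scenario_0 .. scenario_99.
# Directory layout and file naming follow the file overview above;
# function names are illustrative.
from pathlib import Path

import pandas as pd

EXPECTED_COLUMNS = [f"scenario_{i}" for i in range(100)]


def check_scenario_file(path):
    """Return True if one scenario CSV has the expected 24 x 100 shape."""
    df = pd.read_csv(path)
    return len(df) == 24 and list(df.columns) == EXPECTED_COLUMNS


def check_all(scenario_dir="Chapter_7/scenarios"):
    """Yield (filename, ok) for every {type}_arma_{country}_{set}.csv file."""
    for path in sorted(Path(scenario_dir).glob("*_arma_*.csv")):
        yield path.name, check_scenario_file(path)
```

Country codes and set indices can be verified in the same pass by splitting each file name on underscores and comparing against the 28 country codes and set range (1-11) listed in the file overview.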