READ ME File for "Dataset for Urban Noise Prediction"

Dataset DOI: https://doi.org/10.5258/SOTON/D3762

ReadMe Author: Feiyu Zhu (University of Southampton)
ORCID: https://orcid.org/0009-0001-5496-349X
Contact: [your_email@soton.ac.uk]

This dataset supports the PhD thesis:
Zhu, F. (2025). "Assessing the feasibility of applying machine learning tools to predict environmental stressors from digital urban fingerprints." PhD thesis, University of Southampton.
AWARDED BY: University of Southampton
DATE OF AWARD: 2025

Date of data collection: 2012–2023 (depending on source; see “Data derived from other sources”)

Information about geographic location of data collection: Southampton, Cardiff, Portsmouth, Liverpool, Nottingham, United Kingdom

Licence
-------
This dataset is released under the Creative Commons Attribution licence (CC BY 4.0). Users are free to share and adapt the data, provided appropriate credit is given to the author, a link to the licence is provided, and any changes are indicated.

Related projects / Funders
--------------------------
No external funding was associated with the creation of this dataset.

--------------------
DATA & FILE OVERVIEW
--------------------

This repository contains the code and derived datasets used in the thesis to develop, evaluate, and generalise deep-learning models for urban environmental noise prediction across multiple UK cities. The project corresponds to the public GitHub repository described in the thesis abstract, and is structured to mirror the three data-driven chapters of the dissertation.

Root directory structure
------------------------
At the root level there are four folders:
1. chapter4_efficientnet/
2. chapter5_gnn_southampton/
3. chapter6_crosscity_generation/
4. data_source/

Each chapter folder includes:
- a chapter-specific README.md describing the research objectives and experiments;
- Python scripts for data preparation, model training, inference, and evaluation;
- a results/ folder containing predicted strategic noise maps in Shapefile (.shp) and/or raster (e.g., GeoTIFF) formats.

This organisational principle is consistent across Chapters 4–6.

Chapter-specific contents
-------------------------
chapter4_efficientnet/
- CNN baseline modelling using EfficientNet architectures.
- Scripts include training and prediction workflows.
- results/ contains Southampton noise prediction maps.

chapter5_gnn_southampton/
- Graph neural network modelling on high-resolution (4 m) grids in Southampton.
- Scripts include graph construction, training, prediction, and spatial evaluation.
- results/ contains Southampton GNN prediction products.

chapter6_crosscity_generation/
- Cross-city generalisation framework (dual-branch GNN + domain alignment).
- Code is more extensive and includes:
  • image preprocessing and normalisation
  • domain alignment / adaptation modules
  • integrated dataset generation across cities
  • feature map construction
  • model training and inference
  • pseudo-label generation for target cities
- results/ includes cross-city prediction maps for multiple target cities.

Data sources folder
-------------------
data_source/ contains the primary spatial inputs used across all three chapters:
- WorldView-2 multispectral remote sensing imagery for five UK cities (Southampton, Cardiff, Portsmouth, Liverpool, Nottingham).
- Urban Atlas 2012 land-use / land-cover (LULC) vector datasets for the same cities.
- END4 strategic traffic noise maps for four cities (all except Cardiff, which is based on in situ measurements).

Additionally, this folder includes:
- the integrated cross-city dataset used in Chapter 6;
- pseudo-label datasets generated for four target cities as described in Chapter 6.

Southampton measured noise observations are not included in this repository because the measurement campaign originates from Alvares-Sanches et al. and is subject to third-party data ownership.

Relationship between files
--------------------------
- data_source/ provides raw spatial inputs (imagery, LULC, END4 maps) and the derived Chapter 6 cross-city datasets and pseudo-labels.
- chapter4_efficientnet/ and chapter5_gnn_southampton/ each construct modelling datasets from Southampton observations and spatial predictors, then train and evaluate the chapter-specific models.
- chapter6_crosscity_generation/ builds multi-city datasets from data_source/, applies domain alignment and pseudo-labelling, and generates prediction maps for cities without direct measurements.
- results/ folders across chapters provide final predicted noise maps used in the thesis figures.

Additional related data not included in this package
---------------------------------------------------
- Southampton measured noise points derived from Alvares-Sanches et al. are not redistributed here. Users wishing to reproduce Southampton measurement-based experiments should request access from the original authors or use alternative public measurements where available.

Data derived from other sources
-------------------------------
This project uses multiple third-party datasets:
- WorldView-2 European Cities Archive for five UK cities (ESA, licensed remote sensing data).
- Urban Atlas 2012 LULC vector products (Copernicus/EEA, open policy-compliant use).
- END4 strategic traffic noise maps for four cities (publicly released environmental noise products).

Users should cite original data providers in any publication derived from this dataset.

Multiple versions of the dataset
--------------------------------
This is the first and only public version of the dataset accompanying the PhD thesis (Version 1.0, Nov 2025). Future updates, if any, will be versioned through the DOI record.

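Checking a downloaded copy (illustrative)
-----------------------------------------
The folder layout described under "Root directory structure" can be verified after download with a short script. The following is a minimal sketch, not part of the deposited code: the function name missing_items is hypothetical, and it assumes only the four root folders plus a README.md and a results/ folder in each chapter folder, as stated in this README.

```python
from pathlib import Path

# Folder names are taken from the "Root directory structure" section of this README.
CHAPTER_DIRS = (
    "chapter4_efficientnet",
    "chapter5_gnn_southampton",
    "chapter6_crosscity_generation",
)
DATA_DIR = "data_source"


def missing_items(root):
    """Return the expected paths (relative to root) that do not exist.

    Hypothetical helper for illustration; it checks presence only,
    not the contents of any folder.
    """
    root = Path(root)
    expected = [Path(DATA_DIR)]
    for chapter in CHAPTER_DIRS:
        expected.append(Path(chapter))
        expected.append(Path(chapter) / "README.md")
        expected.append(Path(chapter) / "results")
    return [p.as_posix() for p in expected if not (root / p).exists()]


if __name__ == "__main__":
    # Run from the root of the unpacked dataset.
    for item in missing_items("."):
        print("missing:", item)
```

An empty result means the top-level layout matches this README; any printed path points to an incomplete download or extraction.
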
--------------------------
METHODOLOGICAL INFORMATION
--------------------------

Description of methods used for collection / generation of data
---------------------------------------------------------------
Although the project contains substantial chapter-specific details, the overall workflow follows a standard deep-learning pipeline for environmental prediction:
1. Spatial predictors were derived from multispectral remote sensing imagery and/or LULC vector data.
2. These predictors were matched to noise observations (where available) to construct modelling datasets at multiple spatial resolutions (e.g., 30 m CNN patches and 4 m GNN grids).
3. Deep-learning models were trained using the constructed datasets.
4. Trained models were applied at full-city scale or transferred to other cities without direct measurements to generate strategic noise prediction maps.

Full methodological detail is provided in Chapters 4–6 of the thesis.

Methods for processing the data
-------------------------------
Key processing steps include:
- Remote sensing preprocessing (radiometric correction, mosaicking, resampling, tiling).
- Spatial harmonisation of imagery, LULC polygons, and noise layers to common projections and modelling grids.
- Feature extraction from WV-2 imagery and LULC metrics at multiple scales.
- Dataset construction for CNN and GNN workflows.
- Cross-city domain alignment and pseudo-label generation (Chapter 6).
- Export of city-wide prediction products for post-processing and cartographic analysis.

Software / Instrument-specific information
------------------------------------------
Preprocessing and GIS:
- Orfeo Toolbox (remote sensing preprocessing).
- ArcGIS Pro 2.7 (spatial processing, map export, analysis).
- ENVI 5.7 (image correction and feature support).

Python environment:
- Python 3.8
- rasterio 1.3.2
- geopandas 0.13.2
- plus standard ML and scientific libraries.

Model training and inference:
- Conducted in Google Colab GPU environments.
- TensorFlow 2.7 (CNN baseline training).
- PyTorch 2.8.0
- torch_geometric 0.4.0 (GNN and cross-city modelling).

Prediction outputs were exported to ArcGIS Pro for spatial evaluation, visualisation, and figure preparation.

Standards and calibration information
-------------------------------------
Not applicable (data products are derived from established remote sensing and GIS processing standards and publicly available noise products).

Environmental / experimental conditions
---------------------------------------
Not applicable (no field experiments were conducted directly by the author in this dataset; all measurement data included are derived or model-generated).

Quality-assurance procedures
----------------------------
- Consistency checks of projections, grid alignment, and spatial extents across cities.
- Feature sanity checks (range validation, missing-value screening).
- Cross-validation and spatial leakage-avoidance splits as documented in the thesis.
- Visual inspection of predicted and intermediate maps prior to final export.

People involved
---------------
Data curation and submission:
- Feiyu Zhu (University of Southampton)

--------------------------
LEGAL & ETHICS INFORMATION
--------------------------

Ethical approval:
No ethical approval was required for this dataset, as it does not involve human participants or personal / identifiable data.

Personal data:
This dataset contains no personally identifiable information. All spatial layers describe environmental exposures and urban form. Southampton measurement data are not redistributed due to third-party ownership.

--------------------------
DATA-SPECIFIC INFORMATION
--------------------------

Given the multi-chapter structure, data-specific details are documented within each chapter folder README.md. The key files include:
- Chapter-level training / inference scripts (Python).
- Chapter-level processed datasets and feature tables (as described per chapter).
- Chapter-level results/ folders containing predicted noise maps in .shp and/or raster formats.

Users should refer to:
- chapter4_efficientnet/README.md
- chapter5_gnn_southampton/README.md
- chapter6_crosscity_generation/README.md
for file-by-file variable definitions, model inputs/outputs, and experiment descriptions.

Date that the file was created: November 2025
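
Checking Shapefile completeness in results/ (illustrative)
----------------------------------------------------------
Because the results/ folders may contain Shapefiles, and a .shp file can only be opened when its .shx and .dbf companions sit beside it, a quick completeness check can be useful after copying or unpacking the data. The following is a minimal sketch under stated assumptions: the function name incomplete_shapefiles is hypothetical, the chapter*/results/ search pattern is inferred from the folder names in this README, and the script is not part of the deposited code.

```python
from pathlib import Path

# The Shapefile format requires .shx and .dbf files next to each .shp.
REQUIRED_SIDECARS = (".shx", ".dbf")


def incomplete_shapefiles(root):
    """Map each .shp under chapter*/results/ to the sidecar suffixes it lacks.

    Hypothetical helper for illustration; Shapefiles with all required
    sidecars are omitted from the returned dictionary.
    """
    root = Path(root)
    problems = {}
    # "**" also matches zero directories, so files directly in results/ are included.
    for shp in sorted(root.glob("chapter*/results/**/*.shp")):
        missing = [s for s in REQUIRED_SIDECARS if not shp.with_suffix(s).exists()]
        if missing:
            problems[shp.relative_to(root).as_posix()] = missing
    return problems


if __name__ == "__main__":
    # Run from the root of the unpacked dataset.
    for path, missing in incomplete_shapefiles(".").items():
        print(path, "is missing", ", ".join(missing))
```

The check covers only the mandatory companion files; optional files such as .prj (projection) are not tested, so a clean result does not by itself confirm that the coordinate reference system is available.
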