README file for "Dataset for Urban Noise Prediction"

Dataset DOI: https://doi.org/10.5258/SOTON/D3762

ReadMe Author: Feiyu Zhu (University of Southampton)
ORCID: https://orcid.org/0009-0001-5496-349X
Contact: [your_email@soton.ac.uk]

This dataset supports the PhD thesis:

    Zhu, F. (2025). "Assessing the feasibility of applying machine learning tools 
    to predict environmental stressors from digital urban fingerprints."
    PhD thesis, University of Southampton.

AWARDED BY: University of Southampton  
DATE OF AWARD: 2025  

Date of data collection: 2012–2023 (depending on source; see “Data derived from other sources”)  

Information about geographic location of data collection:  
Southampton, Cardiff, Portsmouth, Liverpool, Nottingham, United Kingdom


Licence
-------

This dataset is released under the Creative Commons Attribution licence (CC BY 4.0).  
Users are free to share and adapt the data, provided appropriate credit is given to the author, 
a link to the licence is provided, and any changes are indicated.


Related projects / Funders
--------------------------

No external funding was associated with the creation of this dataset.  


--------------------
DATA & FILE OVERVIEW
--------------------

This repository contains the code and derived datasets used in the thesis to develop,
evaluate, and generalise deep-learning models for urban environmental noise prediction
across multiple UK cities. It corresponds to the public GitHub repository described in
the thesis abstract and is structured to mirror the three data-driven chapters of the
thesis (Chapters 4–6).

Root directory structure
------------------------

At the root level there are four folders:

1. chapter4_efficientnet/  
2. chapter5_gnn_southampton/  
3. chapter6_crosscity_generation/  
4. data_source/

Each chapter folder includes:
- a chapter-specific README.md describing the research objectives and experiments;  
- Python scripts for data preparation, model training, inference, and evaluation;  
- a results/ folder containing predicted strategic noise maps in Shapefile (.shp)
  and/or raster (e.g., GeoTIFF) formats.

This organisational principle is consistent across Chapters 4–6.
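
As a quick check after downloading, the layout described above can be verified with a
short script. This is a minimal sketch, not a script shipped with the dataset: the folder
names come from this README, while the function name `missing_items` and the choice of
what to check are illustrative assumptions.

```python
from pathlib import Path

# Folder names are taken from this README; the checks themselves are an
# illustrative assumption, not part of the published code.
CHAPTER_FOLDERS = (
    "chapter4_efficientnet",
    "chapter5_gnn_southampton",
    "chapter6_crosscity_generation",
)
DATA_FOLDER = "data_source"


def missing_items(root):
    """Return the expected relative paths that are absent under `root`."""
    root = Path(root)
    expected = [root / DATA_FOLDER]
    for chapter in CHAPTER_FOLDERS:
        expected.append(root / chapter / "README.md")
        expected.append(root / chapter / "results")
    return [p.relative_to(root).as_posix() for p in expected if not p.exists()]
```

Calling `missing_items(".")` from the root directory should return an empty list when the
package has been unpacked completely.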

Chapter-specific contents
-------------------------

chapter4_efficientnet/  
- CNN baseline modelling using EfficientNet architectures.  
- Scripts include training and prediction workflows.  
- results/ contains Southampton noise prediction maps.

chapter5_gnn_southampton/  
- Graph neural network modelling on high-resolution (4 m) grids in Southampton.  
- Scripts include graph construction, training, prediction, and spatial evaluation.  
- results/ contains Southampton GNN prediction products.

chapter6_crosscity_generation/  
- Cross-city generalisation framework (dual-branch GNN + domain alignment).  
- Code is more extensive and includes:
    • image preprocessing and normalisation  
    • domain alignment / adaptation modules  
    • integrated dataset generation across cities  
    • feature map construction  
    • model training and inference  
    • pseudo-label generation for target cities  
- results/ includes cross-city prediction maps for multiple target cities.

Data sources folder
-------------------

data_source/ contains the primary spatial inputs used across all three chapters:

- WorldView-2 multispectral remote sensing imagery for five UK cities  
  (Southampton, Cardiff, Portsmouth, Liverpool, Nottingham).

- Urban Atlas 2012 land-use / land-cover (LULC) vector datasets for the same cities.

- END4 strategic traffic noise maps for four cities  
  (all except Cardiff, where the noise data are based on in situ measurements).

Additionally, this folder includes:
- the integrated cross-city dataset used in Chapter 6;  
- pseudo-label datasets generated for four target cities as described in Chapter 6.

Southampton measured noise observations are not included in this repository because the
measurement campaign originates from Alvares-Sanches et al. and is subject to third-party
data ownership.


Relationship between files
--------------------------

- data_source/ provides raw spatial inputs (imagery, LULC, END4 maps) and the derived
  Chapter 6 cross-city datasets and pseudo-labels.

- chapter4_efficientnet/ and chapter5_gnn_southampton/ each construct modelling datasets
  from Southampton observations and spatial predictors, then train and evaluate the
  chapter-specific models.

- chapter6_crosscity_generation/ builds multi-city datasets from data_source/, applies
  domain alignment and pseudo-labelling, and generates prediction maps for cities without
  direct measurements.

- results/ folders across chapters provide final predicted noise maps used in the
  thesis figures.


Additional related data not included in this package
---------------------------------------------------

- Southampton measured noise points derived from Alvares-Sanches et al. are not redistributed
  here. Users wishing to reproduce Southampton measurement-based experiments should request
  access from the original authors or use alternative public measurements where available.


Data derived from other sources
-------------------------------

This project uses multiple third-party datasets:

- WorldView-2 European Cities Archive for five UK cities (ESA, licensed remote sensing data).  
- Urban Atlas 2012 LULC vector products (Copernicus/EEA, used in compliance with its open data policy).  
- END4 strategic traffic noise maps for four cities (publicly released environmental
  noise products).  

Users should cite original data providers in any publication derived from this dataset.


Multiple versions of the dataset
--------------------------------

This is the first and only public version of the dataset accompanying the PhD thesis
(Version 1.0, Nov 2025). Future updates, if any, will be versioned through the DOI record.



--------------------------
METHODOLOGICAL INFORMATION
--------------------------

Description of methods used for collection / generation of data
---------------------------------------------------------------

Although the project contains substantial chapter-specific details, the overall workflow
follows a standard deep-learning pipeline for environmental prediction:

1. Spatial predictors were derived from multispectral remote sensing imagery and/or
   LULC vector data.  

2. These predictors were matched to noise observations (where available) to construct
   modelling datasets at multiple spatial resolutions (e.g., 30 m CNN patches and 4 m
   GNN grids).

3. Deep-learning models were trained using the constructed datasets.

4. Trained models were applied at full-city scale or transferred to other cities without
   direct measurements to generate strategic noise prediction maps.

Full methodological detail is provided in Chapters 4–6 of the thesis.
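
Step 2 of the workflow above (matching observations to a modelling grid) can be
illustrated as follows. This is a minimal sketch under stated assumptions, not the thesis
code: it assumes projected coordinates in metres, a north-up grid with a top-left origin,
and energetic (energy-scale) averaging of decibel levels within each cell. All function
names are hypothetical, and the thesis scripts may aggregate observations differently.

```python
import math
from collections import defaultdict


def cell_index(x, y, origin_x, origin_y, cell_size):
    """Map projected coordinates (metres) to a (row, col) grid cell.

    Assumes a north-up grid whose origin is the top-left corner.
    """
    col = math.floor((x - origin_x) / cell_size)
    row = math.floor((origin_y - y) / cell_size)
    return row, col


def energetic_mean_db(levels):
    """Average decibel levels on the energy scale, then convert back to dB."""
    mean_energy = sum(10 ** (level / 10) for level in levels) / len(levels)
    return 10 * math.log10(mean_energy)


def aggregate_to_grid(points, origin_x, origin_y, cell_size=4.0):
    """Group (x, y, level_db) observations by grid cell and average them."""
    cells = defaultdict(list)
    for x, y, level_db in points:
        cells[cell_index(x, y, origin_x, origin_y, cell_size)].append(level_db)
    return {cell: energetic_mean_db(levels) for cell, levels in cells.items()}
```

The default `cell_size` of 4.0 mirrors the 4 m GNN grid mentioned above; a 30 m value
would correspond to the CNN patch scale.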


Methods for processing the data
-------------------------------

Key processing steps include:

- Remote sensing preprocessing (radiometric correction, mosaicking, resampling, tiling).  
- Spatial harmonisation of imagery, LULC polygons, and noise layers to common projections
  and modelling grids.  
- Extraction of image features from WorldView-2 (WV-2) imagery and of LULC metrics at multiple scales.  
- Dataset construction for CNN and GNN workflows.  
- Cross-city domain alignment and pseudo-label generation (Chapter 6).  
- Export of city-wide prediction products for post-processing and cartographic analysis.
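
The tiling step listed above can be sketched in plain Python. This is an illustrative
example only: the function name `tile_windows`, the tile size, and the decision to clip
edge tiles (rather than pad them) are assumptions, not a description of the chapter
scripts.

```python
def tile_windows(height, width, tile_size, stride=None):
    """Yield (row_off, col_off, n_rows, n_cols) windows covering a raster.

    Edge tiles are clipped to the raster extent rather than padded.
    A stride smaller than tile_size produces overlapping tiles.
    """
    stride = stride or tile_size
    for row_off in range(0, height, stride):
        for col_off in range(0, width, stride):
            yield (
                row_off,
                col_off,
                min(tile_size, height - row_off),
                min(tile_size, width - col_off),
            )
```

Each tuple can be converted to a `rasterio.windows.Window(col_off, row_off, width, height)`
for windowed reading; note that rasterio expects the column offset first.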


Software / Instrument-specific information
------------------------------------------

Preprocessing and GIS:
- Orfeo Toolbox (remote sensing preprocessing).  
- ArcGIS Pro 2.7 (spatial processing, map export, analysis).  
- ENVI 5.7 (image correction and feature support).  

Python environment:
- Python 3.8  
- rasterio 1.3.2  
- geopandas 0.13.2  
- plus standard ML and scientific libraries.

Model training and inference:
- Conducted in Google Colab GPU environments.  
- TensorFlow 2.7 (CNN baseline training).  
- PyTorch 2.8.0  
- torch_geometric 0.4.0 (GNN and cross-city modelling).  

Prediction outputs were exported to ArcGIS Pro for spatial evaluation, visualisation,
and figure preparation.
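
To compare a local Python environment against the package versions listed above, a check
along the following lines may help. This is a hedged sketch: the version numbers are
copied from this README, whereas the distribution names in `PINNED` and the prefix-matching
rule in `matches` are assumptions and may need adjusting for a given installation.

```python
from importlib import metadata  # part of the standard library since Python 3.8

# Versions as listed in this README; the distribution names are assumptions
# about how each package is published.
PINNED = {
    "rasterio": "1.3.2",
    "geopandas": "0.13.2",
    "tensorflow": "2.7",
    "torch": "2.8.0",
    "torch_geometric": "0.4.0",
}


def matches(installed, wanted):
    """True if `installed` equals `wanted` or is a patch release of it."""
    base = installed.split("+")[0]  # drop local build tags such as "+cu126"
    return base == wanted or base.startswith(wanted + ".")


def version_mismatches(pinned):
    """Return {name: (wanted, installed)} for packages that do not match.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed is None or not matches(installed, wanted):
            mismatches[name] = (wanted, installed)
    return mismatches
```

Running `version_mismatches(PINNED)` returns an empty dictionary when every listed package
is installed at the stated version.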


Standards and calibration information
-------------------------------------

Not applicable (data products were generated with established remote sensing and GIS
processing methods and from publicly available noise products).


Environmental / experimental conditions
---------------------------------------

Not applicable (the author conducted no field experiments for this dataset;
all measurement data included are derived or model-generated).


Quality-assurance procedures
----------------------------

- Consistency checks of projections, grid alignment, and spatial extents across cities.  
- Feature sanity checks (range validation, missing-value screening).  
- Cross-validation and spatial leakage-avoidance splits as documented in the thesis.  
- Visual inspection of predicted and intermediate maps prior to final export.
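
The idea behind a spatial leakage-avoidance split can be illustrated with a simple
block-based scheme. This is a generic sketch, not the procedure documented in the thesis:
the square blocks, the block size, the test fraction, and the function names are all
assumptions made for illustration.

```python
import math
import random


def block_id(x, y, block_size):
    """Identify the square spatial block containing a projected coordinate."""
    return (math.floor(x / block_size), math.floor(y / block_size))


def spatial_block_split(points, block_size, test_fraction=0.2, seed=0):
    """Split (x, y, ...) records so that no block contributes to both sets."""
    blocks = sorted({block_id(p[0], p[1], block_size) for p in points})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(blocks)
    n_test = max(1, round(len(blocks) * test_fraction))
    test_blocks = set(blocks[:n_test])
    train, test = [], []
    for p in points:
        in_test = block_id(p[0], p[1], block_size) in test_blocks
        (test if in_test else train).append(p)
    return train, test
```

Holding out whole blocks, rather than individual points, keeps neighbouring and therefore
spatially correlated samples on the same side of the split.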


People involved
---------------

Data curation and submission:
    - Feiyu Zhu (University of Southampton)



--------------------------
LEGAL & ETHICS INFORMATION
--------------------------

Ethical approval:
    No ethical approval was required for this dataset, as it does not involve human
    participants or personal / identifiable data.

Personal data:
    This dataset contains no personally identifiable information. All spatial layers
    describe environmental exposures and urban form. Southampton measurement data are
    not redistributed due to third-party ownership.



-------------------------
DATA-SPECIFIC INFORMATION
-------------------------

Given the multi-chapter structure, data-specific details are documented within each
chapter folder README.md. The key files include:

- Chapter-level training / inference scripts (Python).  
- Chapter-level processed datasets and feature tables (as described per chapter).  
- Chapter-level results/ folders containing predicted noise maps in .shp and/or raster
  formats.  
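
Because a Shapefile is stored as several companion files, it can be useful to confirm that
each .shp in a results/ folder arrived with its sidecars. The sketch below is illustrative
only: the function name `incomplete_shapefiles` is hypothetical, and treating .prj as
recommended rather than required is a choice made for this example.

```python
from pathlib import Path

REQUIRED_SIDECARS = (".shx", ".dbf")  # mandatory companions of a .shp file
RECOMMENDED_SIDECARS = (".prj",)      # stores the coordinate reference system


def incomplete_shapefiles(results_dir):
    """Map each .shp under `results_dir` to the companion suffixes it lacks."""
    problems = {}
    for shp in sorted(Path(results_dir).rglob("*.shp")):
        missing = [
            suffix
            for suffix in REQUIRED_SIDECARS + RECOMMENDED_SIDECARS
            if not shp.with_suffix(suffix).exists()
        ]
        if missing:
            problems[shp.name] = missing
    return problems
```

An empty dictionary means every Shapefile found has its index, attribute table, and
projection file alongside it.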

Users should refer to:
- chapter4_efficientnet/README.md  
- chapter5_gnn_southampton/README.md  
- chapter6_crosscity_generation/README.md  

for file-by-file variable definitions, model inputs/outputs, and experiment descriptions.


Date that the file was created: November 2025