READ ME File For (ACSIncome, Healthcare and TPC-H) Datasets

Dataset DOI: https://doi.org/10.5258/SOTON/D3798

Licence Information: CC-BY

ReadMe Author: Shatha Algarni, University of Southampton
ORCID ID: https://orcid.org/0009-0004-9831-6034

These datasets support the thesis entitled: Efficient Query Repair for Aggregate Constraints
AWARDED BY: University of Southampton
DATE OF AWARD: [2026]

--------------------
DATA & FILE OVERVIEW
--------------------

This project uses two real-world fairness benchmarks and one standard database benchmark to evaluate fairness-aware and constraint-aware query repair methods. All datasets are widely used in prior research and represent realistic, heterogeneous data. Each dataset is provided as structured tabular data in CSV format, a standard representation for census, healthcare, and benchmark analytical workloads.

1. ACSIncome Dataset [ACSIncome_state_number1.csv.zip]

ACSIncome is derived from the Adult Census Income data. It contains individual-level records from the 1994 United States Census, with 14 attributes describing demographic and employment characteristics such as age, gender, race, education, working hours, occupation, and income level. The dataset is commonly used in fairness and bias evaluation studies to assess bias in socioeconomic decision-making, and it serves as a representative example of fairness-critical applications where selection conditions can reflect real hiring or income-related disparities.
Source: Friedler et al., A Comparative Study of Fairness-Enhancing Interventions in Machine Learning, FAT* 2019
Paper: https://arxiv.org/abs/1802.04422

2. Healthcare Dataset [healthcare_800_numerical.csv.zip]

The Healthcare dataset simulates medical decision-making scenarios with features such as income, number of children, county, and health complications. It allows the fairness and robustness of prescreening queries to be evaluated in healthcare contexts.
Source: Grabberger et al., Fairness in Data Management Systems, 2021
Paper: https://arxiv.org/abs/2106.12588

Together, the ACSIncome and Healthcare datasets cover two distinct fairness domains, socioeconomic decision-making and healthcare screening, which makes them suitable and representative benchmarks for evaluating fairness-aware query repair.

3. TPC-H Benchmark [TPCH.zip]

In addition to the fairness-focused datasets, this project uses the TPC-H benchmark to demonstrate that the proposed query repair techniques are not limited to demographic fairness scenarios but generalise to other domains, such as business analytics and supply-chain optimisation. TPC-H models business and supply-chain data involving parts, suppliers, and nations, and is used here to test aggregate constraints (e.g., revenue bounds) rather than demographic fairness. All categorical columns are converted to numeric values, as the algorithms are designed for numerical data.
Source: Transaction Processing Performance Council (TPC)
Official specification: https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

All categorical attributes across datasets are converted into numerical representations, as the implemented algorithms operate exclusively on numerical data.
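
The categorical-to-numerical conversion described above can be illustrated with a minimal Python sketch. This README does not state which encoding scheme was used for the deposited files, so the scheme shown here (integer codes assigned in order of first appearance) is an assumption. The helper name `encode_categoricals`, the column names, and the sample values are hypothetical and are not taken from the datasets.

```python
import csv
import io


def encode_categoricals(rows, categorical_columns):
    """Replace string values in the given columns with integer codes.

    Codes are assigned per column in order of first appearance (0, 1, 2, ...).
    Returns the encoded rows and the per-column value-to-code mappings.
    This is an assumed scheme, not necessarily the one used for the deposit.
    """
    mappings = {col: {} for col in categorical_columns}
    encoded = []
    for row in rows:
        new_row = dict(row)
        for col in categorical_columns:
            mapping = mappings[col]
            # setdefault gives an unseen value the next unused code
            # and returns the existing code for a value already seen.
            new_row[col] = mapping.setdefault(row[col], len(mapping))
        encoded.append(new_row)
    return encoded, mappings


# Tiny made-up sample; column names and values are illustrative only.
sample_csv = io.StringIO(
    "age,gender,occupation\n"
    "39,Male,Clerical\n"
    "50,Female,Managerial\n"
    "38,Male,Managerial\n"
)
rows = list(csv.DictReader(sample_csv))
encoded_rows, mappings = encode_categoricals(rows, ["gender", "occupation"])
```

Keeping the returned `mappings` alongside the encoded rows makes it possible to translate numeric codes back to their original labels when interpreting a repaired query. Columns not listed in `categorical_columns` are left exactly as read by the `csv` module (as strings) and would still need to be cast to numeric types.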