University of Southampton Institutional Repository

GTV: generating tabular data via vertical federated learning

Zhao, Zilong
Wu, Han
van Moorsel, Aad
Chen, Lydia Y.

Zhao, Zilong, Wu, Han, van Moorsel, Aad and Chen, Lydia Y. (2025) GTV: generating tabular data via vertical federated learning. In The 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2025). 14 pp.

Record type: Conference or Workshop Item (Paper)

Abstract

Synthetic data has emerged as a promising avenue for privacy-preserving data sharing. However, constructing synthetic data generators necessitates access to the real dataset, posing challenges, particularly when data features are disparately distributed across different organizations.
Vertical Federated Learning (VFL) is a collaborative approach to training machine learning models among distinct tabular data holders, such as financial institutions, that possess disjoint features for the same group of customers. In this paper, we introduce the GTV framework for Generating Tabular Data via Vertical Federated Learning and demonstrate that VFL can be successfully used to implement GANs for distributed tabular data in a privacy-preserving manner, with performance close to that of centralized GANs, which assume shared data. We make design choices with respect to the distribution of the GAN generator and discriminator models, and we introduce a training-with-shuffling technique so that no party can reconstruct training data from the GAN conditional vector. The paper presents (1) an implementation of GTV, (2) a detailed quality evaluation of the GTV-generated synthetic data, (3) an examination of the GTV framework under different data distributions and numbers of clients, and (4) an analysis of GTV's robustness against Membership Inference Attacks under different Differential Privacy settings, for a range of datasets with diverse distribution characteristics. Our results demonstrate that GTV can consistently generate high-fidelity synthetic tabular data of comparable quality to that generated by a centralized GAN algorithm. The difference in machine learning utility can be as low as 2.7%, even under extremely imbalanced data distributions across clients. Code is available at: https://github.com/zhao-zilong/gtv
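
The vertical setting described in the abstract, in which parties hold disjoint features for the same customers, can be pictured with a small sketch. The Python below is illustrative only and is not taken from the GTV code base. The toy table, the function names, and the use of one seeded permutation shared by all clients are assumptions made for this example. The actual training-with-shuffling technique is defined in the paper and the linked repository.

```python
import random

# Hypothetical toy table: one row per customer, four features.
ROWS = [
    {"age": 34, "income": 52000, "balance": 1200, "defaulted": 0},
    {"age": 51, "income": 87000, "balance": 300, "defaulted": 1},
    {"age": 27, "income": 39000, "balance": 4500, "defaulted": 0},
    {"age": 45, "income": 61000, "balance": 800, "defaulted": 0},
]


def vertical_split(rows, feature_groups):
    """Give each client every row, but only the columns in its own group."""
    return [[{k: r[k] for k in group} for r in rows] for group in feature_groups]


def shared_shuffle(partitions, seed):
    """Apply the same row permutation to every client's partition.

    Assumption for this sketch: clients agree on a seed, so row i still
    refers to the same customer at every client after shuffling.
    """
    perm = list(range(len(partitions[0])))
    random.Random(seed).shuffle(perm)
    return [[part[i] for i in perm] for part in partitions]


def rejoin(partitions):
    """Concatenate columns row by row; only valid while partitions are aligned."""
    return [
        {k: v for piece in pieces for k, v in piece.items()}
        for pieces in zip(*partitions)
    ]
```

For example, `vertical_split(ROWS, [["age", "income"], ["balance", "defaulted"]])` yields two partitions with the same number of rows and no column in common. Passing them through `shared_shuffle` and then `rejoin` gives back the original records in a new order, which shows that positional alignment across clients survives the shuffle.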

Text
GTV___DSN2025 - Accepted Manuscript
Available under License Creative Commons Attribution.

More information

Published date: 23 June 2025
Venue - Dates: The 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Naples, Italy, 2025-06-23 - 2025-06-26

Identifiers

Local EPrints ID: 500979
URI: http://eprints.soton.ac.uk/id/eprint/500979
PURE UUID: c1e9d002-46a1-4de9-8fbf-40b103e38843

Catalogue record

Date deposited: 20 May 2025 16:40
Last modified: 20 May 2025 16:41

Contributors

Author: Zilong Zhao
Author: Han Wu
Author: Aad van Moorsel
Author: Lydia Y. Chen
