University of Southampton Institutional Repository

Adversarial defence without adversarial defence: instance-level principal component removal for robust language models

Wang, Yang, Xiao, Chenghao, Li, Yizhi, Middleton, Stuart E., Al Moubayed, Noura and Lin, Chenghua (2025) Adversarial defence without adversarial defence: instance-level principal component removal for robust language models. Transactions of the Association for Computational Linguistics. (In Press)

Record type: Article

Abstract

Pre-trained language models (PLMs) have driven substantial progress in natural language processing but remain vulnerable to adversarial attacks, raising concerns about their robustness in real-world applications. Previous studies have sought to mitigate the impact of adversarial attacks by introducing adversarial perturbations into the training process, either implicitly or explicitly. While both strategies enhance robustness, they often incur high computational costs. In this work, we propose a simple yet effective add-on module that enhances the adversarial robustness of PLMs by removing instance-level principal components, without relying on conventional adversarial defences or perturbing the original training data. Our approach transforms the embedding space to approximate Gaussian properties, thereby reducing its susceptibility to adversarial perturbations while preserving semantic relationships. This transformation aligns embedding distributions in a way that minimises the impact of adversarial noise on decision boundaries, enhancing robustness without requiring adversarial examples or costly training-time augmentation. Evaluations on eight benchmark datasets show that our approach improves adversarial robustness while maintaining before-attack accuracy comparable to baselines, achieving a balanced trade-off between robustness and generalisation.
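To make the mechanism concrete, the sketch below shows one plausible reading of instance-level principal component removal: for each input, the dominant principal directions of its token embedding matrix are estimated and projected out, nudging the embedding geometry towards the isotropic, Gaussian-like shape described above. The function name, the choice of k, and the NumPy-based formulation are illustrative assumptions, not the authors' released implementation.

import numpy as np

def remove_instance_principal_components(token_embeddings, k=1):
    # token_embeddings: (num_tokens, hidden_dim) array for a single instance.
    # Illustrative sketch of per-instance principal component removal;
    # not the authors' released code.
    mean = token_embeddings.mean(axis=0, keepdims=True)
    centred = token_embeddings - mean  # centre so principal directions are well defined
    # Rows of vt are the principal directions of this instance's embeddings.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    top = vt[:k]  # (k, hidden_dim): the k dominant directions
    # Subtract each embedding's projection onto the top-k directions.
    deflated = centred - centred @ top.T @ top
    # Restoring the mean is an assumption; some variants keep embeddings centred.
    return deflated + mean

# Toy usage: one instance with 32 tokens and 768-dimensional embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(32, 768))
cleaned = remove_instance_principal_components(emb, k=1)

Because the deflation is computed per instance, it introduces no adversarial examples or data augmentation into training, consistent with the abstract's claim of avoiding training-time cost.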

Text
tacl_adversarial_defence_without_adversarial_defence_final - Accepted Manuscript
Available under License Creative Commons Attribution.
Download (584kB)

More information

Submitted date: 20 May 2025
Accepted/In Press date: 30 June 2025
Keywords: NLP, Adversarial Defence

Identifiers

Local EPrints ID: 503087
URI: http://eprints.soton.ac.uk/id/eprint/503087
ISSN: 2307-387X
PURE UUID: d49b7526-a1a1-49a3-9bd6-a97d6c93350a
ORCID for Stuart E. Middleton: orcid.org/0000-0001-8305-8176

Catalogue record

Date deposited: 21 Jul 2025 16:46
Last modified: 22 Aug 2025 01:47

Contributors

Author: Yang Wang
Author: Chenghao Xiao
Author: Yizhi Li
Author: Stuart E. Middleton
Author: Noura Al Moubayed
Author: Chenghua Lin

