Adversarial defence without adversarial defence: instance-level principal component removal for robust language models
Pre-trained language models (PLMs) have driven substantial progress in natural language processing but remain vulnerable to adversarial attacks, raising concerns about their robustness in real-world applications. Previous studies have sought to mitigate the impact of adversarial attacks by introducing adversarial perturbations into the training process, either implicitly or explicitly. While both strategies enhance robustness, they often incur high computational costs. In this work, we propose a simple yet effective add-on module that enhances the adversarial robustness of PLMs by removing instance-level principal components, without relying on conventional adversarial defences or perturbing the original training data. Our approach transforms the embedding space to approximate Gaussian properties, thereby reducing its susceptibility to adversarial perturbations while preserving semantic relationships. This transformation aligns embedding distributions in a way that minimises the impact of adversarial noise on decision boundaries, enhancing robustness without requiring adversarial examples or costly training-time augmentation. Evaluations on eight benchmark datasets show that our approach improves adversarial robustness while maintaining before-attack accuracy comparable to that of baselines, achieving a balanced trade-off between robustness and generalisation.
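The abstract's core operation, removing instance-level principal components from an embedding matrix, can be sketched as follows. This is a minimal illustration assuming a standard SVD-based formulation (centre the instance's token embeddings, then subtract each token's projection onto the instance's top principal direction); the function name, the number of components removed, and where in the model this is applied are assumptions, not the paper's exact formulation.

```python
import numpy as np

def remove_top_components(token_embs: np.ndarray, k: int = 1) -> np.ndarray:
    """Remove the top-k principal components of a SINGLE instance.

    token_embs: (seq_len, dim) token embedding matrix for one input.
    Returns an array of the same shape with the instance mean and the
    projections onto its top-k principal directions subtracted.
    """
    # Centre within the instance so the SVD yields principal directions
    centred = token_embs - token_embs.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions of this one instance
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    top = vt[:k]  # (k, dim)
    # Subtract each token's projection onto the top-k directions
    return centred - centred @ top.T @ top
```

Applied per instance at inference time, such a transform needs no adversarial examples and no retraining, which is consistent with the "add-on module" framing above.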
Wang, Yang
a1ddc143-c2fe-4553-a6fe-0873b0b65534
Xiao, Chenghao
e153e2aa-b601-42dc-afcf-ff7ef576fac7
Li, Yizhi
9276116c-a82e-4bfc-b7a7-ac0665eebd69
Middleton, Stuart E.
404b62ba-d77e-476b-9775-32645b04473f
Al Moubayed, Noura
0af4e427-84a0-46b2-b8bb-2bc9f9e56fb3
Lin, Chenghua
16e5c90a-c4ab-4a39-82c7-a07bfc9171bd
Wang, Yang, Xiao, Chenghao, Li, Yizhi, Middleton, Stuart E., Al Moubayed, Noura and Lin, Chenghua (2025) Adversarial defence without adversarial defence: instance-level principal component removal for robust language models. Transactions of the Association for Computational Linguistics. (In Press)
Text: tacl_adversarial_defence_without_adversarial_defence_final (Accepted Manuscript)
More information
Submitted date: 20 May 2025
Accepted/In Press date: 30 June 2025
Keywords:
NLP, Adversarial Defence
Identifiers
Local EPrints ID: 503087
URI: http://eprints.soton.ac.uk/id/eprint/503087
ISSN: 2307-387X
PURE UUID: d49b7526-a1a1-49a3-9bd6-a97d6c93350a
Catalogue record
Date deposited: 21 Jul 2025 16:46
Last modified: 22 Aug 2025 01:47