University of Southampton Institutional Repository

Adversarial defence without adversarial defence: instance-level principal component removal for robust language models

Wang, Yang, Xiao, Chenghao, Li, Yizhi, Middleton, Stuart E., Al Moubayed, Noura and Lin, Chenghua (2025) Adversarial defence without adversarial defence: instance-level principal component removal for robust language models. Transactions of the Association for Computational Linguistics. (In Press)

Record type: Article

Abstract

Pre-trained language models (PLMs) have driven substantial progress in natural language processing but remain vulnerable to adversarial attacks, raising concerns about their robustness in real-world applications. Previous studies have sought to mitigate the impact of adversarial attacks by introducing adversarial perturbations into the training process, either implicitly or explicitly. While both strategies enhance robustness, they often incur high computational costs. In this work, we propose a simple yet effective add-on module that enhances the adversarial robustness of PLMs by removing instance-level principal components, without relying on conventional adversarial defences or perturbing the original training data. Our approach transforms the embedding space to approximate Gaussian properties, thereby reducing its susceptibility to adversarial perturbations while preserving semantic relationships. This transformation aligns embedding distributions in a way that minimises the impact of adversarial noise on decision boundaries, enhancing robustness without requiring adversarial examples or costly training-time augmentation. Evaluations on eight benchmark datasets show that our approach improves adversarial robustness while maintaining before-attack accuracy comparable to baselines, achieving a balanced trade-off between robustness and generalisation.
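To make the mechanism concrete, the sketch below shows one plausible reading of instance-level principal component removal: for each input, the dominant principal directions of its token embedding matrix are estimated and projected out, nudging the embedding geometry towards the isotropic, Gaussian-like shape described above. The function name, the choice of k, and the NumPy-based formulation are illustrative assumptions, not the authors' released implementation.

import numpy as np

def remove_instance_principal_components(token_embeddings, k=1):
    # token_embeddings: (num_tokens, hidden_dim) array for a single instance.
    # Illustrative sketch of per-instance principal component removal;
    # not the authors' released code.
    mean = token_embeddings.mean(axis=0, keepdims=True)
    centred = token_embeddings - mean  # centre so principal directions are well defined
    # Rows of vt are the principal directions of this instance's embeddings.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    top = vt[:k]  # (k, hidden_dim): the k dominant directions
    # Subtract each embedding's projection onto the top-k directions.
    deflated = centred - centred @ top.T @ top
    # Restoring the mean is an assumption; some variants keep embeddings centred.
    return deflated + mean

# Toy usage: one instance with 32 tokens and 768-dimensional embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(32, 768))
cleaned = remove_instance_principal_components(emb, k=1)

Because the deflation is computed per instance, it introduces no adversarial examples or data augmentation into training, consistent with the abstract's claim of avoiding training-time cost.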

Text
tacl_adversarial_defence_without_adversarial_defence_final - Accepted Manuscript
Available under License Creative Commons Attribution.
Download (584kB)

More information

Submitted date: 20 May 2025
Accepted/In Press date: 30 June 2025
Keywords: NLP, Adversarial Defence

Identifiers

Local EPrints ID: 503087
URI: http://eprints.soton.ac.uk/id/eprint/503087
ISSN: 2307-387X
PURE UUID: d49b7526-a1a1-49a3-9bd6-a97d6c93350a
ORCID for Stuart E. Middleton: orcid.org/0000-0001-8305-8176

Catalogue record

Date deposited: 21 Jul 2025 16:46
Last modified: 22 Aug 2025 01:47

Contributors

Author: Yang Wang
Author: Chenghao Xiao
Author: Yizhi Li
Author: Stuart E. Middleton
Author: Noura Al Moubayed
Author: Chenghua Lin

