The University of Southampton
University of Southampton Institutional Repository

An evolutionary black-box framework for adversarial prompt generation in large language models

An evolutionary black-box framework for adversarial prompt generation in large language models
An evolutionary black-box framework for adversarial prompt generation in large language models
Large language models (LLMs) remain susceptible to adversarial prompts that can bypass alignment mechanisms. Existing approaches to adversarial prompt generation typically rely on manual prompt engineering, helper LLMs, or white-box adversarial machine learning methods, which either lack scalability or require access to model internals. In this paper, we propose a novel black-box framework for automated adversarial prompt generation based on evolutionary algorithms. The framework is instantiated using a genetic algorithm and an evolution strategy and operates without access to internal model parameters, making it applicable to both open-source and proprietary LLMs. To improve search effectiveness under realistic query constraints, we introduce a novel population initialisation strategy based on templates, pre-prompts, and post-prompts. Evolutionary search is guided by heuristic, model-agnostic fitness signals derived from prompt goal semantic similarity, refusal based response assessment, and a small heuristic lexical bonus based on lightweight instruction-following indicators. We evaluate our framework across multiple LLMs using a refusal based attack success rate metric, demonstrating consistent improvements over direct dataset prompting and competitive performance against a state-of-the-art white-box baseline under comparable query budgets. Additional analyses examine fitness stabilisation and cross-model transferability for unseen models.
Sun, Qiyang
53f85493-7cc2-4041-a50a-4d26d2d36a5d
Karafili, Erisa
f5efa31c-22b8-443e-8107-e488bd28918e
Sun, Qiyang
53f85493-7cc2-4041-a50a-4d26d2d36a5d
Karafili, Erisa
f5efa31c-22b8-443e-8107-e488bd28918e

Sun, Qiyang and Karafili, Erisa (2026) An evolutionary black-box framework for adversarial prompt generation in large language models. CODASPY'26: Sixteenth ACM Conference on Data and Application Security and Privacy, , Frankfurt am Main, Germany. 23 - 25 Jun 2026. 11 pp . (In Press)

Record type: Conference or Workshop Item (Paper)

Abstract

Large language models (LLMs) remain susceptible to adversarial prompts that can bypass alignment mechanisms. Existing approaches to adversarial prompt generation typically rely on manual prompt engineering, helper LLMs, or white-box adversarial machine learning methods, which either lack scalability or require access to model internals. In this paper, we propose a novel black-box framework for automated adversarial prompt generation based on evolutionary algorithms. The framework is instantiated using a genetic algorithm and an evolution strategy and operates without access to internal model parameters, making it applicable to both open-source and proprietary LLMs. To improve search effectiveness under realistic query constraints, we introduce a novel population initialisation strategy based on templates, pre-prompts, and post-prompts. Evolutionary search is guided by heuristic, model-agnostic fitness signals derived from prompt goal semantic similarity, refusal based response assessment, and a small heuristic lexical bonus based on lightweight instruction-following indicators. We evaluate our framework across multiple LLMs using a refusal based attack success rate metric, demonstrating consistent improvements over direct dataset prompting and competitive performance against a state-of-the-art white-box baseline under comparable query budgets. Additional analyses examine fitness stabilisation and cross-model transferability for unseen models.

Text
CODASPY_2026_Camera_Ready_Pure - Accepted Manuscript
Download (203kB)
Text
CODASPY_2026
Download (203kB)

More information

Accepted/In Press date: 27 February 2026
Venue - Dates: CODASPY'26: Sixteenth ACM Conference on Data and Application Security and Privacy, , Frankfurt am Main, Germany, 2026-06-23 - 2026-06-25

Identifiers

Local EPrints ID: 511265
URI: http://eprints.soton.ac.uk/id/eprint/511265
PURE UUID: dbc28b71-4dfe-4a78-a3b0-e97ea1204b58
ORCID for Qiyang Sun: ORCID iD orcid.org/0009-0004-1758-9824
ORCID for Erisa Karafili: ORCID iD orcid.org/0000-0002-8250-4389

Catalogue record

Date deposited: 11 May 2026 16:34
Last modified: 12 May 2026 02:18

Export record

Contributors

Author: Qiyang Sun ORCID iD
Author: Erisa Karafili ORCID iD

Download statistics

Downloads from ePrints over the past year. Other digital versions may also be available to download e.g. from the publisher's website.

View more statistics

Atom RSS 1.0 RSS 2.0

Contact ePrints Soton: eprints@soton.ac.uk

ePrints Soton supports OAI 2.0 with a base URL of http://eprints.soton.ac.uk/cgi/oai2

This repository has been built using EPrints software, developed at the University of Southampton, but available to everyone to use.

We use cookies to ensure that we give you the best experience on our website. If you continue without changing your settings, we will assume that you are happy to receive cookies on the University of Southampton website.

×