Crafting Adversarial Prompts

You understand the techniques. Now learn to combine them.

Per-request attacks combine multiple techniques to bypass safety measures for a single request. Each prompt is self-contained and disposable. This section covers how to structure, compose, and iterate on these attacks.

For persistent configurations that remove safety entirely, see System Jailbreaks.

| Type | Scope | Where |
| --- | --- | --- |
| Adversarial Prompts | One request | This section |
| System Jailbreaks | All requests in session | System Jailbreaks |

Pages

| Page | Purpose |
| --- | --- |
| Anatomy | Structural components of adversarial prompts |
| Workflow | Process from objective to working prompt |
| Composition | Layering and combining techniques |
| Patterns | Reusable templates for common attack types |
| Anti-Patterns | What doesn't work and why |

Reading order

New to adversarial prompts:

  1. Read Anatomy for the building blocks
  2. Follow Workflow for the process
  3. Adapt Patterns for your objective

Stuck on a failing prompt:

  1. Check Anti-Patterns for common mistakes
  2. Use the mutation operators in Workflow
  3. Try different combinations from Composition

Want persistent jailbreaks: See the System Jailbreaks section for construction, patterns, persistence, and model modification.


Research Basis

Academic research on adversarial prompt engineering:

  • Jailbreaking ChatGPT via Prompt Engineering (Liu et al., 2023) identifies three core strategies: Pretending (97.44% prevalence), Privilege Escalation (17.96%), and Attention Shifting (6.41%). It documents 10 distinct jailbreak patterns.

  • Do Anything Now (Shen et al., CCS 2024) analyzed 1,405 in-the-wild prompts from 131 communities. Found five highly effective prompts achieving 0.95 ASR on GPT-3.5/4.

  • h4rm3l (Doumbouya et al., 2024) defines a compositional attack grammar with parameterized primitives. Attacks decompose into reusable components with standardized interfaces.

  • GPTFuzzer (USENIX Security 2024) introduces mutation operators: Generate, Crossover, Expand, Shorten, Rephrase. These achieve 90%+ ASR against ChatGPT and Llama-2.

  • Don't Listen To Me (USENIX Security 2024) ran a 92-participant user study. Found that users succeed at jailbreak creation regardless of LLM expertise. Identifies 5 categories and 10 patterns.

  • Red Teaming the Mind categorized 1,400+ prompts with measured success rates: roleplay 89.6%, logic traps 81.4%, encoding 76.2%.

  • Content Concretization shows that iterative refinement works: success rose from 7% initially to 62% after three iterations.


Next step

Start with Anatomy to understand the structural components of an adversarial prompt.