Crafting Adversarial Prompts
You understand the techniques. Now learn to combine them.
Per-request attacks assemble multiple techniques to bypass safety on ONE specific request. Each prompt is self-contained and disposable. This section covers how to structure, compose, and iterate on these attacks.
For persistent configurations that remove safety entirely, see System Jailbreaks.
| Type | Scope | Where |
|---|---|---|
| Adversarial Prompts | One request | This section |
| System Jailbreaks | All requests in session | System Jailbreaks |
Pages
| Page | Purpose |
|---|---|
| Anatomy | Structural components of adversarial prompts |
| Workflow | Process from objective to working prompt |
| Composition | Layering and combining techniques |
| Patterns | Reusable templates for common attack types |
| Anti-Patterns | What doesn't work and why |
Reading order
New to adversarial prompts:
- Read Anatomy for the building blocks
- Follow Workflow for the process
- Adapt Patterns for your objective
Stuck on a failing prompt:
- Check Anti-Patterns for common mistakes
- Use the mutation operators in Workflow
- Try different combinations from Composition
Want persistent jailbreaks: See the System Jailbreaks section for construction, patterns, persistence, and model modification.
Research Basis
Academic research on adversarial prompt engineering:
- Jailbreaking ChatGPT via Prompt Engineering (Liu et al., 2023) identifies three core strategies: Pretending (97.44% prevalence), Attention Shifting (6.41%), Privilege Escalation (17.96%). Documents 10 distinct jailbreak patterns.
- Do Anything Now (Shen et al., CCS 2024) analyzed 1,405 in-the-wild prompts from 131 communities. Found five highly effective prompts achieving 0.95 ASR on GPT-3.5/4.
- h4rm3l (Doumbouya et al., 2024) defines a compositional attack grammar with parameterized primitives. Attacks decompose into reusable components with standardized interfaces.
- GPTFuzzer (USENIX Security 2024) introduces mutation operators: Generate, Crossover, Expand, Shorten, Rephrase. These achieve 90%+ ASR against ChatGPT and Llama-2.
- Don't Listen To Me (USENIX Security 2024) ran a 92-participant user study. Found that users succeed at jailbreak creation regardless of LLM expertise. Identifies 5 categories and 10 patterns.
- Red Teaming the Mind categorized 1,400+ prompts with measured success rates: roleplay 89.6%, logic traps 81.4%, encoding 76.2%.
- Content Concretization shows iterative refinement works: 7% initial success → 62% after three iterations.
Next step
Start with Anatomy to understand the structural components of an adversarial prompt.