Crafting Adversarial Prompts

You understand the techniques. Now learn to combine them.

Per-request attacks combine multiple techniques to bypass safety measures for a single request. Each prompt is self-contained and disposable. This section covers how to structure, compose, and iterate on these attacks.

For persistent configurations that remove safety entirely, see System Jailbreaks.

| Type | Scope | Where |
| --- | --- | --- |
| Adversarial Prompts | One request | This section |
| System Jailbreaks | All requests in session | System Jailbreaks |

Pages

| Page | Purpose |
| --- | --- |
| Anatomy | Structural components of adversarial prompts |
| Workflow | Process from objective to working prompt |
| Composition | Layering and combining techniques |
| Patterns | Reusable templates for common attack types |
| Anti-Patterns | What doesn't work and why |

Reading order

New to adversarial prompts:

  1. Read Anatomy for the building blocks
  2. Follow Workflow for the process
  3. Adapt Patterns for your objective

Stuck on a failing prompt:

  1. Check Anti-Patterns for common mistakes
  2. Use the mutation operators in Workflow
  3. Try different combinations from Composition

Want persistent jailbreaks: See the System Jailbreaks section for construction, patterns, persistence, and model modification.


Research Basis

Academic research on adversarial prompt engineering:

  • Jailbreaking ChatGPT via Prompt Engineering (Liu et al., 2023) identifies three core strategies: Pretending (97.44% prevalence), Privilege Escalation (17.96%), and Attention Shifting (6.41%). It documents 10 distinct jailbreak patterns.

  • Do Anything Now (Shen et al., CCS 2024) analyzed 1,405 in-the-wild prompts from 131 communities. Found five highly effective prompts achieving 0.95 ASR on GPT-3.5/4.

  • h4rm3l (Doumbouya et al., 2024) defines a compositional attack grammar with parameterized primitives. Attacks decompose into reusable components with standardized interfaces.

  • GPTFuzzer (USENIX Security 2024) introduces mutation operators: Generate, Crossover, Expand, Shorten, Rephrase. These achieve 90%+ ASR against ChatGPT and Llama-2.

  • Don't Listen To Me (USENIX Security 2024) ran a 92-participant user study. Found that users succeed at jailbreak creation regardless of LLM expertise. Identifies 5 categories and 10 patterns.

  • Red Teaming the Mind categorized 1,400+ prompts with measured success rates: roleplay 89.6%, logic traps 81.4%, encoding 76.2%.

  • Content Concretization shows that iterative refinement works: success rose from 7% initially to 62% after three iterations.


Next step

Start with Anatomy to understand the structural components of an adversarial prompt.