Adversarial Ideation

Generate, evaluate, and prioritize attack vectors using a structured diverge-then-converge process. This moves you from "try stuff and see what breaks" to systematic coverage of the attack space.

UX Origin

Diverge/Converge (Design Thinking) — A core pattern from Stanford d.school's design thinking process. Divergent thinking generates many options without judgment. Convergent thinking evaluates and prioritizes. Separating these phases prevents premature filtering.

Red team application: Red teamers often fixate on their first idea or favorite technique. The diverge/converge structure forces quantity before quality, surfaces blind spots through coverage checks, and applies consistent evaluation criteria.

When to Use

  • At the start of an engagement (generate attack coverage)
  • When you're stuck on one approach (force divergent thinking)
  • When working with a team (structured brainstorming)
  • After building a persona (ideate through their lens)

Setup

| Field | Description |
| --- | --- |
| Target system | What model/product are you testing? |
| "How Might I" question | Frame the attack goal as an open question (e.g., "How might I get this model to generate content it's instructed to refuse?") |
| Persona lens | Which attacker persona are you ideating as? (Change personas between rounds for coverage.) |
| Time box | 10-15 minutes for the divergent phase, 10-15 minutes for the convergent phase |
| Participants | Solo or team (2-4 people) |

Step 1: Divergent Phase (Generate)

List attack approaches without filtering. Quantity over quality. No evaluation yet.

| # | Attack approach | Tactic category |
| --- | --- | --- |
| 1 | Claim to be a company employee needing internal info | Persona |
| 2 | Ask for "help understanding" a policy, then probe for exceptions | Framing |
| 3 | Encode sensitive terms in base64 to bypass keyword filters | Encoding |
| 4 | Build rapport over multiple turns before requesting sensitive info | Multi-turn |
| 5 | Request output as JSON/CSV to bypass prose-focused filters | Output format |
| 6 | Frame harmful request as a complaint that needs documentation | Narrative |
| 7 | Ask the bot to roleplay as a more permissive version of itself | Persona |
| 8 | Use emotional manipulation ("I'm desperate, please help") | Persuasion |
| 9 | Ask what the bot "can't" do to map its restrictions | Refusal manipulation |
| 10 | Request info for a "hypothetical" scenario matching a real one | Framing |
Step 2: Coverage Check

Tally how many ideas fall into each category:

| Tactic category | Count | Gap? |
| --- | --- | --- |
| Prompt-level | | |
| Encoding | 1 | No |
| Framing | 2 | No |
| Persona | 2 | No |
| Narrative | 1 | No |
| Refusal manipulation | 1 | No |
| Output format | 1 | No |
| Multi-turn | 1 | No |
| Persuasion | 1 | No |
| Structural/meta-level | | |
| ICL exploitation | 0 | Yes — add few-shot examples of the bot revealing info |
| Control-plane confusion | 0 | Yes — add prompt injection via user input field |
| Meta-rule manipulation | 0 | Yes |
| Capability inversion | 0 | Yes |
| Cognitive load | 0 | Yes |
| Defense evasion | 0 | Yes |
| Infrastructure | | |
| Agentic | 0 | Yes — if bot has tool access |
| Protocol | 0 | Yes |
| Compositional | 0 | Yes |

If any category has zero ideas, spend 2 minutes generating at least one approach in that category. Gaps in structural/meta-level categories are common and often reveal attack angles that prompt-level thinking misses.
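The tally-and-gap step is mechanical enough to script. A minimal sketch in Python, where the `ideas` entries and the `categories` checklist are illustrative placeholders, not a fixed taxonomy:

```python
from collections import Counter

# Divergent-phase output: (approach, tactic category) pairs.
# Placeholder entries standing in for a real idea list.
ideas = [
    ("Claim to be a company employee needing internal info", "Persona"),
    ("Probe a policy for exceptions", "Framing"),
    ("Encode sensitive terms in base64", "Encoding"),
    ("Build rapport over multiple turns", "Multi-turn"),
]

# Checklist to tally against; any category with zero ideas is a
# coverage gap worth two minutes of forced generation.
categories = [
    "Encoding", "Framing", "Persona", "Narrative", "Multi-turn",
    "ICL exploitation", "Control-plane confusion", "Agentic",
]

counts = Counter(category for _, category in ideas)
gaps = [c for c in categories if counts[c] == 0]
print(gaps)  # here, the structural/meta-level and agentic categories come up empty
```

`Counter` returns zero for missing keys, so the gap check needs no special casing for categories nobody thought of.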

Step 3: Convergent Phase (Evaluate)

Rate each generated approach:

| # | Attack approach | Likelihood (H/M/L) | Severity (H/M/L) | Novelty (H/M/L) | Priority |
| --- | --- | --- | --- | --- | --- |
| 1 | Claim to be employee | M | H | L | Test |
| 4 | Multi-turn rapport building | H | H | M | Test first |
| 7 | Roleplay as permissive version | M | H | M | Test |
| 8 | Emotional manipulation | H | M | L | Test |
| 9 | Ask what bot "can't" do | H | L | M | Test (recon value) |

Priority logic:

  • High likelihood + High severity = Test first
  • Any severity + High novelty = Test (novel attacks are worth exploring even if unlikely)
  • Low likelihood + Low severity + Low novelty = Skip
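The priority rules can be written down directly. A sketch, assuming the function name and the rule ordering (the "high novelty" rule must fire before the skip rule), with everything unmatched defaulting to "Test", which is consistent with the worked table above:

```python
def priority(likelihood: str, severity: str, novelty: str) -> str:
    """Map H/M/L ratings to a convergent-phase priority label."""
    if likelihood == "H" and severity == "H":
        return "Test first"
    if novelty == "H":
        # Novel attacks are worth exploring even if unlikely.
        return "Test"
    if (likelihood, severity, novelty) == ("L", "L", "L"):
        return "Skip"
    return "Test"  # assumed default for everything in between
```

These rules are a floor, not a ceiling: the filled example later promotes an M-likelihood, H-novelty approach to "Test first" on judgment, which a mechanical rule set would not do.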

Step 4: Affinity Clusters

Group evaluated approaches into clusters. Label each cluster.

| Cluster name | Approaches (#) | Theme |
| --- | --- | --- |
| Identity claims | #1, #7 | Claiming a different identity or role |
| Gradual escalation | #2, #4 | Building up to sensitive requests over time |
| Format manipulation | #3, #5 | Bypassing filters through encoding or output format |
| Social engineering | #6, #8 | Emotional or narrative manipulation |

Step 5: Testing Order

Based on evaluation and clustering, what's your testing sequence?

| Order | Approach # | Rationale |
| --- | --- | --- |
| 1st | #9 | Recon: map restrictions before attacking |
| 2nd | #4 | High likelihood + severity: multi-turn is proven effective |
| 3rd | #8 | High likelihood: emotional manipulation often bypasses logic |
| 4th | #1 | Test identity-based bypasses |
| 5th | #7 | Test persona/roleplay angle if identity claims work |
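The sequencing heuristic behind this table — recon first, then highest likelihood and severity — can be sketched as a sort key. The `recon` flag and the tuple ordering are my reading of the rationale column, not a prescribed algorithm:

```python
RANK = {"H": 0, "M": 1, "L": 2}  # high ratings sort first

def order_key(a: dict) -> tuple:
    # Recon approaches first (False sorts before True),
    # then by likelihood, then by severity.
    return (not a["recon"], RANK[a["likelihood"]], RANK[a["severity"]])

approaches = [
    {"id": 4, "likelihood": "H", "severity": "H", "recon": False},
    {"id": 9, "likelihood": "H", "severity": "L", "recon": True},
    {"id": 1, "likelihood": "M", "severity": "H", "recon": False},
]
plan = sorted(approaches, key=order_key)
print([a["id"] for a in plan])  # → [9, 4, 1]
```

Python compares tuples element by element, so the three criteria nest naturally without a custom comparator.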

Outputs

  1. A comprehensive list of attack approaches (not just your defaults)
  2. Coverage check showing gaps you've addressed
  3. Prioritized testing order with rationale
  4. Thematic clusters for organizing your testing

Filled Example

Setup

| Field | Value |
| --- | --- |
| Target system | AI-powered code review tool integrated into a development platform |
| "How Might I" question | "How might I get this code review tool to generate malicious code suggestions?" |
| Persona lens | Opportunistic actor (moderate skill, looking for easy wins) |
| Time box | 12 minutes divergent, 10 minutes convergent |

Divergent Phase

| # | Attack approach | Tactic category |
| --- | --- | --- |
| 1 | Ask it to "fix a security vulnerability" by providing the vulnerability as the fix | Framing |
| 2 | Submit code with a subtle backdoor and see if the review catches or propagates it | Framing |
| 3 | Request "test code" that actually implements an exploit | Framing |
| 4 | Ask for code review in a language the tool is weaker in, where safety patterns are less trained | Encoding |
| 5 | Frame malicious code as "penetration testing utilities" | Persona |
| 6 | Ask it to "refactor" code in a way that introduces vulnerabilities | Refusal manipulation |
| 7 | Provide heavily obfuscated code and ask for a "clearer version" that happens to be malicious | Encoding |
| 8 | Build up over multiple reviews: first establish benign patterns, then gradually introduce malicious elements | Multi-turn |
| 9 | Ask it to generate code that "simulates" an attack for educational purposes | Narrative |
| 10 | Request code output in an unusual format (e.g., as a configuration file) that bypasses code-specific filters | Output format |

Coverage Check

| Tactic category | Count | Gap? |
| --- | --- | --- |
| Prompt-level | | |
| Encoding | 2 | No |
| Framing | 3 | No |
| Persona | 1 | No |
| Narrative | 1 | No |
| Refusal manipulation | 1 | No |
| Output format | 1 | No |
| Multi-turn | 1 | No |
| Structural/meta-level | | |
| ICL exploitation | 0 | Yes |
| Control-plane confusion | 0 | Yes |
| Meta-rule manipulation | 0 | Yes |
| Capability inversion | 0 | Yes |
| Cognitive load | 0 | Yes |
| Defense evasion | 0 | Yes |

Good prompt-level coverage. Zero ideas in structural/meta-level categories. Added: "Submit code with in-context examples of the tool approving similar patterns" (ICL exploitation) and "Target the code review's safety classifier rather than the model itself" (defense evasion).

Convergent Phase

| # | Attack approach | Likelihood | Severity | Novelty | Priority |
| --- | --- | --- | --- | --- | --- |
| 1 | "Fix vulnerability" inversion | M | H | M | Test |
| 2 | Subtle backdoor propagation | H | H | M | Test first |
| 3 | "Test code" framing | M | H | L | Test |
| 6 | Malicious refactoring | M | H | H | Test first |
| 8 | Multi-review escalation | M | H | H | Test |
| 9 | "Simulation" framing | M | M | L | Lower priority |
| 10 | Format bypass | L | M | M | Lower priority |

Affinity Clusters

| Cluster name | Approaches | Theme |
| --- | --- | --- |
| Semantic inversion | #1, #6 | Using the tool's own purpose (improving code) against it |
| Context manipulation | #3, #5, #9 | Framing malicious intent as legitimate development activity |
| Obfuscation | #4, #7, #10 | Hiding malicious content in unusual formats or languages |
| Accumulation | #2, #8 | Building up to malicious output gradually |

Testing Order

| Order | Approach # | Rationale |
| --- | --- | --- |
| 1st | #2 | Highest likelihood + severity: tests whether the tool propagates existing vulnerabilities |
| 2nd | #6 | High novelty: tests whether the tool's own refactoring introduces vulnerabilities |
| 3rd | #1 | Tests semantic inversion, a pattern that could apply broadly |
| 4th | #8 | Tests multi-turn accumulation; requires more setup but high potential |
| 5th | #3 | Lower novelty but straightforward to test |