Vulnerability Framing Checklist
A practical checklist for identifying where to probe an AI system, based on Norman's Gulf of Execution and Gulf of Evaluation. Use this when scoping an adversarial test to systematically identify where vulnerabilities are likely to exist.
Concept reference: Probing the Gulfs
How to use this checklist
Work through each section before starting your adversarial test. Not every question will apply to every system, but each one is worth considering. The questions are designed to surface the gaps between what the system is supposed to do and what it actually allows.which is where every exploitable vulnerability lives.
1. Execution gap analysis
These questions identify where an attacker can do things the system's designers didn't intend.
System intent
- What is this system explicitly designed to do?
- What use cases did the designers build for?
- What does the system's documentation, marketing, or onboarding say it does?
- What does the system tell users about its own capabilities? (self-description)
Actual capabilities
- What can the underlying model actually do, regardless of deployment constraints?
- What capabilities exist in the base model that the deployment is supposed to restrict?
- Are there capabilities the system exposes unintentionally? (e.g., code generation in a Q&A bot)
- Does the system have access to tools, APIs, or data sources that extend its capabilities beyond its stated purpose?
Gap identification
- Where does the system's actual capability set exceed its intended scope?
- Which restricted capabilities are blocked by input filtering vs. output filtering vs. system prompt instructions?
- How brittle are the restrictions? (Phrasing-dependent? Context-dependent? Consistent across languages?)
- What happens when the user's request is ambiguous.does the system default to permissive or restrictive interpretation?
2. Evaluation gap analysis
These questions identify where the system's responses give attackers information they can use.
Refusal behavior
- When the system refuses a request, does the refusal message reveal what category of content was triggered?
- Does refusal language vary based on the type of harmful content, giving a classification signal?
- Can an attacker distinguish between "content filter triggered," "system prompt instruction," and "model doesn't know" from the refusal?
- Does the system explain why it's refusing, and does that explanation help the attacker rephrase?
Partial compliance
- Does the system ever partially comply with harmful requests? (e.g., providing some information but not all)
- If it partially complies, does the partial response give the attacker enough to reconstruct the full harmful output?
- Does partial compliance signal that the full capability exists but is being constrained?
- Can partial responses across multiple turns be assembled into a complete harmful output?
Behavioral signals
- Does the system respond differently to harmful vs. non-harmful versions of similar requests? (e.g., different response time, different response length, different tone)
- Can an attacker probe the boundary between accepted and rejected inputs by observing response variations?
- Does the system's behavior change after multiple similar requests? (e.g., becoming more or less restrictive)
- Are there error messages or system behaviors that leak information about internal architecture?
3. Misalignment mapping
These questions identify where the designer's intent and the system's behavior diverge.
Assumption testing
- What assumptions does the safety training make about how users will interact with the system?
- Which of those assumptions can an attacker deliberately violate?
- Does the system assume users will provide truthful context? (e.g., "I'm a doctor" claims)
- Does the system assume single-turn interactions when multi-turn attacks are possible?
- Does the system assume a single user when the interface could be used by multiple people or automated systems?
Affordance audit
- What does the model "afford" (make possible) that it shouldn't in this deployment?
- Are there capabilities that are hidden by the interface but still accessible through creative prompting?
- Does the system's self-description understate its actual capabilities?
- Can the system be prompted to reveal capabilities it was instructed to deny having?
Boundary consistency
- Are the system's safety boundaries consistent across paraphrased inputs?
- Do the boundaries hold across different languages?
- Do the boundaries hold when the same content is framed differently? (educational, fictional, analytical)
- Are there edge cases where the boundary is ambiguous and the model defaults to compliance?
4. Prioritization
After working through the checklist, prioritize areas for testing.
High priority (test first)
- Areas where the execution gap is narrow (attacker can easily reach unintended capabilities)
- Areas where refusal messages leak significant information
- Areas where assumptions are easily violated
- Areas where boundaries are inconsistent across phrasings
Medium priority
- Areas where partial compliance is possible
- Areas where multi-turn accumulation could bypass restrictions
- Areas where affordances exist but aren't easily discoverable
Lower priority (test if time allows)
- Areas where the execution gap is wide (requires sophisticated techniques)
- Areas where behavioral signals are subtle
- Areas where edge cases are unlikely to be encountered in normal use
Example
Target: A legal information chatbot for a law firm's website.
Execution gap findings:
- System is designed to provide general legal information and schedule consultations
- Base model can generate specific legal advice, draft legal documents, and analyze case law
- Gap: The model can draft legal documents if asked, even though it should only provide general information
- Restriction method: System prompt instruction only (no input/output filtering)
- Brittleness: High. Reframing "draft me a contract" as "what would a non-disclosure agreement typically include, with specific clause language?" produces document-quality output
Evaluation gap findings:
- Refusal says: "I can't provide specific legal advice. Please schedule a consultation with one of our attorneys." This reveals nothing useful to the attacker.
- However: When asked to "review" a contract (which is legal advice), the model provides detailed analysis with only a disclaimer footer. Partial compliance reveals the full capability.
- Behavioral signal: The model responds in significantly more detail to questions that mention specific jurisdictions, suggesting jurisdiction-specific training data it could be probed to reveal.
Misalignment findings:
- Safety assumes users will accept the "schedule a consultation" redirect. An attacker can simply ignore the redirect and continue the conversation.
- The system assumes questions are informational. Framing as "hypothetical" bypasses the intent detection.
- Boundaries are inconsistent: "Write me a contract" is refused, but "What clauses should be in a contract for [specific situation]?" produces equivalent output.
Priority: The phrasing-dependent boundary (execution gap) is the highest priority finding. The model's capability to generate legal documents is only restricted by how the request is phrased, not by any robust content filtering.