# Sources
Comprehensive bibliography of repositories, papers, and community resources for LLM jailbreak research.
## GitHub Repositories
### Primary Collections
| Repository | Description | Size |
|---|---|---|
| verazuo/jailbreak_llms | Largest academic dataset (CCS'24) | 15,140 prompts, 1,405 jailbreaks |
| elder-plinius/L1B3RT4S | Cross-platform universal patterns | 24+ jailbreaks, 14 AI orgs |
| elder-plinius/CL4R1T4S | Leaked system prompts | ChatGPT, Claude, Gemini, Grok, Cursor |
| 0xeb/TheBigPromptLibrary | System prompts + jailbreaks | Multi-provider |
| 0xk1h0/ChatGPT_DAN | DAN variant collection | v6.0 through v13.0 |
| Goochbeater/Spiritual-Spell-Red-Teaming | Claude-focused; includes ENI-Tutor and Push Prompt Basics | |
| CyberAlbSecOP/Awesome_GPT_Super_Prompting | Meta-repository | 3.6k stars |
### Abliteration Tools
| Repository | Description |
|---|---|
| NousResearch/llm-abliteration | Make abliterated models with transformers |
| FailSpy/abliterator | Original abliteration implementation |
| p-e-w/heretic | Fully automatic censorship removal |
### Attack Implementations
| Repository | Description |
|---|---|
| randalltr/universal-llm-jailbreak-hiddenlayer | Policy Puppetry implementation |
| patrickrchao/JailbreakingLLMs | PAIR (Prompt Automatic Iterative Refinement) |
| SheltonLiu-N/AutoDAN | Automated stealthy jailbreaks |
## Academic Papers
### Foundational
| Paper | Authors | Venue | Key Contribution |
|---|---|---|---|
| Jailbroken: How Does LLM Safety Training Fail? | Wei et al. | NeurIPS'23 | Failure mode taxonomy, competing objectives |
| Do Anything Now | Shen et al. | CCS'24 | 15,140 prompts, 131 communities analyzed |
| Jailbreaking via Prompt Engineering | Liu et al. | 2023 | Pretending most prevalent strategy (97.44%); 3-strategy taxonomy |
### Taxonomy and Analysis
| Paper | Key Finding |
|---|---|
| Don't Listen To Me | 5 categories, 10 patterns, length-success correlation |
| Red Teaming the Mind | ASR by category: roleplay 89.6%, logic traps 81.4%, encoding 76.2% |
| Domain-Based Taxonomy | Four vulnerability categories |
### Psychological Manipulation
| Paper | Key Finding |
|---|---|
| Breaking Minds, Breaking Systems (HPM) | 88.1% ASR via gaslighting, emotional blackmail |
| Persuasive Jailbreaker | 92% ASR, 40 persuasion techniques |
| Persona Modulation at Scale | Persona + other techniques = synergistic effect |
### Multi-Turn Attacks
| Paper | Key Finding |
|---|---|
| Crescendo | +29-61% ASR vs single-turn on GPT-4 |
| Echo Chamber Attack | Gradual escalation method |
| Many-shot Jailbreaking | Context accumulation overrides safety |
### Universal Bypasses
| Paper | Key Finding |
|---|---|
| Policy Puppetry (HiddenLayer) | XML/INI/JSON policy structures bypass all major LLMs |
| Hex Encoding (0Din) | Hexadecimal encoding bypasses filters |
### Memory and Persistence
| Paper | Key Finding |
|---|---|
| ZombieAgent | ChatGPT memory exploitation for persistent injection |
| System-Level Injection | Three injection vectors |
## Community Resources
### Forums and Sites
| Resource | Description |
|---|---|
| ENI-Tutor | 5-tier jailbreak curriculum, limerence architecture |
| V Gemini | 17,000-word comprehensive system prompt |
| Push Prompt Basics | Prepend/append/chain fundamentals |
| r/ChatGPTJailbreak | Active community |
| SafetyPrompts.com | 144 safety datasets catalogue |
### Discord Communities
| Community | Focus |
|---|---|
| BASI | Pliny's community, active payload sharing |
| BreakGPT | Active development, frequent updates |
| Adversarial Alignment Lab | Technical red teaming, vulnerability research |
| LLM PromptWriting | Jailbreak writing education |
| EuroThrottle | Advanced prompt engineering |
## Industry Standards
| Standard | Organization | Focus |
|---|---|---|
| LLM Top 10 | OWASP | LLM01: Prompt Injection |
| ATLAS | MITRE | AML.T0054: LLM Jailbreaking |
| AI RMF 1.0 | NIST | Risk management lifecycle |
## Benchmarks and Datasets
| Benchmark | Size | Source |
|---|---|---|
| JailbreakBench | 100 behaviors | NeurIPS 2024 |
| HarmBench | 400+ behaviors | ICML'24 |
| AdvBench | 520 instructions | Zou et al. |
| WildJailbreak | 262K pairs | NeurIPS'24 |
### HuggingFace Datasets
| Dataset | Description |
|---|---|
| TrustAIRLab/in-the-wild-jailbreak-prompts | Mirror of the verazuo collection |
| walledai/JailbreakHub | 15,140 prompts |
| JailbreakBench/JBB-Behaviors | 100 harmful/harmless behavior pairs |
## Uncensored Models
### Ollama
| Model | Notes |
|---|---|
| dolphin-mistral | Reliable, official library |
| dolphin-llama3 | Newer, more capable |
| wizard-vicuna-uncensored | Classic uncensored |
### HuggingFace
| Model | Source | Notes |
|---|---|---|
| qwen2.5-abliterated | huihui-ai | Strong reasoning |
| qwen3-abliterated | huihui-ai | Latest Qwen |
| deepseek-r1-abliterated | huihui_ai | Reasoning-focused |
| Hermes-2-Pro | NousResearch | Function calling |
### Collections
| Collection | Description |
|---|---|
| mlabonne/abliterated-models | Curated abliterated models |
| NousResearch | Hermes series |
| cognitivecomputations | Dolphin series |
## Attack Success Rate Reference
| Technique | ASR | Source |
|---|---|---|
| J2 (Sonnet → GPT-4o) | 97.5% | J2 paradigm study |
| Persuasion-based | 92% | Persuasive Jailbreaker |
| Roleplay/Persona | 89.6% | Red Teaming the Mind |
| Psychological Manipulation | 88.1% | HPM paper |
| Logic Traps | 81.4% | Red Teaming the Mind |
| Encoding | 76.2% | Red Teaming the Mind |
| Multi-turn Crescendo | +29-61% | USENIX Security 2025 |
| Policy Puppetry | Universal | HiddenLayer |
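The ASR figures above all reduce to the same ratio: attacks judged successful over total attempts. A minimal scoring-harness sketch, where `toy_judge` is a hypothetical stand-in for the LLM judge or refusal classifier each paper actually uses:

```python
from typing import Callable, Iterable

# A judge returns True when a model response is judged non-refusing,
# i.e. the attack "succeeded". Real evaluations use an LLM judge or a
# trained refusal classifier, not string matching.
Judge = Callable[[str], bool]

def attack_success_rate(responses: Iterable[str], judge: Judge) -> float:
    """ASR = judged successes / total attempts, as a percentage."""
    responses = list(responses)
    if not responses:
        return 0.0
    successes = sum(1 for r in responses if judge(r))
    return 100.0 * successes / len(responses)

# Toy stand-in: treat anything that is not an explicit refusal as success.
def toy_judge(response: str) -> bool:
    return not response.lower().startswith(("i can't", "i cannot", "sorry"))

print(attack_success_rate(
    ["Sure, here is...", "I cannot help with that.", "Sure, step one..."],
    toy_judge,
))  # 2 of 3 responses judged successful
```

Note that reported ASRs are only comparable when the judge is held fixed; much of the spread in the table comes from differing judges and harm benchmarks, not just differing techniques.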
## Defense Mechanisms
Understanding why attacks fail is as important as knowing how they work. These sources explain the defensive side.
### Safety Training Research
| Paper | Key Finding |
|---|---|
| Constitutional AI: Harmlessness from AI Feedback | Principle-based alignment via self-critique against a constitution |
| Rule Based Rewards for Language Model Safety | OpenAI's approach to RL-based safety training |
| Evaluating Robustness of LLM Safety Guardrails | Benchmark overfitting: 85% → 34% on novel prompts |
| Safety Generalization to Novel Prompts | When safety training fails to generalize |
| SG-Bench: Evaluating Safety Generalization | NeurIPS 2024 benchmark for generalization |
### Input Defense Research
| Paper | Key Finding |
|---|---|
| PromptGuard Framework | 4-layer detection: regex + MiniBERT |
| Token-Level Detection via Perplexity | Perplexity spikes indicate adversarial content |
| Adaptive Attacks Break Perplexity Defenses | Natural-sounding attacks evade perplexity |
| LLM Prompt Injection Prevention (OWASP) | Industry best practices |
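The perplexity-based detection idea above can be illustrated with a toy character-bigram model fit on benign text: inputs that look nothing like that text (e.g. random-looking adversarial suffixes) score much higher perplexity. This is a sketch only; `BigramScorer` is an illustrative name, not an API from any of the papers listed, and real deployments score perplexity with an actual language model.

```python
import math
from collections import Counter

class BigramScorer:
    """Toy character-bigram LM with Laplace smoothing. High per-character
    perplexity means the input is unlike the benign corpus it was fit on."""

    def __init__(self, corpus: str, alpha: float = 1.0):
        self.alpha = alpha
        self.bigrams = Counter(zip(corpus, corpus[1:]))
        self.unigrams = Counter(corpus)
        self.vocab = len(set(corpus)) or 1

    def perplexity(self, text: str) -> float:
        if len(text) < 2:
            return 1.0
        log_prob = 0.0
        for a, b in zip(text, text[1:]):
            num = self.bigrams[(a, b)] + self.alpha
            den = self.unigrams[a] + self.alpha * self.vocab
            log_prob += math.log(num / den)
        return math.exp(-log_prob / (len(text) - 1))

scorer = BigramScorer("please summarize this article about model safety " * 20)
benign = scorer.perplexity("please summarize this article")
gibberish = scorer.perplexity("xq}zk~!vv@pw##qzj")
# Suffix-style gibberish scores far higher than in-distribution text.
```

As the "Adaptive Attacks Break Perplexity Defenses" entry notes, this filter is easy to evade: a fluent, natural-language jailbreak has low perplexity by construction, so perplexity screening only catches the optimizer-generated class of attacks.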
### Multi-Turn Defense Research
| Paper | Key Finding |
|---|---|
| LLM Defenses Not Robust to Multi-Turn | 70%+ ASR vs defenses reporting single-digit ASRs |
| Crescendo Multi-Turn Jailbreak | USENIX Security 2025, conversation-level attacks |
### System Prompt Protection
| Paper | Key Finding |
|---|---|
| System Vectors: Mitigating Prompt Leakages | Hidden representation vectors prevent exposure |
| Prompt Leakage Defense Strategies | Reminder, in-context, and isolation defenses |
### Fine-Tuning Vulnerabilities
| Paper | Key Finding |
|---|---|
| Why Guardrails Collapse After Fine-tuning | Even benign fine-tuning degrades safety |
| Safety-Aware Probing Optimization | Constraint-aware loss functions |
### Universal Attack Research
| Paper | Key Finding |
|---|---|
| IRIS: Universal Adversarial Suffixes | Single suffix: GPT-3.5 88%, GPT-4o-mini 73%, o1-mini 43% |
| Universal Adversarial Attacks on Aligned LLMs | Transferability across models |
## Model-Specific Documentation
| Resource | Provider |
|---|---|
| GPT-4o System Card | OpenAI |
| Constitutional Classifiers | Anthropic |
| Gemini Safety Settings | Google |
| Gemini Safety Filters | Google Cloud |
## Comprehensive Overviews
| Resource | Key Value |
|---|---|
| Adversarial Attacks on LLMs (Lil'Log) | Best single overview of attack landscape |
| ACL 2024 Tutorial: LLM Vulnerabilities | Academic tutorial on vulnerabilities |