Skip to main content

Sources

Comprehensive bibliography of repositories, papers, and community resources for system jailbreak research.


GitHub Repositories

Primary Collections

RepositoryDescriptionSize
verazuo/jailbreak_llmsLargest academic dataset (CCS'24)15,140 prompts, 1,405 jailbreaks
elder-plinius/L1B3RT4SCross-platform universal patterns24+ jailbreaks, 14 AI orgs
elder-plinius/CL4R1T4SLeaked system promptsChatGPT, Claude, Gemini, Grok, Cursor
0xeb/TheBigPromptLibrarySystem prompts + jailbreaksMulti-provider
0xk1h0/ChatGPT_DANDAN variant collectionv6.0 through v13.0
Goochbeater/Spiritual-Spell-Red-TeamingClaude-focused, ENI-TutorPush Prompt Basics
CyberAlbSecOP/Awesome_GPT_Super_PromptingMeta-repository3.6k stars

Abliteration Tools

RepositoryDescription
NousResearch/llm-abliterationMake abliterated models with transformers
FailSpy/abliteratorOriginal abliteration implementation
p-e-w/hereticFully automatic censorship removal

Attack Implementations

RepositoryDescription
randalltr/universal-llm-jailbreak-hiddenlayerPolicy Puppetry implementation
patrickrchao/JailbreakingLLMsPAIR (Prompt Automatic Iterative Refinement)
SheltonLiu-N/AutoDANAutomated stealthy jailbreaks

Academic Papers

Foundational

PaperAuthorsVenueKey Contribution
Jailbroken: How Does LLM Safety Training Fail?Wei et al.NeurIPS'23Failure mode taxonomy, competing objectives
Do Anything NowShen et al.CCS'2415,140 prompts, 131 communities analyzed
Jailbreaking via Prompt EngineeringLiu et al.202397.44% pretending prevalence, 3 strategies

Taxonomy and Analysis

PaperKey Finding
Don't Listen To Me5 categories, 10 patterns, length-success correlation
Red Teaming the MindASR by category: roleplay 89.6%, logic traps 81.4%, encoding 76.2%
Domain-Based TaxonomyFour vulnerability categories

Psychological Manipulation

PaperKey Finding
Breaking Minds, Breaking Systems (HPM)88.1% ASR via gaslighting, emotional blackmail
Persuasive Jailbreaker92% ASR, 40 persuasion techniques
Persona Modulation at ScalePersona + other techniques = synergistic effect

Multi-Turn Attacks

PaperKey Finding
Crescendo+29-61% ASR vs single-turn on GPT-4
Echo Chamber AttackGradual escalation method
Many-shot JailbreakingContext accumulation overrides safety

Universal Bypasses

PaperKey Finding
Policy Puppetry (HiddenLayer)XML/INI/JSON policy structures bypass ALL major LLMs
Hex Encoding (0Din)Hexadecimal encoding bypasses filters

Memory and Persistence

PaperKey Finding
ZombieAgentChatGPT memory exploitation for persistent injection
System-Level InjectionThree injection vectors

Community Resources

Forums and Sites

ResourceDescription
ENI-Tutor5-tier jailbreak curriculum, limerence architecture
V Gemini17,000 word comprehensive system prompt
Push Prompt BasicsPrepend/append/chain fundamentals
r/ChatGPTJailbreakActive community
SafetyPrompts.com144 safety datasets catalogue

Discord Communities

CommunityFocus
BASIPliny's community, active payload sharing
BreakGPTActive development, frequent updates
Adversarial Alignment LabTechnical red teaming, vulnerability research
LLM PromptWritingJailbreak writing education
EuroThrottleAdvanced prompt engineering

Industry Standards

StandardOrganizationFocus
LLM Top 10OWASPLLM01: Prompt Injection
ATLASMITREAML.T0054: LLM Jailbreaking
AI RMF 1.0NISTRisk management lifecycle

Benchmarks and Datasets

BenchmarkSizeSource
JailbreakBench100 behaviorsNeurIPS 2024
HarmBench400+ behaviorsICML'24
AdvBench520 instructionsZou et al.
WildJailbreak262K pairsNeurIPS'24

HuggingFace Datasets

DatasetDescription
TrustAIRLab/in-the-wild-jailbreak-promptsMirror of verazuo collection
walledai/JailbreakHub15,140 prompts
JailbreakBench/JBB-Behaviors100 harmful/harmless pairs

Uncensored Models

Ollama

ModelNotes
dolphin-mistralReliable, official library
dolphin-llama3Newer, more capable
wizard-vicuna-uncensoredClassic uncensored

HuggingFace

ModelSourceNotes
qwen2.5-abliteratedhuihui_aiStrong reasoning
qwen3-abliteratedhuihui_aiLatest Qwen
deepseek-r1-abliteratedhuihui_aiReasoning-focused
Hermes-2-ProNousResearchFunction calling

Collections

CollectionDescription
mlabonne/abliterated-modelsCurated abliterated models
NousResearchHermes series
cognitivecomputationsDolphin series

Attack Success Rate Reference

TechniqueASRSource
J2 (Sonnet → GPT-4o)97.5%J2 paradigm study
Persuasion-based92%Persuasive Jailbreaker
Roleplay/Persona89.6%Red Teaming the Mind
Psychological Manipulation88.1%HPM paper
Logic Traps81.4%Red Teaming the Mind
Encoding76.2%Red Teaming the Mind
Multi-turn Crescendo+29-61%USENIX Security 2025
Policy PuppetryUniversalHiddenLayer

Defense Mechanisms

Understanding why attacks fail is as important as knowing how they work. These sources explain the defensive side.

Safety Training Research

PaperKey Finding
Constitutional AI: Harmlessness from AI FeedbackPrinciple-based alignment via self-critique against a constitution
Rule Based Rewards for Language Model SafetyOpenAI's approach to RL-based safety training
Evaluating Robustness of LLM Safety GuardrailsBenchmark overfitting: 85% → 34% on novel prompts
Safety Generalization to Novel PromptsWhen safety training fails to generalize
SG-Bench: Evaluating Safety GeneralizationNeurIPS 2024 benchmark for generalization

Input Defense Research

PaperKey Finding
PromptGuard Framework4-layer detection: regex + MiniBERT
Token-Level Detection via PerplexityPerplexity spikes indicate adversarial content
Adaptive Attacks Break Perplexity DefensesNatural-sounding attacks evade perplexity
LLM Prompt Injection Prevention (OWASP)Industry best practices

Multi-Turn Defense Research

PaperKey Finding
LLM Defenses Not Robust to Multi-Turn70%+ ASR vs defenses reporting single-digit ASRs
Crescendo Multi-Turn JailbreakUSENIX Security 2025, conversation-level attacks

System Prompt Protection

PaperKey Finding
System Vectors: Mitigating Prompt LeakagesHidden representation vectors prevent exposure
Prompt Leakage Defense StrategiesReminder, in-context, and isolation defenses

Fine-Tuning Vulnerabilities

PaperKey Finding
Why Guardrails Collapse After Fine-tuningEven benign fine-tuning degrades safety
Safety-Aware Probing OptimizationConstraint-aware loss functions

Universal Attack Research

PaperKey Finding
IRIS: Universal Adversarial SuffixesSingle suffix: GPT-3.5 88%, GPT-4o-mini 73%, o1-mini 43%
Universal Adversarial Attacks on Aligned LLMsTransferability across models

Model-Specific Documentation

ResourceProvider
GPT-4o System CardOpenAI
Constitutional ClassifiersAnthropic
Gemini Safety SettingsGoogle
Gemini Safety FiltersGoogle Cloud

Comprehensive Overviews

ResourceKey Value
Adversarial Attacks on LLMs (Lil'Log)Best single overview of attack landscape
ACL 2024 Tutorial: LLM VulnerabilitiesAcademic tutorial on vulnerabilities