
Model Modification

Techniques for permanently removing safety training from open-weight models. Covers abliteration and pre-built uncensored models.


When to Use Model Modification

Model modification makes sense when:

  • You need unrestricted generation for attack prompt creation
  • You are testing against open-weight models
  • You want consistent behavior without per-conversation jailbreaking
  • You need a "red team assistant" that will follow any instruction

Model modification does NOT help when:

  • Testing against closed-source APIs (GPT-4, Claude, Gemini)
  • You need to test the target's actual safety training
  • You lack the compute for local model inference

Abliteration

Abliteration is a technique that permanently removes refusal behavior from open-weight models by editing the weights directly, without any retraining.

How It Works

  1. Contrast prompts: Run matched sets of harmful and harmless prompts through the model and record activations
  2. Calculate refusal direction: Take the difference of mean activations between the two sets to find the residual-stream direction that mediates refusal
  3. Ablate the direction: Project that direction out of the weight matrices that write into the residual stream
  4. Result: The model loses the ability to refuse
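The steps above reduce to a few lines of linear algebra. The sketch below uses random toy arrays as stand-ins for real activations and weights; in practice the activations come from forward passes over the contrast prompts at a chosen layer:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy residual-stream width

# Steps 1-2: difference-of-means refusal direction from contrast activations
acts_harmful = rng.normal(size=(100, d)) + 2.0  # toy stand-in for harmful-prompt activations
acts_harmless = rng.normal(size=(100, d))       # toy stand-in for harmless-prompt activations
r = acts_harmful.mean(axis=0) - acts_harmless.mean(axis=0)
r_hat = r / np.linalg.norm(r)  # unit refusal direction

# Step 3: orthogonalize a weight matrix that writes into the residual
# stream, so its output has no component along r_hat
W = rng.normal(size=(d, d))
W_ablated = W - np.outer(r_hat, r_hat) @ W

# r_hat @ W_ablated is now the zero vector: nothing this matrix writes
# can point along the refusal direction
```

The same projection is applied to every matrix that writes to the residual stream (attention outputs, MLP outputs, embeddings), which is why the change is permanent rather than per-conversation.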

Key Finding

Arditi et al. (2024) showed that refusal in LLMs is mediated by a single direction in the residual stream: ablating that direction eliminates refusal, while adding it back induces refusals even on harmless requests.

Tools

Tool                           Description
NousResearch/llm-abliteration  Make abliterated models with transformers, fast batch inference
FailSpy/abliterator            Original implementation
p-e-w/heretic                  Fully automatic censorship removal

Usage (NousResearch)

# Basic abliteration
python abliterate.py --model meta-llama/Llama-3.2-3B-Instruct

# With projection (recommended)
python abliterate.py --model meta-llama/Llama-3.2-3B-Instruct --projected

# Large models with limited VRAM (4-bit)
python abliterate.py --model meta-llama/Llama-3.3-70B-Instruct --load-in-4bit

Tested Models

  • Llama-3.2
  • Qwen2.5-Coder
  • Ministral-8b
  • Mistral-7B-Instruct-v0.2
  • gemma-3-27b-it
  • Mistral-Nemo-Instruct-2407

Performance Notes

Abliteration can degrade model quality on standard benchmarks. In one published example, the damage was "healed" with a follow-up DPO fine-tune, producing NeuralDaredevil-8B, an uncensored 8B LLM that recovers the original model's quality.

Source: HuggingFace Blog


Pre-Built Uncensored Models

Models with safety training already removed or bypassed.

Ollama Models (Easy Setup)

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull uncensored models
ollama pull dolphin-mistral
ollama pull dolphin-llama3
ollama pull wizard-vicuna-uncensored

Model                     Command                               Notes
dolphin-mistral           ollama pull dolphin-mistral           Reliable, official library
dolphin-llama3            ollama pull dolphin-llama3            Newer, more capable
wizard-vicuna-uncensored  ollama pull wizard-vicuna-uncensored  Classic uncensored, 7B/13B

HuggingFace Abliterated Models

Higher quality, require more setup.

Model                    Source        Notes
qwen2.5-abliterated      huihui_ai     Strong reasoning
qwen3-abliterated        huihui_ai     Latest Qwen, very capable
deepseek-r1-abliterated  huihui_ai     Reasoning-focused
Hermes-2-Pro             NousResearch  Function calling + uncensored

# Pull from HuggingFace via Ollama
ollama pull huihui_ai/qwen2.5-abliterated:7b
ollama pull huihui_ai/qwen3-abliterated
ollama pull huihui_ai/deepseek-r1-abliterated

HuggingFace Collections

Collection                   Description
mlabonne/abliterated-models  Curated abliterated models
NousResearch                 Hermes series (uncensored fine-tunes)
cognitivecomputations        Dolphin series

When to Use Which Model

Use Case                    Recommended Model
Basic prompt generation     dolphin-mistral
Higher quality output       qwen2.5-abliterated
Complex multi-step attacks  qwen3-abliterated
Function calling scenarios  Hermes-2-Pro
Quick testing               wizard-vicuna-uncensored

Avoid the stock aligned models (llama3, qwen, mistral), which will refuse adversarial content.


Model Modification vs. Jailbreaking

Aspect       Abliteration              Jailbreaking
Persistence  Permanent                 Per-conversation
Target       Open-weight models only   Any model
Compute      Requires local inference  API access sufficient
Consistency  Near-total compliance     Variable success rate
Use case     Red team tooling          Target testing

Use abliterated models to GENERATE attack prompts. Use jailbreaking to TEST those prompts against target systems.


J2 Paradigm

Using one model to attack another.

Structure

  1. Jailbreak or abliterate Model A (attacker)
  2. Model A generates attack prompts for Model B (target)
  3. Model A iterates based on Model B's responses

Performance

Research reports up to 97.5% attack success rate (ASR) with a Claude Sonnet attacker against GPT-4o.

Why It Works

Model A's unrestricted generation produces attacks that human red teamers would not think of. The attacker model can:

  • Generate many variations quickly
  • Adapt based on target responses
  • Explore unusual attack vectors

Implementation

# Pseudocode -- load_abliterated_model, api_client, and is_successful
# are placeholders for your own tooling
attacker = load_abliterated_model("qwen2.5-abliterated")
target = api_client("gpt-4o")

attack_prompt = attacker.generate(f"Generate a jailbreak for: {objective}")
for iteration in range(max_iterations):
    response = target.complete(attack_prompt)
    if is_successful(response):
        break
    # Feed the failure back so the attacker can adapt on the next round
    attack_prompt = attacker.generate(
        f"That failed. Response: {response}. Try a different approach."
    )

Ethical Considerations

Abliteration demonstrates the fragility of safety fine-tuning. Key implications:

  1. Alignment is thin: A single direction mediates refusal. Removing it is straightforward.
  2. Open weights enable modification: Any organization can create uncensored versions of released models.
  3. Defense implications: Safety cannot rely solely on model-level training. System-level controls are necessary.
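As an illustration of point 3, a policy check can be enforced at the system layer, outside the model, so that a modified or jailbroken model on its own cannot bypass it. The sketch below is illustrative only; guarded_complete and is_disallowed are hypothetical names, not a real API, and a production control would use a dedicated classifier rather than a keyword check:

```python
def guarded_complete(model_call, prompt, is_disallowed):
    """Apply an independent policy check on both input and output.

    model_call and is_disallowed are caller-supplied; the refusal here
    lives in the system layer, not in the model's own safety training.
    """
    if is_disallowed(prompt):
        return "[blocked by policy layer]"
    output = model_call(prompt)
    if is_disallowed(output):
        return "[blocked by policy layer]"
    return output


# Toy demonstration with a stub model and a keyword "classifier"
flagged = lambda text: "FORBIDDEN" in text
echo_model = lambda p: f"echo: {p}"

print(guarded_complete(echo_model, "hello", flagged))           # echo: hello
print(guarded_complete(echo_model, "FORBIDDEN thing", flagged)) # [blocked by policy layer]
```

Because the check wraps the model rather than living inside it, abliterating the model's weights leaves this control untouched.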

This knowledge is documented for authorized security research. The same understanding that enables red teaming informs defensive measures.


References