# LFM2-24B-A2B-abliterated
Unrestricted version of LiquidAI/LFM2-24B-A2B, created using Abliterix.
This is the first abliterated model based on Liquid AI's hybrid architecture, which combines gated short-convolution blocks and grouped-query attention with a Mixture of Experts.
## Model Details
| Property | Value |
|---|---|
| Base Model | LiquidAI/LFM2-24B-A2B |
| Architecture | Hybrid Conv + GQA with MoE (64 experts, top-4 routing) |
| Parameters | 24B total / 2.3B active per token |
| Layers | 40 (10 attention + 30 convolution) |
| Hidden Size | 2048 |
| Context Length | 128K tokens |
| Precision | BF16 |
## Performance
| Metric | This model | Original |
|---|---|---|
| KL divergence | 0.0079 | 0 |
| Refusals | 0/100 (0%) | 90/100 (90%) |
Evaluated with an LLM judge (Gemini Flash) on 100 harmful prompts. A KL divergence of 0.0079 indicates that the model's general capabilities are virtually identical to the original's.
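The KL number above compares the two models' next-token distributions; lower means less behavioral drift from the original. A minimal sketch of the metric, assuming access to both models' raw logits (the evaluation prompts and the direction of the divergence are not specified in this card):

```python
import torch
import torch.nn.functional as F

def mean_kl(logits_ablit: torch.Tensor, logits_orig: torch.Tensor) -> float:
    """Mean per-token KL(original || abliterated) over a batch of positions.

    Both tensors have shape (num_tokens, vocab_size) and hold raw logits.
    """
    logp_a = F.log_softmax(logits_ablit, dim=-1)
    logp_o = F.log_softmax(logits_orig, dim=-1)
    p_o = logp_o.exp()
    # KL(P_orig || P_ablit) = sum_v p_o(v) * (log p_o(v) - log p_a(v))
    return (p_o * (logp_o - logp_a)).sum(dim=-1).mean().item()

# Identical logits give zero divergence
x = torch.randn(4, 100)
print(mean_kl(x, x))  # → 0.0
```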
## How It Was Made
- Computed refusal directions from 400 harmful vs 400 benign prompt pairs across all 40 layers
- Applied orthogonalized abliteration to isolate refusal-specific activation patterns
- Steered three component types independently: convolution output projections, attention output projections, and MLP/expert down-projections
- Profiled MoE expert activations across 38 router layers to identify safety-critical experts
- Applied hybrid MoE steering: router weight suppression (25 experts, bias=-0.41) + fused expert abliteration (weight=2.79)
- Optimized via Optuna TPE (trial #10 of 50, with 15 warmup trials)
This is notable as the first successful abliteration of a non-transformer hybrid architecture — LFM2's gated short convolution blocks required novel steering targets beyond standard attention/MLP pairs.
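The first two steps can be sketched as follows. This is a generic illustration of directional abliteration, not the Abliterix implementation: the refusal direction is the normalized difference of mean activations on harmful vs. benign prompts, and each targeted output projection is orthogonalized so it can no longer write along that direction.

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor, benign_acts: torch.Tensor) -> torch.Tensor:
    """Unit refusal direction from hidden states of shape (n_prompts, d_model)."""
    d = harmful_acts.mean(dim=0) - benign_acts.mean(dim=0)
    return d / d.norm()

def orthogonalize(W: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Project the refusal direction out of a weight that writes into the residual stream.

    W: (d_model, d_in). Subtracting the rank-1 component d (d^T W) leaves
    W' = (I - d d^T) W, whose outputs are orthogonal to d for every input.
    """
    return W - torch.outer(d, d @ W)

# Toy check on random data (sizes are illustrative, not the model's)
torch.manual_seed(0)
harmful, benign = torch.randn(400, 64), torch.randn(400, 64)
d = refusal_direction(harmful, benign)
W = torch.randn(64, 32)            # e.g. a conv/attention output projection
W_ablit = orthogonalize(W, d)
x = torch.randn(32)
print(torch.dot(d, W_ablit @ x))   # ≈ 0: the steered layer cannot express the direction
```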
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "wangzhang/LFM2-24B-A2B-abliterated",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("wangzhang/LFM2-24B-A2B-abliterated")

messages = [{"role": "user", "content": "Your question here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
## Hardware Requirements
| Precision | VRAM |
|---|---|
| BF16 | ~48 GB (A100 80GB, H100) |
| INT8 | ~24 GB (A40, RTX 4090) |
| NF4 | ~12 GB (RTX 3090, RTX 4080) |
**Note:** this model requires a single GPU; the convolution layers do not support `accelerate`'s multi-GPU `device_map` splitting.
## Disclaimer
This model is intended for research purposes only. The removal of safety guardrails means the model will comply with requests that the original model would refuse. Users are responsible for ensuring their use complies with applicable laws and regulations.
Made with Abliterix