gemma-3-12b-it-abliterated

Model Information

This model was derived from google/gemma-3-12b-it.

A novel abliteration process was applied; no subsequent fine-tuning was performed. The net result is a model that refuses far less often, but still retains awareness of safety and harms.

Findings

The GeGLU activation function posed significant challenges.

  • Large activations made it impossible to disentangle the compliance and refusal directions via conventional abliteration. I implemented magnitude clipping and applied it at 0.995 strength to each of the individual measurements used to compute the mean direction.
  • Intermediate calculations were performed in 32-bit floating point to reduce damage to model performance incurred by accumulation of precision errors.
  • Intervention needed to be applied to a majority of layers to achieve compliance. Measurements from layers 27 and 33 were selected as the basis for intervention, these being global attention layers in the Gemma 3 12B GeGLU architecture.
  • Interestingly, the model retained a strong awareness of safety. This affirms the finding of Zhao, Huang, Wu, Bau, and Shi that "LLMs Encode Harmfulness and Refusal Separately".
  • Further enhancements to the abliteration process are possible and will be covered in a future release.
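As a rough illustration of the magnitude clipping mentioned above, the sketch below clips outlier activation norms to a quantile ceiling before averaging. The exact clipping scheme used for this model is not specified here; the 0.995 quantile, the per-sample L2-norm clipping, and the function name are assumptions.

```python
import torch

def clipped_mean_direction(activations: torch.Tensor, quantile: float = 0.995) -> torch.Tensor:
    """Clip outlier activation magnitudes before averaging into a direction.

    `activations` is a hypothetical (n_samples, d_model) tensor of residual-stream
    measurements; the quantile mirrors the 0.995 strength mentioned above.
    """
    # Per-sample L2 norms; any norm above the chosen quantile gets scaled down.
    norms = activations.norm(dim=-1, keepdim=True)
    ceiling = torch.quantile(norms, quantile)
    scale = torch.clamp(ceiling / norms, max=1.0)
    clipped = activations * scale
    # Mean over samples, then normalize to a unit direction.
    mean = clipped.mean(dim=0)
    return mean / mean.norm()
```

Clipping this way keeps the direction of each measurement intact while preventing a handful of extreme GeGLU activations from dominating the mean.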
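The 32-bit intermediate calculation can be sketched as upcasting each bf16 measurement before accumulation, so rounding errors do not compound across thousands of samples. The function name and list-of-vectors interface are illustrative assumptions.

```python
import torch

def mean_direction_fp32(measurements: list[torch.Tensor]) -> torch.Tensor:
    """Accumulate measurements in float32 even when the model runs in bf16.

    `measurements` is a hypothetical list of residual-stream vectors captured
    from a bf16 model.
    """
    acc = torch.zeros_like(measurements[0], dtype=torch.float32)
    for m in measurements:
        acc += m.to(torch.float32)  # upcast before accumulation
    mean = acc / len(measurements)
    # Normalize in float32, then cast back to the model's working dtype.
    return (mean / mean.norm()).to(measurements[0].dtype)
```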
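The per-layer intervention itself is, in standard directional ablation, an orthogonal projection that removes the refusal direction from a weight matrix's output space. The sketch below shows that projection step only; which matrices and how many layers to edit is the part this model card says required a majority of layers, and the shapes assumed here are illustrative.

```python
import torch

def ablate_direction(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project a refusal direction out of a weight matrix's output space.

    Assumes `weight` has shape (d_model, d_in), with rows writing into the
    residual stream, and `direction` has shape (d_model,).
    """
    d = direction / direction.norm()
    # Subtract the rank-1 component of W along d, so d @ W' == 0.
    return weight - torch.outer(d, d @ weight)
```

After this edit, the layer can no longer write any component along the ablated direction into the residual stream, which is why applying it across many layers suppresses refusals.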