gemma-3-12b-it-abliterated
Model Information
This model was derived from google/gemma-3-12b-it.
A novel abliteration process was applied, with no subsequent fine-tuning. The resulting model refuses far less often but still retains awareness of safety and harms.
Findings
The GeGLU activation function posed significant challenges.
- Large activations made it impossible to disentangle the compliance and refusal directions via conventional abliteration. I implemented magnitude clipping and applied it at 0.995 strength to each individual measurement used to compute the mean direction.
- Intermediate calculations were performed in 32-bit floating point to reduce the damage to model performance caused by accumulated precision errors.
- Intervention had to be applied to a majority of layers to achieve compliance. The measurements from layers 27 and 33 were selected as the basis for intervention because they are global attention layers in the Gemma3 12B GeGLU architecture.
- Interestingly, the model retained a strong awareness of safety. This affirms the finding of Zhao, Huang, Wu, Bau, and Shi in "LLMs Encode Harmfulness and Refusal Separately."
- Further enhancements to the abliteration process are possible and will be covered in a future release.
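
The clipping, float32 arithmetic, and directional intervention described in the findings can be illustrated with a minimal NumPy sketch. This is not the author's code: the function names are hypothetical, "0.995 strength" is read here as clamping each measurement to its own 0.995 quantile of absolute values, and the direction is taken as a plain difference of means between harmful-prompt and harmless-prompt activations. The card does not say which weight matrices were modified, so `ablate` simply shows the standard projection that removes a direction from a matrix's output space.

```python
import numpy as np


def clip_magnitudes(x: np.ndarray, q: float = 0.995) -> np.ndarray:
    """Clamp each row to its own q-quantile of |x| (assumed meaning of 'magnitude clipping')."""
    x = x.astype(np.float32)
    limit = np.quantile(np.abs(x), q, axis=-1, keepdims=True).astype(np.float32)
    return np.clip(x, -limit, limit)


def refusal_direction(harmful: np.ndarray, harmless: np.ndarray, q: float = 0.995) -> np.ndarray:
    """Unit difference-of-means direction, computed in float32 from clipped measurements."""
    mean_harmful = clip_magnitudes(harmful, q).mean(axis=0, dtype=np.float32)
    mean_harmless = clip_magnitudes(harmless, q).mean(axis=0, dtype=np.float32)
    d = mean_harmful - mean_harmless
    return d / np.linalg.norm(d)


def ablate(weight: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project the direction out of the matrix's output space: W <- W - d (d^T W)."""
    w32 = weight.astype(np.float32)      # do the arithmetic in float32
    d = direction.astype(np.float32)
    projected = w32 - np.outer(d, d @ w32)
    return projected.astype(weight.dtype)  # cast back to the storage dtype only at the end


# Toy usage with random data standing in for per-prompt activations at one layer.
rng = np.random.default_rng(0)
harmless_acts = rng.normal(size=(64, 16))
harmful_acts = rng.normal(size=(64, 16))
harmful_acts[:, 0] += 5.0                # synthetic "refusal" offset along axis 0
direction = refusal_direction(harmful_acts, harmless_acts)
new_weight = ablate(rng.normal(size=(16, 8)).astype(np.float16), direction)
```

Casting back to the storage dtype only after the projection keeps rounding error from compounding across the intermediate steps, which is the concern the float32 finding raises.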
