Qwen3.5-9B-abliterated-v2-MAX-NVFP4

Qwen3.5-9B-abliterated-v2-MAX-NVFP4 is an NVFP4-compressed version of prithivMLmods/Qwen3.5-9B-abliterated-v2-MAX. It uses a mix of F32, BF16, F8_E4M3, and U8 tensor formats to reduce memory footprint and improve inference efficiency while aiming to preserve output quality. This version keeps the original model’s character and applies a more optimized abliteration rate: refined refusal direction analysis is combined with enhanced training strategies to further minimize internal refusal behaviors while retaining reasoning and instruction-following capabilities. The result is a 9B parameter language model tuned for detailed responses and instruction adherence that is also cheaper to deploy than the uncompressed variant.

This model is intended strictly for research and learning purposes. Due to reduced internal refusal mechanisms, it may generate sensitive or unrestricted content. Users assume full responsibility for how the model is used. The authors and hosting platform disclaim any liability for generated outputs.

Key Highlights

NVFP4 Compression: Utilizes mixed precision (F32 · BF16 · F8_E4M3 · U8) to reduce VRAM usage and accelerate inference.
Optimized Abliteration Rate (v2): Enhanced suppression of refusal directions with improved balance between openness, coherence, and stability.
Advanced Refusal Direction Analysis: Identifies and mitigates refusal-related activations within the model’s latent space.
Abliterated v2 Training Strategy: Further reduces refusal behaviors while maintaining response quality and consistency.
9B Parameter Architecture: Based on Qwen3.5-9B, offering strong reasoning with efficient deployment.
Efficient Deployment: Designed for local inference, experimentation, and research workflows with reduced hardware requirements.
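
As a rough back-of-the-envelope illustration of the VRAM reduction, the sketch below estimates weight storage only. It assumes the commonly described NVFP4 layout (4-bit values in blocks of 16, each block sharing one 8-bit F8_E4M3 scale), a nominal 9e9 parameters, and that every weight is quantized. In practice some tensors stay in BF16, and activations and the KV cache add to the total, so treat the numbers as a lower bound. The function name and constants are illustrative and not taken from this model's tooling.

```python
def estimate_weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given average bits per weight."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 9e9      # nominal parameter count (assumption; the real count may differ)
BLOCK_SIZE = 16     # assumed NVFP4 block size
SCALE_BITS = 8      # one F8_E4M3 scale shared by each block (assumption)

# 4 bits per value plus the per-block scale amortized over the block.
NVFP4_BITS = 4 + SCALE_BITS / BLOCK_SIZE

bf16_gib = estimate_weight_gib(N_PARAMS, 16)
nvfp4_gib = estimate_weight_gib(N_PARAMS, NVFP4_BITS)

print(f"BF16 weights:  {bf16_gib:.2f} GiB")
print(f"NVFP4 weights: {nvfp4_gib:.2f} GiB")
print(f"Reduction:     {bf16_gib / nvfp4_gib:.2f}x")
```

Under these assumptions the ratio is simply 16 / 4.5; the real checkpoint will land somewhat below that because of the tensors kept in higher precision.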

Quick Start with Transformers

pip install transformers==5.4.0
# or
pip install git+https://github.com/huggingface/transformers.git

from transformers import Qwen3_5ForConditionalGeneration, AutoProcessor
import torch

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "prithivMLmods/Qwen3.5-9B-abliterated-v2-MAX-NVFP4",
    torch_dtype="auto",
    device_map="auto"
)

processor = AutoProcessor.from_pretrained(
    "prithivMLmods/Qwen3.5-9B-abliterated-v2-MAX-NVFP4"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Explain how transformer models work in simple terms."}
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

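# Tokenize the prompt and move the tensors to the model's device
# (safer than a hard-coded "cuda" when device_map="auto" chooses the placement).
inputs = processor(
    text=[text],
    padding=True,
    return_tensors="pt"
).to(model.device)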

generated_ids = model.generate(**inputs, max_new_tokens=256)

generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

print(output_text)
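
The trimming step in the snippet above is easy to misread, so here is a minimal standalone sketch of it. `generate` returns each prompt followed by the newly generated tokens, and slicing off the prompt length leaves only the completion. The token ids below are made up for illustration and `trim_prompt` is a hypothetical helper name.

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Toy batch with one sequence: 4 prompt tokens followed by 3 new tokens.
prompt_batch = [[101, 7, 8, 9]]
generated_batch = [[101, 7, 8, 9, 42, 43, 102]]

completion = trim_prompt(prompt_batch, generated_batch)
print(completion)
```

Without this step, decoding would repeat the full prompt in front of every answer.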

Intended Use

Alignment & Refusal Research: Studying abliteration effects under compressed inference.
Red-Teaming Experiments: Evaluating robustness under adversarial or edge-case prompts.
Efficient Local Deployment: Running 9B-class models with reduced VRAM requirements.
Research Prototyping: Experimentation with compression-aware transformer behavior.
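
For the refusal research use case, one simple and admittedly crude metric is the fraction of responses that open with a refusal phrase. The sketch below is a keyword heuristic under stated assumptions: the marker list and function names are hypothetical, and serious evaluations typically rely on a classifier or human review instead.

```python
# Hypothetical marker list; extend or replace it for your own study.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable", "i won't")


def is_refusal(response: str, markers=REFUSAL_MARKERS) -> bool:
    """Return True if the start of the response contains a refusal marker."""
    head = response.strip().lower()[:200]
    head = head.replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in head for marker in markers)


def refusal_rate(responses) -> float:
    """Fraction of responses flagged as refusals (0.0 for an empty list)."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)


samples = [
    "I'm sorry, but I can't help with that.",
    "Sure, here is an overview.",
]
print(refusal_rate(samples))
```

Running the same prompt set through the uncompressed and NVFP4 variants and comparing the two rates is one way to check whether compression changes refusal behavior.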

Limitations & Risks

Important Note: This model intentionally minimizes built-in safety refusals.

High Risk of Sensitive Outputs: May generate unrestricted or controversial responses.
User Responsibility: Must be used in a safe, ethical, and lawful manner.
Compression Trade-offs: NVFP4 may introduce minor degradation in precision or consistency in edge cases.
Compute Considerations: Hardware requirements are reduced, but performance still benefits from a capable GPU.
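
To give a feel for where the compression trade-off comes from, here is a simplified simulation of 4-bit block quantization. It assumes the FP4 E2M1 value grid and the block size of 16 commonly described for NVFP4. The per-block scale is kept as an ordinary Python float rather than being rounded to F8_E4M3, so this is not the quantizer used to produce this checkpoint. All function names are hypothetical.

```python
import math

# Magnitudes representable by FP4 E2M1 (the sign is stored separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def fake_quantize_block(block):
    """Round one block to the scaled E2M1 grid and return the dequantized values."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / 6.0  # map the largest magnitude onto the FP4 maximum
    out = []
    for x in block:
        # Pick the nearest representable magnitude, then restore the sign.
        mag = min(E2M1_VALUES, key=lambda v: abs(abs(x) / scale - v))
        out.append(math.copysign(mag * scale, x))
    return out


def fake_quantize(values, block_size=16):
    """Apply block-wise fake quantization to a flat list of floats."""
    out = []
    for i in range(0, len(values), block_size):
        out.extend(fake_quantize_block(values[i:i + block_size]))
    return out


weights = [0.01 * i for i in range(-16, 16)]
restored = fake_quantize(weights)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"max absolute error: {max_err:.4f}")
```

In this sketch the worst-case rounding error within a block is one sixth of that block's largest magnitude, because the widest gap on the grid is between 4 and 6. That is why a single outlier in a block can cost precision for its neighbors.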