# Qwen3.5-9B-NVFP4
This is a quantized version of Qwen/Qwen3.5-9B, a model that accepts text and images as input and generates text as output. The weights and activations were quantized to FP4 using llm-compressor with 512 calibration samples from nvidia/Nemotron-Post-Training-Dataset-v2. Quantization reduces the model size from 18.0 GB to 11.5 GB (a ~1.6x reduction) while maintaining 97.3% average accuracy recovery on the benchmarks below.
## Quantization Details
- Scheme: NVFP4
- Calibration: 512 samples (256 reasoning-on + 256 reasoning-off) from Nemotron-Post-Training-Dataset-v2
- Max sequence length: 4096
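NVFP4 stores values as 4-bit floats (E2M1) with a shared scale per small block. The toy sketch below illustrates the block-quantization numerics only; it is not llm-compressor's implementation, and the block size and scale handling are simplified (the real format uses 16-element blocks with FP8 block scales).

```python
# The eight non-negative magnitudes representable in FP4 (E2M1).
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize a block of floats to signed FP4 codes with one shared scale."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax else 1.0  # map the largest magnitude to 6.0
    codes = []
    for x in block:
        # Round-to-nearest over the FP4 magnitude grid, then restore the sign.
        mag = min(FP4_VALUES, key=lambda v: abs(abs(x) / scale - v))
        codes.append(mag if x >= 0 else -mag)
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate float values from scale + FP4 codes."""
    return [scale * c for c in codes]
```

Values that happen to land on the scaled grid round-trip exactly; everything else is snapped to the nearest representable point, which is where the small accuracy loss in the table below comes from.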
## Inference
This model is supported in vLLM 0.17.0. To serve the model:

```shell
vllm serve Kbenkhaled/Qwen3.5-9B-NVFP4 \
  --reasoning-parser qwen3 \
  --enable-prefix-caching
```
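Once serving, the model is reachable through vLLM's OpenAI-compatible chat API. A minimal request sketch using only the standard library (the URL is vLLM's default bind address; the prompt is illustrative):

```python
import json
import urllib.request

# Chat-completion request for vLLM's OpenAI-compatible endpoint
# (http://localhost:8000 is the default server address).
payload = {
    "model": "Kbenkhaled/Qwen3.5-9B-NVFP4",
    "messages": [
        {"role": "user", "content": "Summarize NVFP4 quantization in one sentence."}
    ],
    "max_tokens": 256,
}
request = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Send with urllib.request.urlopen(request) once the server is running.
```

The `openai` Python client works equally well against the same endpoint; this sketch just avoids the extra dependency.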
## Evaluation
Evaluated with lm-evaluation-harness, 0-shot, thinking mode ON.
| Benchmark | Qwen3.5-9B | Qwen3.5-9B-NVFP4 (this model) | Recovery |
|---|---|---|---|
| GPQA Diamond | 78.79% | 74.24% | 94.2% |
| IFEval | 94.48% | 92.69% | 98.1% |
| MMLU-Redux | 91.80% | 91.39% | 99.6% |
| Average | 88.36% | 86.11% | 97.3% |
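In the table above, each recovery figure is the quantized score divided by the baseline score, and the bottom-row 97.3% is the mean of the per-benchmark recoveries (not the ratio of the two column averages, which would give a slightly different number). Reproducing the arithmetic:

```python
# Scores from the evaluation table (percent).
baseline  = {"GPQA Diamond": 78.79, "IFEval": 94.48, "MMLU-Redux": 91.80}
quantized = {"GPQA Diamond": 74.24, "IFEval": 92.69, "MMLU-Redux": 91.39}

# Per-benchmark recovery: quantized / baseline, in percent.
recoveries = {k: 100 * quantized[k] / baseline[k] for k in baseline}

# Average recovery is the mean of the per-benchmark recoveries.
avg_recovery = sum(recoveries.values()) / len(recoveries)  # ~97.3
```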