---
pipeline_tag: image-text-to-text
base_model:
- Qwen/Qwen3.5-9B
license: apache-2.0
library_name: transformers
tags:
- Surogate
- ModelOpt
- Qwen3.5
- quantized
- NVFP4
- nvfp4
- sglang
---
# Qwen3.5-9B-NVFP4
**This Qwen3.5 variant is recommended for Surogate on NVIDIA Blackwell GPUs. Check out [http://surogate.ai](http://surogate.ai)**
This is an NVFP4-quantized version of [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) (9B parameters), quantized using [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer). Weights and activations of linear layers are quantized to FP4, reducing disk size and GPU memory by ~4x compared to BF16.
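The "~4x" figure can be sanity-checked with a little arithmetic: each 4-bit weight also carries a 1/16 share of an 8-bit block scale, so the raw per-element cost is 4.5 bits versus 16 bits for BF16. A quick sketch:

```python
# Per-element storage cost of NVFP4 weights vs. BF16.
# Each weight is 4 bits (E2M1); every 16-element block shares one
# 8-bit FP8 (E4M3) scale, adding 0.5 bits per element on average.
BLOCK_SIZE = 16
FP4_BITS = 4
SCALE_BITS = 8
BF16_BITS = 16

bits_per_element = FP4_BITS + SCALE_BITS / BLOCK_SIZE  # 4.5 bits
ratio = BF16_BITS / bits_per_element                   # ~3.6x

print(f"{bits_per_element} bits/element, {ratio:.2f}x smaller than BF16")
```

The exact on-disk ratio also depends on tensors kept in higher precision (embeddings, norms) and any second-level scale factors, which is why the headline number is quoted as approximate.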
**About NVFP4 quantization:** NVFP4 on Blackwell couples a compact E2M1 FP4 codebook with blockwise FP8 (E4M3) scaling over 16-element micro-blocks, so that 4-bit stored values remain numerically useful for neural-network computation. The E2M1 codebook provides a small, nonuniform set of representable magnitudes up to ±6 and relies on saturating behavior rather than IEEE NaN/Inf encodings to maximize usable range per bit. Using an FP8 block scale (rather than power-of-two-only E8M0) enables fractional scales and error-minimizing scale selection strategies such as dual-pass evaluation comparing "map max to 6" versus "map max to 4 with clipping." On Blackwell Tensor Cores, native FP4 multipliers exploit E2M1 simplicity to reduce multiplier area while higher-precision FP32 accumulation protects dot-product accuracy.
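The scheme described above can be illustrated with a minimal NumPy sketch of blockwise quantization with a dual-pass scale search. This is a simplification for clarity, not TensorRT Model Optimizer's actual implementation: the block scale is kept in full precision rather than rounded to FP8 E4M3, and rounding is plain nearest-value.

```python
import numpy as np

# Positive E2M1 magnitudes: 1 sign bit, 2 exponent bits, 1 mantissa bit,
# no Inf/NaN encodings -- out-of-range values saturate at the top code (6).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(x, scale):
    """Round x/scale to the nearest signed E2M1 value (saturating at +-6),
    then dequantize back by multiplying with the block scale."""
    mag = np.abs(x) / scale
    idx = np.argmin(np.abs(mag[:, None] - E2M1[None, :]), axis=1)
    return np.sign(x) * E2M1[idx] * scale

def nvfp4_dequantized(block):
    """Dual-pass scale selection: evaluate two candidate block scales
    (mapping the block max to 6 or to 4) and keep whichever reconstruction
    has the lower squared error. A real implementation would also round
    the scale itself to FP8 E4M3."""
    amax = np.abs(block).max()
    if amax == 0.0:
        return block.copy()
    best, best_err = None, np.inf
    for target in (6.0, 4.0):
        scale = amax / target
        deq = quantize_block(block, scale)
        err = np.sum((deq - block) ** 2)
        if err < best_err:
            best, best_err = deq, err
    return best

rng = np.random.default_rng(0)
block = rng.normal(size=16)
deq = nvfp4_dequantized(block)
```

The nonuniform E2M1 grid means the per-element relative error grows toward the top of the range (the step from 4 to 6 is 2), which is exactly what the error-minimizing scale search trades off against clipping.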
Over recent months, we have intensified our focus on developing foundation models that deliver exceptional utility and performance. Qwen3.5 represents a significant leap forward, integrating breakthroughs in multimodal learning, architectural efficiency, reinforcement learning scale, and global accessibility to empower developers and enterprises with unprecedented capability and efficiency.
Credits to [AxionML](https://huggingface.co/AxionML) for quantizing this model.
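To serve this checkpoint with SGLang (one of the tagged runtimes), a launch command along the following lines should work on a Blackwell GPU. The repository id `AxionML/Qwen3.5-9B-NVFP4` is an assumption based on this card's title and credits; substitute the actual model path.

```shell
# Hypothetical serving command (repo id assumed from this card's title).
# SGLang is expected to pick up the NVFP4 (ModelOpt) quantization config
# from the checkpoint itself, so no extra quantization flag is passed here.
python -m sglang.launch_server \
  --model-path AxionML/Qwen3.5-9B-NVFP4 \
  --port 30000
```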
## Qwen3.5 Highlights
Qwen3.5 features the following enhancements:
- **Unified Vision-Language Foundation**: Early fusion training on multimodal tokens achieves cross-generational parity with Qwen3 and outperforms Qwen3-VL models across reasoning, coding, agents, and visual understanding benchmarks.
- **Efficient Hybrid Architecture**: Gated Delta Networks combined with sparse Mixture-of-Experts deliver high-throughput inference with minimal latency and cost overhead.
- **Scalable RL Generalization**: Reinforcement learning scaled across million-agent environments with progressively complex task distributions for robust real-world adaptability.
- **Global Linguistic Coverage**: Expanded support to 201 languages and dialects, enabling inclusive, worldwide deployment with nuanced cultural and regional understanding.
- **Next-Generation Training Infrastructure**: Near-100% multimodal training efficiency compared to text-only training and asynchronous RL frameworks supporting massive-scale agent scaffolds and environment orchestration.

For more details, please refer to our blog post [Qwen3.5](https://qwen.ai/blog?id=qwen3.5).
## Model Overview
- Type: Causal Language Model with Vision Encoder
- Training Stage: Pre-training & Post-training
- Language Model:
  - Number of Parameters: 9B
  - Hidden Dimension: 4096
  - Token Embedding: 248320 (Padded)
  - Number of Layers: 32
  - Hidden Layout: 8 × (3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN))
  - Gated DeltaNet:
    - Number of Linear Attention Heads: 32 for V and 16 for QK
    - Head Dimension: 128
  - Gated Attention:
    - Number of Attention Heads: 16 for Q and 4 for KV
    - Head Dimension: 256
    - Rotary Position Embedding Dimension: 64
  - Feed-Forward Network:
    - Intermediate Dimension: 12288
  - LM Output: 248320 (Padded)
  - MTP: trained with multiple steps
- Context Length: 262,144 tokens natively, extensible up to 1,010,000 tokens.
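The hidden layout above expands into 32 layers as follows (the names here are illustrative labels, not the checkpoint's module names):

```python
# Expand 8 × (3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN))
# into a flat per-layer list. Every layer has its own FFN, so only the
# token mixer alternates between the two types.
PATTERN = ["gated_deltanet"] * 3 + ["gated_attention"]
layers = PATTERN * 8

print(len(layers))                      # 32 layers in total
print(layers.count("gated_deltanet"))   # 24 linear-attention layers
print(layers.count("gated_attention"))  # 8 full-attention layers
```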
## Benchmark Results
### Language
| Benchmark | Qwen3.5-9B | Qwen3.5-9B-NVFP4 |
|---|---|---|
| **Knowledge & STEM** | | |
| MMLU-Pro | 82.5 | 81.2 |
| MMLU-Redux | 91.1 | 89.3 |
| C-Eval | 88.2 | 86.0 |
| SuperGPQA | 58.2 | 57.6 |
| GPQA Diamond | 81.7 | 79.4 |
| **Instruction Following** | | |
| IFEval | 91.5 | 89.2 |
| IFBench | 64.5 | 63.4 |
| MultiChallenge | 54.5 | 53.1 |
| **Long Context** | | |
| AA-LCR | 63.0 | 62.2 |
| LongBench v2 | 55.2 | 54.1 |
| **Reasoning & Coding** | | |
| HMMT Feb 25 | 83.2 | 81.8 |
| HMMT Nov 25 | 82.9 | 81.4 |
| LiveCodeBench v6 | 65.6 | 64.7 |
| OJBench | 29.2 | 28.8 |
| **General Agent** | | |
| BFCL-V4 | 66.1 | 65.0 |
| TAU2-Bench | 79.1 | 77.9 |
| VITA-Bench | 29.8 | 29.4 |
| DeepPlanning | 18.0 | 17.8 |
| **Multilingualism** | | |
| MMMLU | 81.2 | 79.8 |
| MMLU-ProX | 76.3 | 75.2 |
| NOVA-63 | 55.9 | 54.9 |
| INCLUDE | 75.6 | 74.1 |
| Global PIQA | 83.2 | 81.7 |
| PolyMATH | 57.3 | 55.9 |
| WMT24++ | 72.6 | 69.9 |
| MAXIFE | 83.4 | 80.4 |
* TAU2-Bench: we follow the official setup except for the airline domain, where all models are evaluated with the fixes proposed in the Claude Opus 4.5 system card.
* MMLU-ProX: we report the average accuracy across 29 languages.
* WMT24++: a harder subset of WMT24 after difficulty labeling and rebalancing; we report the average XCOMET-XXL scores across 55 languages.
* MAXIFE: we report the accuracy on English + multilingual original prompts (23 settings in total).
### Vision
| Benchmark | Qwen3.5-9B | Qwen3.5-9B-NVFP4 |
|---|---|---|
| **STEM and Puzzle** | | |
| MMMU | 78.4 | 76.9 |
| MMMU-Pro | 70.1 | 68.8 |
| MathVision | 78.9 | 77.2 |
| MathVista (mini) | 85.7 | 83.4 |
| We-Math | 75.2 | 72.4 |
| DynaMath | 83.6 | 80.6 |
| ZEROBench | 3.0 | 2.9 |
| ZEROBench_sub | 31.1 | 30.7 |
| VlmsAreBlind | 93.7 | 91.7 |
| BabyVision | 28.6/25.8 | 28.6/25.8 |
| **General VQA** | | |
| RealWorldQA | 80.3 | 77.9 |
| MMStar | 79.7 | 78.8 |
| MMBenchEN-DEV-v1.1 | 90.1 | 87.7 |
| SimpleVQA | 51.2 | 49.8 |
| HallusionBench | 69.3 | 67.7 |
| **Text Recognition and Document Understanding** | | |
| OmniDocBench1.5 | 87.7 | 86.6 |
| CharXiv(RQ) | 73.0 | 71.6 |
| MMLongBench-Doc | 57.7 | 56.3 |
| CC-OCR | 79.3 | 77.1 |
| AI2D_TEST | 90.2 | 88.7 |
| OCRBench | 89.2 | 86.1 |
| **Spatial Intelligence** | | |
| ERQA | 55.5 | 53.8 |
| CountBench | 97.2 | 95.8 |
| RefCOCO(avg) | 89.7 | 87.5 |
| EmbSpatialBench | 83.0 | 80.5 |
| RefSpatialBench | 58.5 | 56.9 |
| LingoQA | 80.4 | 78.0 |
| Hypersim | 13.5 | 13.2 |
| Nuscene | 11.8 | 11.4 |
| **Video Understanding** | | |
| VideoMME(w sub.) | 84.5 | 82.1 |
| VideoMME(w/o sub.) | 78.4 | 77.2 |
| VideoMMMU | 78.9 | 77.7 |
| MLVU | 84.4 | 83.3 |
| MVBench | 74.4 | 72.7 |
| LVBench | 70.0 | 68.1 |
| MMVU | 67.8 | 66.6 |
| **Visual Agent** | | |
| ScreenSpot Pro | 65.2 | 64.2 |
| OSWorld-Verified | 41.8 | 40.9 |
| AndroidWorld | 57.8 | 55.7 |
| **Tool Calling** | | |
| TIR-Bench | 45.6/31.9 | 45.6/31.9 |
| V* | 90.1/88.5 | 90.1/88.5 |
| **Medical VQA** | | |
| SLAKE | 79.0 | 78.0 |
| PMC-VQA | 57.9 | 56.7 |
| MedXpertQA-MM | 49.9 | 48.7 |
* MathVision: our model’s score is evaluated using a fixed prompt, i.e., “Please reason step by step, and put your final answer within \boxed{}.” For other models, we report the higher score between runs with and without the \boxed{} formatting.
* BabyVision, TIR-Bench, and V*: scores reported as "with CI / without CI".