---
pipeline_tag: image-text-to-text
base_model:
- Qwen/Qwen3.5-9B
license: apache-2.0
library_name: transformers
tags:
- Surogate
- ModelOpt
- Qwen3.5
- quantized
- NVFP4
- nvfp4
- sglang
---
# Qwen3.5-9B-NVFP4
**This Qwen3.5 variant is recommended for Surogate on NVIDIA Blackwell GPUs. Check out [http://surogate.ai](http://surogate.ai)**
This is an NVFP4-quantized version of [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) (9B parameters), quantized using [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer). Weights and activations of linear layers are quantized to FP4, reducing disk size and GPU memory by ~4x compared to BF16.
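The "~4x" figure can be sanity-checked with a little arithmetic: each 4-bit weight also carries a 1/16 share of an 8-bit block scale, so the raw per-element cost is 4.5 bits versus 16 bits for BF16. A quick sketch:

```python
# Per-element storage cost of NVFP4 weights vs. BF16.
# Each weight is 4 bits (E2M1); every 16-element block shares one
# 8-bit FP8 (E4M3) scale, adding 0.5 bits per element on average.
BLOCK_SIZE = 16
FP4_BITS = 4
SCALE_BITS = 8
BF16_BITS = 16

bits_per_element = FP4_BITS + SCALE_BITS / BLOCK_SIZE  # 4.5 bits
ratio = BF16_BITS / bits_per_element                   # ~3.6x

print(f"{bits_per_element} bits/element, {ratio:.2f}x smaller than BF16")
```

The exact on-disk ratio also depends on tensors kept in higher precision (embeddings, norms) and any second-level scale factors, which is why the headline number is quoted as approximate.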
**About NVFP4 quantization:** NVFP4 on Blackwell couples a compact E2M1 FP4 codebook with blockwise FP8 (E4M3) scaling over 16-element micro-blocks, so that 4-bit stored values remain numerically useful for neural-network computation. The E2M1 codebook provides a small, nonuniform set of representable magnitudes up to ±6 and relies on saturating behavior rather than IEEE NaN/Inf encodings to maximize usable range per bit. Using an FP8 block scale (rather than power-of-two-only E8M0) enables fractional scales and error-minimizing scale selection strategies such as dual-pass evaluation comparing "map max to 6" versus "map max to 4 with clipping." On Blackwell Tensor Cores, native FP4 multipliers exploit E2M1 simplicity to reduce multiplier area while higher-precision FP32 accumulation protects dot-product accuracy.
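The scheme described above can be illustrated with a minimal NumPy sketch of blockwise quantization with a dual-pass scale search. This is a simplification for clarity, not TensorRT Model Optimizer's actual implementation: the block scale is kept in full precision rather than rounded to FP8 E4M3, and rounding is plain nearest-value.

```python
import numpy as np

# Positive E2M1 magnitudes: 1 sign bit, 2 exponent bits, 1 mantissa bit,
# no Inf/NaN encodings -- out-of-range values saturate at the top code (6).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(x, scale):
    """Round x/scale to the nearest signed E2M1 value (saturating at +-6),
    then dequantize back by multiplying with the block scale."""
    mag = np.abs(x) / scale
    idx = np.argmin(np.abs(mag[:, None] - E2M1[None, :]), axis=1)
    return np.sign(x) * E2M1[idx] * scale

def nvfp4_dequantized(block):
    """Dual-pass scale selection: evaluate two candidate block scales
    (mapping the block max to 6 or to 4) and keep whichever reconstruction
    has the lower squared error. A real implementation would also round
    the scale itself to FP8 E4M3."""
    amax = np.abs(block).max()
    if amax == 0.0:
        return block.copy()
    best, best_err = None, np.inf
    for target in (6.0, 4.0):
        scale = amax / target
        deq = quantize_block(block, scale)
        err = np.sum((deq - block) ** 2)
        if err < best_err:
            best, best_err = deq, err
    return best

rng = np.random.default_rng(0)
block = rng.normal(size=16)
deq = nvfp4_dequantized(block)
```

The nonuniform E2M1 grid means the per-element relative error grows toward the top of the range (the step from 4 to 6 is 2), which is exactly what the error-minimizing scale search trades off against clipping.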
Over recent months, we have intensified our focus on developing foundation models that deliver exceptional utility and performance. Qwen3.5 represents a significant leap forward, integrating breakthroughs in multimodal learning, architectural efficiency, reinforcement learning scale, and global accessibility to empower developers and enterprises with unprecedented capability and efficiency.
Credits to [AxionML](https://huggingface.co/AxionML) for quantizing this model.
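To serve this checkpoint with SGLang (one of the tagged runtimes), a launch command along the following lines should work on a Blackwell GPU. The repository id `AxionML/Qwen3.5-9B-NVFP4` is an assumption based on this card's title and credits; substitute the actual model path.

```shell
# Hypothetical serving command (repo id assumed from this card's title).
# SGLang is expected to pick up the NVFP4 (ModelOpt) quantization config
# from the checkpoint itself, so no extra quantization flag is passed here.
python -m sglang.launch_server \
  --model-path AxionML/Qwen3.5-9B-NVFP4 \
  --port 30000
```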
## Qwen3.5 Highlights
Qwen3.5 features the following enhancements:
- **Unified Vision-Language Foundation**: Early fusion training on multimodal tokens achieves cross-generational parity with Qwen3 and outperforms Qwen3-VL models across reasoning, coding, agents, and visual understanding benchmarks.
- **Efficient Hybrid Architecture**: Gated Delta Networks combined with sparse Mixture-of-Experts deliver high-throughput inference with minimal latency and cost overhead.
- **Scalable RL Generalization**: Reinforcement learning scaled across million-agent environments with progressively complex task distributions for robust real-world adaptability.
- **Global Linguistic Coverage**: Expanded support to 201 languages and dialects, enabling inclusive, worldwide deployment with nuanced cultural and regional understanding.
- **Next-Generation Training Infrastructure**: Near-100% multimodal training efficiency compared to text-only training and asynchronous RL frameworks supporting massive-scale agent scaffolds and environment orchestration.

For more details, please refer to our blog post [Qwen3.5](https://qwen.ai/blog?id=qwen3.5).
## Model Overview
- Type: Causal Language Model with Vision Encoder
- Training Stage: Pre-training & Post-training
- Language Model:
  - Number of Parameters: 9B
  - Hidden Dimension: 4096
  - Token Embedding: 248320 (Padded)
  - Number of Layers: 32
  - Hidden Layout: 8 × (3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN))
  - Gated DeltaNet:
    - Number of Linear Attention Heads: 32 for V and 16 for QK
    - Head Dimension: 128
  - Gated Attention:
    - Number of Attention Heads: 16 for Q and 4 for KV
    - Head Dimension: 256
    - Rotary Position Embedding Dimension: 64
  - Feed-Forward Network:
    - Intermediate Dimension: 12288
  - LM Output: 248320 (Padded)
  - MTP: trained with multiple steps
- Context Length: 262,144 tokens natively, extensible up to 1,010,000 tokens.
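The hidden layout above expands into 32 layers as follows (the names here are illustrative labels, not the checkpoint's module names):

```python
# Expand 8 × (3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN))
# into a flat per-layer list. Every layer has its own FFN, so only the
# token mixer alternates between the two types.
PATTERN = ["gated_deltanet"] * 3 + ["gated_attention"]
layers = PATTERN * 8

print(len(layers))                      # 32 layers in total
print(layers.count("gated_deltanet"))   # 24 linear-attention layers
print(layers.count("gated_attention"))  # 8 full-attention layers
```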
## Benchmark Results
### Language
| Benchmark | Qwen3.5-9B | Qwen3.5-9B-NVFP4 |
|---|---|---|
| **Knowledge & STEM** | | |
| MMLU-Pro | 82.5 | 81.2 |
| MMLU-Redux | 91.1 | 89.3 |
| C-Eval | 88.2 | 86.0 |
| SuperGPQA | 58.2 | 57.6 |
| GPQA Diamond | 81.7 | 79.4 |
| **Instruction Following** | | |
| IFEval | 91.5 | 89.2 |
| IFBench | 64.5 | 63.4 |
| MultiChallenge | 54.5 | 53.1 |
| **Long Context** | | |
| AA-LCR | 63.0 | 62.2 |
| LongBench v2 | 55.2 | 54.1 |
| **Reasoning & Coding** | | |
| HMMT Feb 25 | 83.2 | 81.8 |
| HMMT Nov 25 | 82.9 | 81.4 |
| LiveCodeBench v6 | 65.6 | 64.7 |
| OJBench | 29.2 | 28.8 |
| **General Agent** | | |
| BFCL-V4 | 66.1 | 65.0 |
| TAU2-Bench | 79.1 | 77.9 |
| VITA-Bench | 29.8 | 29.4 |
| DeepPlanning | 18.0 | 17.8 |
| **Multilingualism** | | |
| MMMLU | 81.2 | 79.8 |
| MMLU-ProX | 76.3 | 75.2 |
| NOVA-63 | 55.9 | 54.9 |
| INCLUDE | 75.6 | 74.1 |
| Global PIQA | 83.2 | 81.7 |
| PolyMATH | 57.3 | 55.9 |
| WMT24++ | 72.6 | 69.9 |
| MAXIFE | 83.4 | 80.4 |
* TAU2-Bench: we follow the official setup except for the airline domain, where all models are evaluated with the fixes proposed in the Claude Opus 4.5 system card.
* MMLU-ProX: we report the average accuracy across 29 languages.
* WMT24++: a harder subset of WMT24 after difficulty labeling and rebalancing; we report the average XCOMET-XXL scores across 55 languages.
* MAXIFE: we report the accuracy on English + multilingual original prompts (23 settings in total).
### Vision
| Benchmark | Qwen3.5-9B | Qwen3.5-9B-NVFP4 |
|---|---|---|
| **STEM and Puzzle** | | |
| MMMU | 78.4 | 76.9 |
| MMMU-Pro | 70.1 | 68.8 |
| MathVision | 78.9 | 77.2 |
| MathVista (mini) | 85.7 | 83.4 |
| We-Math | 75.2 | 72.4 |
| DynaMath | 83.6 | 80.6 |
| ZEROBench | 3.0 | 2.9 |
| ZEROBench_sub | 31.1 | 30.7 |
| VlmsAreBlind | 93.7 | 91.7 |
| BabyVision | 28.6/25.8 | 28.6/25.8 |
| **General VQA** | | |
| RealWorldQA | 80.3 | 77.9 |
| MMStar | 79.7 | 78.8 |
| MMBenchEN-DEV-v1.1 | 90.1 | 87.7 |
| SimpleVQA | 51.2 | 49.8 |
| HallusionBench | 69.3 | 67.7 |
| **Text Recognition and Document Understanding** | | |
| OmniDocBench1.5 | 87.7 | 86.6 |
| CharXiv(RQ) | 73.0 | 71.6 |
| MMLongBench-Doc | 57.7 | 56.3 |
| CC-OCR | 79.3 | 77.1 |
| AI2D_TEST | 90.2 | 88.7 |
| OCRBench | 89.2 | 86.1 |
| **Spatial Intelligence** | | |
| ERQA | 55.5 | 53.8 |
| CountBench | 97.2 | 95.8 |
| RefCOCO(avg) | 89.7 | 87.5 |
| EmbSpatialBench | 83.0 | 80.5 |
| RefSpatialBench | 58.5 | 56.9 |
| LingoQA | 80.4 | 78.0 |
| Hypersim | 13.5 | 13.2 |
| Nuscene | 11.8 | 11.4 |
| **Video Understanding** | | |
| VideoMME(w sub.) | 84.5 | 82.1 |
| VideoMME(w/o sub.) | 78.4 | 77.2 |
| VideoMMMU | 78.9 | 77.7 |
| MLVU | 84.4 | 83.3 |
| MVBench | 74.4 | 72.7 |
| LVBench | 70.0 | 68.1 |
| MMVU | 67.8 | 66.6 |
| **Visual Agent** | | |
| ScreenSpot Pro | 65.2 | 64.2 |
| OSWorld-Verified | 41.8 | 40.9 |
| AndroidWorld | 57.8 | 55.7 |
| **Tool Calling** | | |
| TIR-Bench | 45.6/31.9 | 45.6/31.9 |
| V* | 90.1/88.5 | 90.1/88.5 |
| **Medical VQA** | | |
| SLAKE | 79.0 | 78.0 |
| PMC-VQA | 57.9 | 56.7 |
| MedXpertQA-MM | 49.9 | 48.7 |
* MathVision: our model’s score is evaluated using a fixed prompt, i.e., “Please reason step by step, and put your final answer within \boxed{}.” For other models, we report the higher score between runs with and without the \boxed{} formatting.
* BabyVision, TIR-Bench, and V*: scores reported as "with CI / without CI".