Best non-thinking model Qwen ever released
#7 · opened by BigBlueWhale
Topic: Qwen3-VL-32B: How to fix a model and ruin a miracle at the same time
We need to talk about what happened to the 32B line. The original Qwen3-32B (April 2025) was a miracle of stability and generalization—easily the #1 open-source model for reliability.
The new VL report reveals a tragic trade-off:
- The Good (Instruct): They finally fixed the broken Instruct baseline. The original text-only Instruct model was a disaster on complex prompts (Arena-Hard: 37.4), but the VL training resurrected it to a respectable 64.7.
- The Bad (Thinking): Conversely, they suffocated the "Thinking" variant. The original text model was a creative powerhouse, but the VL Thinking variant regressed across the board compared to its text predecessor:
  - LiveBench: dropped from 76.8 to 74.7.
  - Creative Writing v3: dropped from 84.4 to 83.3.
  - Math (AIME-25): dropped from 85.0 to 83.7.
The Culprit?
It looks like data pollution. The report leans heavily on synthetic data generation using the 30B-A3B pipeline. There is nothing worse than polluting a dense masterpiece with inferior MoE synthetic sludge. They seemingly sacrificed the 32B's dense "soul" to force-fit multimodal alignment, and the degradation in reasoning stability proves it.
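(For anyone who hasn't watched this pattern up close: the complaint is essentially about sequence-level distillation, where the MoE teacher generates synthetic traces and the dense student is fine-tuned on them. Below is a minimal, hypothetical sketch of that flow; the model IDs, prompt, and hyperparameters are my own illustrative assumptions, not the actual Qwen pipeline.)

```python
# Hypothetical sketch of sequence-level distillation: an MoE "teacher"
# generates synthetic responses, and a dense "student" is fine-tuned on them.
# Model IDs, prompt, and hyperparameters are placeholders for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_id = "Qwen/Qwen3-30B-A3B"   # assumed MoE teacher role
student_id = "Qwen/Qwen3-32B"       # assumed dense student role

tok = AutoTokenizer.from_pretrained(teacher_id)
teacher = AutoModelForCausalLM.from_pretrained(
    teacher_id, torch_dtype="auto", device_map="auto"
)

# 1) Teacher generates synthetic "thinking" traces for a prompt pool.
prompts = ["Prove that the sum of two even integers is even."]  # placeholder
synthetic = []
for p in prompts:
    inputs = tok(p, return_tensors="pt").to(teacher.device)
    out = teacher.generate(
        **inputs, max_new_tokens=512, do_sample=True, temperature=0.7
    )
    synthetic.append(tok.decode(out[0], skip_special_tokens=True))

# 2) Student is fine-tuned with plain next-token cross-entropy on those traces.
student_tok = AutoTokenizer.from_pretrained(student_id)
student = AutoModelForCausalLM.from_pretrained(
    student_id, torch_dtype="auto", device_map="auto"
)
optim = torch.optim.AdamW(student.parameters(), lr=1e-5)

for text in synthetic:
    batch = student_tok(
        text, return_tensors="pt", truncation=True, max_length=2048
    ).to(student.device)
    loss = student(**batch, labels=batch["input_ids"]).loss  # standard LM loss
    loss.backward()
    optim.step()
    optim.zero_grad()
```

Even in this toy form, the student's gradient signal is shaped entirely by whatever the teacher emits, which is exactly the "pollution" risk being described.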
Great job fixing the Instruct model, but please stop distilling 30B-A3B output into the 32B Thinking weights! 😠