Best non-thinking model Qwen ever released

#7
by BigBlueWhale - opened

Topic: Qwen3-VL-32B: How to fix a model and ruin a miracle at the same time
We need to talk about what happened to the 32B line. The original Qwen3-32B (April 2025) was a miracle of stability and generalization—easily the #1 open-source model for reliability.
The new VL report reveals a tragic trade-off:

  • The Good (Instruct): They finally fixed the broken Instruct baseline. The original text-only Instruct model was a disaster on complex prompts (Arena-Hard: 37.4), but the VL training resurrected it to a respectable 64.7.
  • The Bad (Thinking): Conversely, they suffocated the "Thinking" variant. The original text model was a creative powerhouse, but the VL Thinking variant regressed across the board compared to its text predecessor:
    • LiveBench: Dropped from 76.8 to 74.7.
    • Creative Writing v3: Dropped from 84.4 to 83.3.
    • Math (AIME-25): Dropped from 85.0 to 83.7.
The Culprit?
It looks like data pollution. The report leans heavily on synthetic data generation using the 30B-A3B pipeline. There is nothing worse than polluting a dense masterpiece with inferior MoE synthetic sludge. They seemingly sacrificed the 32B's dense "soul" to force-fit multimodal alignment, and the degradation in reasoning stability proves it.

Great job fixing the Instruct model, but please stop distilling 30B-A3B output into the 32B Thinking weights! 😠
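
For anyone wondering what "distilling 30B-A3B output into the 32B weights" would look like in practice, here is a minimal sketch of the teacher-to-student synthetic-data pattern being criticised. The model IDs, prompts, and the SFT step are illustrative assumptions on my part, not the actual Qwen training pipeline:

```python
# Sketch only: a MoE "teacher" (30B-A3B) generates synthetic completions,
# and the dense 32B "student" is later fine-tuned on them. Model IDs and
# prompts are illustrative; this is NOT the published Qwen recipe.
from transformers import AutoModelForCausalLM, AutoTokenizer

TEACHER_ID = "Qwen/Qwen3-30B-A3B"   # MoE teacher (assumed)
STUDENT_ID = "Qwen/Qwen3-32B"       # dense student (assumed)

tok = AutoTokenizer.from_pretrained(TEACHER_ID)
teacher = AutoModelForCausalLM.from_pretrained(TEACHER_ID, device_map="auto")

prompts = [
    "Describe the trend shown in this quarterly revenue table.",
    "Solve step by step: 3x + 7 = 22.",
]

# Step 1: the teacher generates synthetic training targets.
synthetic_pairs = []
for p in prompts:
    inputs = tok(p, return_tensors="pt").to(teacher.device)
    out = teacher.generate(**inputs, max_new_tokens=256,
                           do_sample=True, temperature=0.7)
    completion = tok.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
    synthetic_pairs.append({"prompt": p, "response": completion})

# Step 2 (not shown): standard supervised fine-tuning of the 32B student
# on `synthetic_pairs`, e.g. with an SFT trainer. Whatever quality ceiling
# or stylistic quirks the MoE teacher has propagate into the dense student,
# which is exactly the "pollution" concern raised above.
```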
