2507 Thinking model release

#4
by anjeysapkovski - opened

Dear team!

The current AutoRound q2ks version of Qwen3-30B-A3B-Instruct-2507 is amazingly fast and stable on consumer GPUs like the RTX 3060 12 GB and RTX 5060 16 GB, with ~110 t/s output and reasonable results on multilingual tasks.

It fits 12-16 GB VRAM cards ideally, leaving room for a reasonable context window.

I was not able to find a reasoning version of Qwen3-30B-Instruct-2507 with the same q2ks AutoRound quantization. I tried quantizing on my local PC, but 128 GB of RAM was not enough even for Qwen3 4B. Could you release the thinking model as a q2ks GGUF? Currently GPT-OSS 20B MXFP4 is the leader for GPUs with <= 16 GB VRAM.
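For context, here is a rough back-of-the-envelope of why a ~30B-parameter model at q2ks fits a 12-16 GB card with room for context (a minimal sketch only; the bits-per-weight figure and the KV-cache shape are my assumptions, not measurements):

```python
# Rough VRAM estimate for a ~30B-parameter model quantized to Q2_K_S.
# All numbers below are illustrative assumptions, not measured values.

total_params = 30.5e9        # ~30B total parameters (MoE: all experts are stored)
bits_per_weight = 2.8        # Q2_K_S lands somewhere around ~2.5-3 effective bits/weight
weight_bytes = total_params * bits_per_weight / 8

# KV cache: assume 48 layers, 4 KV heads x 128 head dim, fp16, 8k context.
layers, kv_dim, ctx, bytes_per_elem = 48, 4 * 128, 8192, 2
kv_bytes = 2 * layers * kv_dim * ctx * bytes_per_elem   # K and V tensors

gib = 1024 ** 3
print(f"weights ~{weight_bytes / gib:.1f} GiB, KV cache ~{kv_bytes / gib:.2f} GiB")
# -> roughly 10 GiB of weights plus well under 1 GiB of KV cache,
#    which is why the model fits a 12 GB card and leaves space for context.
```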

Intel org

Qwen3-30B-Instruct-2507 is not an MoE model, so a larger loss at 2-bit precision is expected. We'll look into the RAM issue once we're back in the office. Thanks!

Thanks. I'm asking for Qwen3-30B-A3B-Thinking-2507-gguf-q2ks-mixed-AutoRound.
By the way, why would the Instruct model not be MoE? According to the official documentation, both the Instruct and Thinking models are MoE with 3B active parameters (128 experts, 8 active).
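For reference, a quick sanity check of the 3B-active figure (a sketch only; the hidden size, layer count, and per-expert intermediate size below are approximate values I'm assuming from the published config):

```python
# Quick sanity check of the "3B active parameters" figure for Qwen3-30B-A3B.
# Config values below are assumptions for illustration, not authoritative.

hidden = 2048                 # hidden_size
layers = 48                   # num_hidden_layers
moe_inter = 768               # moe_intermediate_size (per expert)
n_experts, n_active = 128, 8  # num_experts, num_experts_per_tok

# Each expert is a SwiGLU MLP: gate, up and down projections.
per_expert = 3 * hidden * moe_inter
total_expert_params = layers * n_experts * per_expert
active_expert_params = layers * n_active * per_expert

print(f"all experts : ~{total_expert_params / 1e9:.1f} B")
print(f"active only : ~{active_expert_params / 1e9:.2f} B")
# ~29B of the parameters sit in the experts, but only ~1.8B are active per token;
# attention, embeddings and the shared parts add roughly another ~1B,
# which is where the ~3B "active" figure comes from.
```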

Thank you so much!

Please do the same with the 80B version!!

Intel org

Please do the same with the 80B version!!

@groxaxo Do you mean Qwen/Qwen3-Next-80B-A3B-Instruct? The Qwen3-Next series is not yet supported by llama.cpp. Once support lands, we will try to provide a quantized model.

Intel org

@anjeysapkovski Regarding "128 GB RAM was not enough even for Qwen3 4B": I haven't been able to reproduce this issue yet. Could you provide more information, for example the operating environment, the running log, etc.? We will try to reproduce and fix it.
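If it helps, something like the snippet below collects the basics being asked for (just a sketch; it assumes a Linux box and only reports packages that are actually installed):

```python
# Minimal environment report to attach alongside the run log.
import platform, shutil, subprocess, sys
from importlib.metadata import version, PackageNotFoundError

print("python     :", sys.version.split()[0])
print("os         :", platform.platform())

for pkg in ("torch", "transformers", "auto-round"):
    try:
        print(f"{pkg:<11}:", version(pkg))
    except PackageNotFoundError:
        print(f"{pkg:<11}: not installed")

try:
    import torch
    print("cuda       :", torch.version.cuda, "| gpus:", torch.cuda.device_count())
except ImportError:
    pass

# Report system RAM so the out-of-memory condition is visible in the log.
if shutil.which("free"):
    subprocess.run(["free", "-h"], check=False)
```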
