2507 Thinking model release

#4
by anjeysapkovski - opened

Dear team!

The current AutoRound q2ks version of Qwen3-30B-A3B-Instruct-2507 is amazingly fast and stable on consumer GPUs like the RTX 3060 12 GB and RTX 5060 16 GB, with ~110 t/s output and reasonable results on multilingual tasks.

It fits 12-16 GB VRAM cards ideally, leaving room for a reasonable context window.

I was not able to find a reasoning version of Qwen3-30B-Instruct-2507 with the same q2ks AutoRound quantization. I tried quantizing on my local PC, but 128 GB of RAM was not enough even for Qwen3 4B. Could you release the thinking model as a q2ks GGUF? Currently GPT-OSS 20B MXFP4 is the leader for GPUs with <= 16 GB VRAM.
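For context, here is a rough back-of-the-envelope of why a ~30B-parameter model at q2ks fits a 12-16 GB card with room for context (a minimal sketch only; the bits-per-weight figure and the KV-cache shape are my assumptions, not measurements):

```python
# Rough VRAM estimate for a ~30B-parameter model quantized to Q2_K_S.
# All numbers below are illustrative assumptions, not measured values.

total_params = 30.5e9        # ~30B total parameters (MoE: all experts are stored)
bits_per_weight = 2.8        # Q2_K_S lands somewhere around ~2.5-3 effective bits/weight
weight_bytes = total_params * bits_per_weight / 8

# KV cache: assume 48 layers, 4 KV heads x 128 head dim, fp16, 8k context.
layers, kv_dim, ctx, bytes_per_elem = 48, 4 * 128, 8192, 2
kv_bytes = 2 * layers * kv_dim * ctx * bytes_per_elem   # K and V tensors

gib = 1024 ** 3
print(f"weights ~{weight_bytes / gib:.1f} GiB, KV cache ~{kv_bytes / gib:.2f} GiB")
# -> roughly 10 GiB of weights plus well under 1 GiB of KV cache,
#    which is why the model fits a 12 GB card and leaves space for context.
```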

Intel org

Qwen3-30B-Instruct-2507 is not an MoE model, so a larger loss at 2-bit precision is expected. We'll look into the RAM issue once we're back in the office. Thanks!

Thanks. I'm asking for Qwen3-30B-A3B-Thinking-2507-gguf-q2ks-mixed-AutoRound.
By the way, why would the Instruct model not be MoE? According to the official documentation, both the Instruct and Thinking models are MoE with 3B active parameters (128 experts, 8 active).
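For reference, a quick sanity check of the 3B-active figure (a sketch only; the hidden size, layer count, and per-expert intermediate size below are approximate values I'm assuming from the published config):

```python
# Quick sanity check of the "3B active parameters" figure for Qwen3-30B-A3B.
# Config values below are assumptions for illustration, not authoritative.

hidden = 2048                 # hidden_size
layers = 48                   # num_hidden_layers
moe_inter = 768               # moe_intermediate_size (per expert)
n_experts, n_active = 128, 8  # num_experts, num_experts_per_tok

# Each expert is a SwiGLU MLP: gate, up and down projections.
per_expert = 3 * hidden * moe_inter
total_expert_params = layers * n_experts * per_expert
active_expert_params = layers * n_active * per_expert

print(f"all experts : ~{total_expert_params / 1e9:.1f} B")
print(f"active only : ~{active_expert_params / 1e9:.2f} B")
# ~29B of the parameters sit in the experts, but only ~1.8B are active per token;
# attention, embeddings and the shared parts add roughly another ~1B,
# which is where the ~3B "active" figure comes from.
```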

Thank you so much!

Please do the same with the 80B version!!

Intel org

Please do the same with the 80B version!!

@groxaxo Do you mean Qwen/Qwen3-Next-80B-A3B-Instruct? The Qwen3-Next series is not yet supported by llama.cpp. Once support lands, we will try to provide a quantized model.

Intel org

@anjeysapkovski Regarding "128 GB RAM was not enough even for Qwen3 4B": I haven't been able to reproduce this issue yet. Could you provide more information, for example the operating environment, the running log, etc.? We will try to reproduce and fix it.
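If it helps, something like the snippet below collects the basics being asked for (just a sketch; it assumes a Linux box and only reports packages that are actually installed):

```python
# Minimal environment report to attach alongside the run log.
import platform, shutil, subprocess, sys
from importlib.metadata import version, PackageNotFoundError

print("python     :", sys.version.split()[0])
print("os         :", platform.platform())

for pkg in ("torch", "transformers", "auto-round"):
    try:
        print(f"{pkg:<11}:", version(pkg))
    except PackageNotFoundError:
        print(f"{pkg:<11}: not installed")

try:
    import torch
    print("cuda       :", torch.version.cuda, "| gpus:", torch.cuda.device_count())
except ImportError:
    pass

# Report system RAM so the out-of-memory condition is visible in the log.
if shutil.which("free"):
    subprocess.run(["free", "-h"], check=False)
```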
