Your recipes are great quality!

#1
by ubergarm - opened

Thanks for releasing these great MoE-optimized recipes for this amazing model! Keeping attn at full q8_0 and only quantizing the routed experts makes a noticeable difference over other leading recipes for mainline llama.cpp!

Great work and don't forget to eat dinner and get some rest! 😜 β™₯οΈπŸŽ‰

I just downloaded the Q5 version, but I was wondering whether mmproj files are available for it as well?

I'm not sure how others are using the models, or if it's something I am doing wrong, but I am getting code: 500 errors in both Roo Code and OpenCode as soon as it tries to read a file.

srv operator(): got exception: {"error":{"code":500,"message":"\n------------\nWhile executing FilterExpression at line 120, column 73 in source:\n..._name, args_value in tool_call.arguments|items %}↡ {{- '<...\n ^\nError: Unknown (built-in) filter 'items' for type String","type":"server_error"}}
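The error comes from the chat template applying the Jinja-style `items` filter to `tool_call.arguments` when the arguments arrive as a raw JSON string instead of a parsed object. A minimal Python sketch of what's going wrong (the `items_filter` helper here is hypothetical, mimicking the server's template filter, not llama.cpp's actual code):

```python
import json

def items_filter(value):
    # Mimics the template's "items" filter: only mappings have item pairs.
    if not isinstance(value, dict):
        raise TypeError(f"Unknown filter 'items' for type {type(value).__name__}")
    return list(value.items())

# When tool-call arguments are an already-parsed object, the filter works:
print(items_filter({"path": "src/main.py"}))  # [('path', 'src/main.py')]

# When they arrive as a raw JSON string, the template fails like the log above:
raw = '{"path": "src/main.py"}'
try:
    items_filter(raw)
except TypeError as e:
    print("filter failed:", e)

# Parsing the string first (roughly what the autoparser work does) fixes it:
print(items_filter(json.loads(raw)))
```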

I tried Unsloth quants too - also failing πŸ™

Here's how I am running the model:
llama-server --model /Volumes/LLM/Qwen3.5-397B-A17B-Q5_K_M-AesSedai-GGUF/Qwen3.5-397B-A17B-Q5_K_M-00001-of-00008.gguf --host 0.0.0.0 --ctx-size 0 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --alias Q3.5 --port 8111

Try using the autoparser branch on llama.cpp from pwilkin: https://github.com/ggml-org/llama.cpp/pull/18675

Thank you, I will!

But that still leaves me curious as to how other people are running these GGUFs, and what they are using them for - since obviously agentic coding is a big use-case.

Also, not sure if you saw my earlier message about the mmproj file for your Q5 quant.

BTW: The 18675 PR worked, thank you.

Using your Q5 version, on a M3 Ultra 512GB, here were the speeds in Roo Code where the context was ~57K into a task:

slot init_sampler: id 3 | task 2697 | init sampler, took 4.89 ms, tokens: text = 57868, total = 57868
slot update_slots: id 3 | task 2697 | created context checkpoint 6 of 8 (pos_min = 57355, pos_max = 57355, size = 186.329 MiB)
slot print_timing: id 3 | task 2697 |
prompt eval time = 58281.31 ms / 12027 tokens ( 4.85 ms per token, 206.36 tokens per second)
eval time = 112314.18 ms / 1938 tokens ( 57.95 ms per token, 17.26 tokens per second)
total time = 170595.49 ms / 13965 tokens
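The per-second figures in the log follow directly from the reported times and token counts; a quick sanity check:

```python
# Recompute the throughput figures from the timing log above.
prompt_ms, prompt_tokens = 58281.31, 12027
eval_ms, eval_tokens = 112314.18, 1938

prompt_tps = prompt_tokens / (prompt_ms / 1000)
eval_tps = eval_tokens / (eval_ms / 1000)

print(f"prompt: {prompt_tps:.2f} tok/s")  # 206.36, matching the log
print(f"eval:   {eval_tps:.2f} tok/s")    # 17.26, matching the log
```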

However, it struggled a bit generating Mermaid diagrams; the syntax was a bit off.

Sorry, for the mmproj files you can use Unsloth's for now. I've been meaning to upload them to my repo here as well, but they should be interchangeable.

It is interesting - I've been going back and forth testing the same question in Roo multiple times, comparing Unsloth's Q3XL vs your Q5, and here's what I've noticed:

The Q3 has no problems generating Mermaid diagrams, whereas the Q5 keeps producing syntax errors.

However, the Q3 answers are slightly shorter while also being slightly more wrong.

The Q5 needs a bit of nudging but is less wrong.

I'm wondering what GGUF to try next. I would ideally not have to move up to a Q8 (since I'd like to have other models running in parallel).

The next one up in line would be trying Unsloth's Q6_K, though once you get up to higher BPWs like that you don't see the same benefit that these MoE-optimized quants give.
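To see why the BPW choice matters for running other models in parallel, here's a back-of-envelope size estimate. The 397B parameter count comes from the model name in this thread; the BPW figures are rough typical values for each quant type, not exact measurements:

```python
# Rough GGUF size from parameter count and bits-per-weight (BPW).
def gguf_size_gib(params_billion: float, bpw: float) -> float:
    return params_billion * 1e9 * bpw / 8 / 2**30

# Approximate BPW per quant type (assumed, varies by recipe):
for name, bpw in [("Q3_K_XL", 3.5), ("Q5_K_M", 5.5), ("Q6_K", 6.6), ("Q8_0", 8.5)]:
    print(f"{name}: ~{gguf_size_gib(397, bpw):.0f} GiB")
```

On a 512 GB machine the jump from Q5/Q6 to Q8 eats most of the headroom, which is the trade-off being weighed above.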

I've also made and uploaded the mmproj files for this model.

Interesting work, but the quants are kinda big compared to Unsloth's UD XL ones.
