F16-F32?

#1 by Trilogix1

Hi, this model is particularly interesting (good job :).
I was wondering if you could produce f16 and f32 versions, and maybe also keep f32 for the weights when doing the quantization.

Thanks in advance.

Here you go 😀
https://huggingface.co/geoffmunn/Qwen3-Coder-30B-A3B-Instruct-f32

The base GGUF model is definitely f32 - it was twice the size of the f16 version, but all the quantised models are exactly the same size, so I don't think you'll notice any difference except perhaps more memory usage.
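
For reference, the f32 base is produced roughly like this (a minimal sketch around llama.cpp's convert_hf_to_gguf.py; the paths and model directory are examples, not necessarily my exact commands):

```python
#!/usr/bin/env python3
"""Sketch: export a Hugging Face model to f32 and f16 GGUF with llama.cpp's converter.

The model directory and llama.cpp path are example values, not the exact
setup used for the uploaded files.
"""
import subprocess

MODEL_DIR = "Qwen3-Coder-30B-A3B-Instruct"  # local HF snapshot (example path)
LLAMA_CPP = "llama.cpp"                      # path to a llama.cpp checkout (example)

for outtype in ("f32", "f16"):
    outfile = f"{MODEL_DIR}-{outtype}.gguf"
    # convert_hf_to_gguf.py writes a full-precision GGUF; --outtype chooses
    # whether the tensors are stored as f32 (about twice the size) or f16.
    subprocess.run(
        ["python3", f"{LLAMA_CPP}/convert_hf_to_gguf.py", MODEL_DIR,
         "--outtype", outtype, "--outfile", outfile],
        check=True,
    )
```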

I haven't been able to test them yet, so let me know how it goes.

The F32 will always be of superior quality. Even a 0.2% margin of error is enough to make a result unreliable. What is missing now to push the accuracy further is the Q8_K_XL quantization and maybe Thinking mode (so Qwen/Qwen3-30B-A3B-Thinking-2507 in f32 Q8_K_XL). Right now your quantized models produce the best results with Qwen Coder :)
Can you share the script you used to convert them to f16/f32?

Thank you very much for your work; I am adding your F32 GGUF to my top 10.

I'm using the llama-quantize script to generate these - and there's no such thing as a K_XL quant option unfortunately. I have actually done quite a bit of comparison work across all the standard Qwen3 models and the mid-range quant options are definitely the best across all the temperature ranges - you don't get any benefit from a higher precision model like Q8_0. I'm not saying they produce bad results, it's just not worth the resource investment.
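
Since you asked for the script, here is roughly what that step looks like (a minimal sketch around llama.cpp's stock llama-quantize binary; the file names and quant list are illustrative rather than my exact script):

```python
#!/usr/bin/env python3
"""Sketch: produce the standard quants from an f32 base GGUF with llama-quantize.

File names and the quant type list are illustrative; llama-quantize is the
stock binary built from llama.cpp (it has no K_XL option).
"""
import subprocess

BASE_GGUF = "Qwen3-Coder-30B-A3B-Instruct-f32.gguf"  # f32 base model (example name)
QUANTS = ["Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"]        # standard llama.cpp quant types

for quant in QUANTS:
    outfile = BASE_GGUF.replace("-f32.gguf", f"-{quant}.gguf")
    # Usage: llama-quantize <input.gguf> <output.gguf> <type> [nthreads]
    subprocess.run(["llama-quantize", BASE_GGUF, outfile, quant], check=True)
```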

I can generate the f32 versions of the Thinking-2507 model if that would help?

The K_XL quants are made with the Unsloth method, but you are using the traditional llama-quantize. (I would appreciate the full script anyway, so if possible just include it in the repo or paste it here.)
According to my tests in math, coding, creative writing, general knowledge, etc., the Q4 and Q6 perform quite well compared to the others (something breaks the Q5, making it loop or stop at large context sizes; I tested with many versions of llama.cpp, and surprisingly this does not happen with Q4, Q6, or Q8_K_M). I also notice that the K_XL quants work really well, with a bit more accuracy while keeping the speed. A sketch of the kind of sweep I run is below.
I couldn't establish a test verdict between F16 and F32 yet (I can't notice the difference yet, or maybe the F32 is behaving worse, especially in Q5), but mathematically speaking it should matter.
When I say that your GGUF models are the highest quality available for download, I mean it, so again great job :)
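
This is a sketch of the kind of sweep I mean (using llama.cpp's llama-cli; the model file names, prompt, and context size are only examples, not my exact test harness):

```python
#!/usr/bin/env python3
"""Sketch: run the same prompt across several quants at a large context size.

Model file names, the prompt, and the context length are examples only;
llama-cli is the stock llama.cpp binary.
"""
import subprocess

MODELS = [
    "Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf",
    "Qwen3-Coder-30B-A3B-Instruct-Q5_K_M.gguf",  # the one that loops/stops for me
    "Qwen3-Coder-30B-A3B-Instruct-Q6_K.gguf",
]
PROMPT = "Write a Python function that merges two sorted lists."  # example task

for model in MODELS:
    print(f"=== {model} ===")
    # -c sets the context window, -n caps generated tokens, --temp the sampling temperature.
    subprocess.run(
        ["llama-cli", "-m", model, "-c", "32768", "-n", "512",
         "--temp", "0.7", "-p", PROMPT],
        check=True,
    )
```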

If you could generate the F16 or F32 of the Thinking 2507, that would be great. Two other models that have potential and deserve a try with the same treatment are https://huggingface.co/armand0e/gpt-oss-20b-glm-4.6-distill and https://huggingface.co/TeichAI/gpt-oss-20b-claude-4.5-sonnet-high-reasoning-distill.
