The 397B feels a bit broken, while the 122B and smaller ones seem normal

#4
by Nekotekina - opened

I heard the Unsloth team had some problems with attention blocks, according to comments in this Reddit thread: https://old.reddit.com/r/LocalLLaMA/comments/1rfds1h/qwen3535ba3b_q4_quantization_comparison/
Given that these quantizations are relatively old, maybe they have similar issues?
UPD: it's hard to describe exactly what I'm doing, but in my evaluation the model seems to fail at speaking Russian once the context reaches about 12k tokens.

Actually, never mind; I don't have enough data to back up the assumed brokenness. My only real problem is that the model doesn't fit nicely in 256 GB (it needs to be about 220 GB, because other processes on the PC need RAM too).

Nekotekina changed discussion status to closed

I don't think there has been a change in either convert_hf_to_gguf or llama-quantize for this model, so I don't think these quantizations would have an issue with that. Mine use Q8_0 for attention and for the rest of the tensors that aren't the conditional expert FFNs, except IQ3_S, which uses Q6_K for those instead.
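For a rough sense of how this two-tier scheme affects the memory-fit question above, here is a minimal sketch estimating file size from bits-per-weight. The Q8_0 and Q6_K values follow llama.cpp's block layouts (34 bytes per 32 weights, 210 bytes per 256 weights); the IQ3_S value and the 90/10 expert/non-expert parameter split are assumptions for illustration, not the actual tensor breakdown of this model.

```python
# Rough GGUF size estimate for a mixed-precision MoE quantization.
# Bits-per-weight (bpw) from llama.cpp block layouts:
#   Q8_0  = 8.5    (34 bytes per 32-weight block)
#   Q6_K  = 6.5625 (210 bytes per 256-weight block)
#   IQ3_S ~ 3.44   (approximate)
BPW = {"Q8_0": 8.5, "Q6_K": 6.5625, "IQ3_S": 3.44}

def estimate_gib(total_params: float, expert_frac: float,
                 expert_quant: str, other_quant: str) -> float:
    """Approximate file size in GiB for a two-tier quantization:
    expert FFN weights at one quant, everything else at another."""
    expert_bits = total_params * expert_frac * BPW[expert_quant]
    other_bits = total_params * (1 - expert_frac) * BPW[other_quant]
    return (expert_bits + other_bits) / 8 / 2**30

# 397e9 params, ASSUMING ~90% sit in the conditional expert FFNs:
size = estimate_gib(397e9, 0.90, "IQ3_S", "Q8_0")
print(f"~{size:.0f} GiB")  # roughly 182 GiB under these assumptions
```

Under these assumed numbers the IQ3_S variant would land comfortably below the 220 GB budget mentioned above, which is consistent with it being the fallback choice here.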
