Any chance of IQ2_XXS? IQ2_XS is just slightly too big for Strix Halo.
I'm really enjoying the Bartowski IQ2_XXS (holds up shockingly well) but I know it's only a matter of time before it attempts to nanny me.
IQ2_XS is just slightly too big for Strix Halo.
Why is that? Doesn't Strix Halo also have 128G unified memory? I think I read somewhere you can increase the amount of memory available for the GPU, which I also had to do with my M1 ultra (now I can allocate up to 125G to video!).
As for IQ2_XXS, yeah, I can do it eventually. I only recently started playing with creating my own quants, and I'm currently experimenting with variations of @ubergarm's recipe!
I'm pretty deep into optimizing this platform, but the most memory possible is 124 GB with Fedora Server. Practically, it's 122 GB on my current Fedora LXDE system (that setup gives me a reliable Bluetooth conference-speaker connection for a seamless voice assistant). I did try terminating all unneeded services and running the XS quant, but it still blew up. Even if I could get it running, there would be no room for context.
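For anyone wondering how the GPU-visible memory limit gets raised on Linux for an AMD APU like Strix Halo, it's typically done with kernel boot parameters along these lines. The values below are illustrative for roughly 120 GiB, not my exact setup; check the parameters against your kernel version before relying on them:

```shell
# Sketch: raise GTT (GPU-addressable system memory) for an AMD APU.
# Edit /etc/default/grub and extend the kernel command line, e.g.:
GRUB_CMDLINE_LINUX="... amdgpu.gttsize=122880 ttm.pages_limit=31457280"
#   amdgpu.gttsize  - GTT size in MiB (122880 MiB = 120 GiB)
#   ttm.pages_limit - max number of 4 KiB pages TTM may allocate
#                     (31457280 pages * 4 KiB = 120 GiB)
# Then regenerate the grub config and reboot, e.g. on Fedora:
#   sudo grub2-mkconfig -o /boot/grub2/grub.cfg && sudo reboot
```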
If I change the DOWN tensors from XS to XXS, the size would change from 116130.32 MiB (2.46 BPW) to 112290.32 MiB (2.38 BPW), a reduction of approximately 4 GB.
TBH I have no idea if that will work but we can give it a shot. WDYT?
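For what it's worth, the delta between the two quoted sizes can be sanity-checked with a quick awk one-liner (figures taken straight from the post above), and it comes out a bit under 4: 3.75 GiB.

```shell
# Sanity-check the size difference between the two proposed variants.
awk 'BEGIN {
  xs  = 116130.32   # IQ2_XS ffn_down_exps variant, MiB
  xxs = 112290.32   # IQ2_XXS ffn_down_exps variant, MiB
  printf "delta: %.2f MiB (%.2f GiB)\n", xs - xxs, (xs - xxs) / 1024
}'
# prints: delta: 3840.00 MiB (3.75 GiB)
```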
That 4 GB might be just enough, but it would be tight. It might be safer to go down further for Strix Halo. Although, how certain is that calculation? Bartowski's IQ2_XXS is 107 GB. https://huggingface.co/bartowski/Qwen_Qwen3.5-397B-A17B-GGUF/tree/main/Qwen_Qwen3.5-397B-A17B-IQ2_XXS
If you make it, I'll try it.
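Part of the apparent discrepancy may just be units: Hugging Face lists decimal GB, while llama-quantize reports MiB. Converting the quoted figure (pure arithmetic; it says nothing about recipe differences, which would also account for Bartowski's smaller file):

```shell
# Convert the proposed quant's size from MiB to binary GiB and decimal GB.
awk 'BEGIN {
  mib = 112290.32                    # proposed smol quant size in MiB
  gib = mib / 1024                   # binary GiB
  gb  = mib * 1048576 / 1e9          # decimal GB, as file listings often show
  printf "%.2f GiB / %.2f GB\n", gib, gb
}'
# prints: 109.66 GiB / 117.74 GB
```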
If I change the DOWN tensors from XS to XXS
The tradition is to keep ffn_down_exps one size larger than ffn_(gate|up)_exps for most recipes. My smol recipes keep them all the same. Always keep gate|up the same size, as they are often fused with -fmoe at runtime, or mainline folks fuse them statically when converting the safetensors to bf16 GGUF (which I don't do, and which is causing some hiccups at the moment: https://github.com/ggml-org/llama.cpp/issues/20883).
To give these users a little extra headroom for more KV cache and such, you could probably sacrifice a bit more quality and use this recipe:
./build/bin/llama-quantize \
    --tensor-type ffn_down_exps=iq2_xs \
    --tensor-type ffn_gate_exps=iq2_xxs \
    --tensor-type ffn_up_exps=iq2_xxs \
    --token-embedding-type q4_K \
    --output-tensor-type q6_K \
    --imatrix /mnt/data/models/ubergarm/Qwen3.5-397B-A17B-GGUF/imatrix-Qwen3.5-397B-A17B-BF16-mainline.gguf \
    /mnt/data/models/ubergarm/Qwen3.5-397B-A17B-GGUF/Qwen3.5-397B-A17B-BF16-00001-of-00017.gguf \
    /mnt/data/models/ubergarm/Qwen3.5-397B-A17B-GGUF/Qwen3.5-397B-A17B-smol-IQ2_XS.gguf \
    Q8_0 \
    128
You can run with --dry-run to see the estimated size before committing the resources to actually cook it.
You may run into issues with the imatrix causing you grief going as low as iq2_xxs, though.