Intro
This is a 2.54 BPW quantization of Qwen 3.5 397B, using a recipe inspired by @AesSedai and @ubergarm.
My goal was to maximize BPW for my hardware (a 128GB M1 Ultra) while leaving enough headroom for 128K context.
The recipe is:
TYPE_FFN_GATE_UP_EXPS=IQ2_XXS
TYPE_FFN_DOWN_EXPS=IQ3_XXS
TYPE_TOKEN_EMBEDDING=Q4_K
TYPE_OUTPUT=Q6_K
TYPE_DEFAULT=Q8_0
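The 2.54 BPW figure can be sanity-checked with a back-of-the-envelope weighted average. The per-type bit widths are approximate, and the parameter fractions below are illustrative assumptions (not measured from the GGUF); in a large MoE the expert FFN tensors hold the vast majority of parameters:

```python
# Rough BPW estimate for the recipe above.
# Approximate bits-per-weight of each quant type used.
BPW = {"IQ2_XXS": 2.06, "IQ3_XXS": 3.06, "Q8_0": 8.50}

# Illustrative parameter fractions (assumptions, not measured):
fractions = {
    "IQ2_XXS": 0.65,  # ffn_gate_exps + ffn_up_exps (~2/3 of expert FFN weights)
    "IQ3_XXS": 0.32,  # ffn_down_exps (~1/3 of expert FFN weights)
    "Q8_0":    0.03,  # attention, norms, everything else (TYPE_DEFAULT)
}

avg_bpw = sum(BPW[t] * f for t, f in fractions.items())
print(f"estimated average: {avg_bpw:.2f} BPW")  # -> estimated average: 2.57 BPW
```

With these assumed fractions the estimate lands near the measured 2.54 BPW; the exact figure depends on the real tensor sizes in the GGUF.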
Running
This is the command I use to run it locally:
llama-server --no-mmap --no-warmup -fa on --model IQ3_XXS/Qwen3.5-397B-A17B-IQ3_XXS-00001-of-00004.gguf --mmproj mmproj-F16.gguf --ctx-size 131072 --jinja --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 -cram 0
I use -cram 0 to disable the prompt cache, because the model plus a full 128K context already takes 100% of my available RAM.
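To see why RAM is the binding constraint, a quick sketch of the arithmetic (the 397B and 2.54 BPW figures are from above; the KV cache for 128K context comes on top of this and depends on the model's head configuration, so it is not included):

```python
# Approximate weight memory for a 397B-parameter model at 2.54 bits per weight.
params = 397e9
bpw = 2.54
weight_gb = params * bpw / 8 / 1e9  # bits -> bytes -> GB
print(f"weights alone: ~{weight_gb:.0f} GB")  # -> weights alone: ~126 GB
```

That leaves very little of the 128GB for KV cache and the OS, which is why there is nothing to spare for a prompt cache.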
Quantizing
Assuming the original model is located at ../../Qwen/Qwen3.5-397B-A17B and llama.cpp (with built binaries) is located at ~/llama.cpp, the full quantization is done with:
./scripts/convert-to-gguf.sh ~/llama.cpp ../../Qwen/Qwen3.5-397B-A17B
./scripts/quantize.sh ~/llama.cpp IQ3_XXS
The quantization depends on imatrix.gguf, which was copied from @ubergarm's 397B repo.
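quantize.sh itself is not reproduced here, but the recipe and imatrix above map onto llama.cpp's llama-quantize roughly as follows. This is a hedged sketch, not the script's actual contents: the binary path, file names, and the --tensor-type patterns are assumptions based on current llama.cpp; check llama-quantize --help for your build.

```shell
#!/bin/sh
# Sketch of a llama-quantize invocation matching the recipe above (assumed paths/patterns).
LLAMA_QUANTIZE=~/llama.cpp/build/bin/llama-quantize

"$LLAMA_QUANTIZE" \
  --imatrix imatrix.gguf \
  --token-embedding-type q4_k \
  --output-tensor-type q6_k \
  --tensor-type "ffn_(gate|up)_exps=iq2_xxs" \
  --tensor-type "ffn_down_exps=iq3_xxs" \
  Qwen3.5-397B-A17B-F16.gguf \
  Qwen3.5-397B-A17B-IQ3_XXS.gguf \
  Q8_0  # TYPE_DEFAULT: applies to every tensor not overridden above
```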