Intro

This is a 2.54 BPW quantization of Qwen3.5-397B-A17B, using a recipe inspired by @AesSedai and @ubergarm.

My goal was to maximize BPW for my hardware (128 GB M1 Ultra) while leaving enough room for 128K of context.

The recipe is:

TYPE_FFN_GATE_UP_EXPS=IQ2_XXS
TYPE_FFN_DOWN_EXPS=IQ3_XXS
TYPE_TOKEN_EMBEDDING=Q4_K
TYPE_OUTPUT=Q6_K
TYPE_DEFAULT=Q8_0
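The recipe variables above are consumed by the repo's quantize script. As an illustration only (the tensor-name patterns and the variable-to-flag mapping are my assumptions, not the script's actual contents), they roughly correspond to a llama-quantize invocation with per-tensor overrides:

```shell
# Hypothetical sketch of the recipe as llama-quantize flags.
# --tensor-type, --token-embedding-type and --output-tensor-type are real
# llama.cpp options; the expert tensor names are assumed, not verified.
TYPE_FFN_GATE_UP_EXPS=iq2_xxs
TYPE_FFN_DOWN_EXPS=iq3_xxs

build_cmd() {
  echo "llama-quantize --imatrix imatrix.gguf" \
    "--tensor-type ffn_gate_exps=$TYPE_FFN_GATE_UP_EXPS" \
    "--tensor-type ffn_up_exps=$TYPE_FFN_GATE_UP_EXPS" \
    "--tensor-type ffn_down_exps=$TYPE_FFN_DOWN_EXPS" \
    "--token-embedding-type q4_k --output-tensor-type q6_k" \
    "input-F16.gguf output-IQ3_XXS.gguf q8_0"
}
build_cmd
```

The trailing q8_0 is the fallback type for every tensor not matched by an override, which is what TYPE_DEFAULT expresses.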

Running

This is the command I use to run it locally:

llama-server --no-mmap --no-warmup -fa on --model IQ3_XXS/Qwen3.5-397B-A17B-IQ3_XXS-00001-of-00004.gguf --mmproj mmproj-F16.gguf --ctx-size 131072 --jinja --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 -cram 0

I pass -cram 0 to disable the server's prompt-cache RAM budget, because the model plus 128K context already consumes 100% of my available RAM.
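As a back-of-the-envelope check on why RAM is fully consumed (my arithmetic, not a figure from this card): the weights alone at the stated bits-per-weight nearly fill a 128 GB machine before any KV cache is allocated.

```shell
# Rough estimate of weight size: 2.54 bits/param * 397e9 params, in GiB.
# The KV cache for 131072 tokens of context comes on top of this.
weights_gib=$(awk 'BEGIN { printf "%.1f", 397e9 * 2.54 / 8 / 1024 / 1024 / 1024 }')
echo "${weights_gib} GiB"
```

This lands near 117 GiB, which leaves only a thin margin on a 128 GB machine and is why the prompt cache is disabled above.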

Quantizing

Assuming the original model is located at ../../Qwen/Qwen3.5-397B-A17B and llama.cpp (with built binaries) is located at ~/llama.cpp, the full quantization is done with:

./scripts/convert-to-gguf.sh ~/llama.cpp ../../Qwen/Qwen3.5-397B-A17B
./scripts/quantize.sh ~/llama.cpp IQ3_XXS

The quantization depends on imatrix.gguf, which was copied from @ubergarm's 397B repo.
