Mixed Precision GGUF layer quantization of Qwen3.5-35B-A3B by Qwen
Original model: https://huggingface.co/Qwen/Qwen3.5-35B-A3B
The hybrid quant employs different quantization levels on a per-layer basis to enable both high performance and small file size at the same time. The quants employed are all K-quants to avoid slow processing of IQ quants on CPUs or older GPUs. For this file the layer quants are as follows:
Q4_K_L : Q4_K_M + attn_o = q6_k
Q5_K_L : attn_v = q8_0 attn_o = q6_k ffn_d = q6_k
Q6_K_S : Q6_K
LAYER_TYPES='[
[0 ,"Q5_K_S"], [1 ,"Q4_K_M"], [2 ,"Q4_K_S"], [3 ,"Q4_K_M"], [4 ,"Q4_K_S"], [5 ,"Q4_K_M"], [6 ,"Q4_K_S"], [7 ,"Q4_K_M"],
[8 ,"Q4_K_S"], [9 ,"Q4_K_S"], [10,"Q4_K_S"], [11,"Q4_K_S"], [12,"Q4_K_S"], [13,"Q4_K_S"], [14,"Q4_K_S"], [15,"Q4_K_S"],
[16,"Q4_K_M"], [17,"Q4_K_S"], [18,"Q4_K_M"], [19,"Q4_K_S"], [20,"Q4_K_M"], [21,"Q4_K_S"], [22,"Q4_K_M"], [23,"Q4_K_S"],
[24,"Q4_K_M"], [25,"Q4_K_M"], [26,"Q4_K_M"], [27,"Q4_K_M"], [28,"Q4_K_M"], [29,"Q4_K_M"], [30,"Q4_K_M"], [31,"Q4_K_M"],
[32,"Q4_K_M"], [33,"Q4_K_M"], [34,"Q4_K_M"], [35,"Q4_K_L"], [36,"Q5_K_S"], [37,"Q5_K_M"], [38,"Q5_K_L"], [39,"Q6_K_S"]
]'
FLAGS="--token-embedding-type Q6_K --output-tensor-type Q6_K --layer-types-high"
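The bulk of the layers sit at Q4, with quality ramped up toward the first and final layers. The per-layer assignment can be summarized with a short script (a minimal sketch; the string is copied verbatim from the config above):

```python
# A minimal sketch that parses the LAYER_TYPES config above and summarizes
# the quant distribution across the 40 transformer layers.
import json
from collections import Counter

LAYER_TYPES = '''[
[0 ,"Q5_K_S"], [1 ,"Q4_K_M"], [2 ,"Q4_K_S"], [3 ,"Q4_K_M"], [4 ,"Q4_K_S"], [5 ,"Q4_K_M"], [6 ,"Q4_K_S"], [7 ,"Q4_K_M"],
[8 ,"Q4_K_S"], [9 ,"Q4_K_S"], [10,"Q4_K_S"], [11,"Q4_K_S"], [12,"Q4_K_S"], [13,"Q4_K_S"], [14,"Q4_K_S"], [15,"Q4_K_S"],
[16,"Q4_K_M"], [17,"Q4_K_S"], [18,"Q4_K_M"], [19,"Q4_K_S"], [20,"Q4_K_M"], [21,"Q4_K_S"], [22,"Q4_K_M"], [23,"Q4_K_S"],
[24,"Q4_K_M"], [25,"Q4_K_M"], [26,"Q4_K_M"], [27,"Q4_K_M"], [28,"Q4_K_M"], [29,"Q4_K_M"], [30,"Q4_K_M"], [31,"Q4_K_M"],
[32,"Q4_K_M"], [33,"Q4_K_M"], [34,"Q4_K_M"], [35,"Q4_K_L"], [36,"Q5_K_S"], [37,"Q5_K_M"], [38,"Q5_K_L"], [39,"Q6_K_S"]
]'''

layers = json.loads(LAYER_TYPES)
counts = Counter(quant for _, quant in layers)
for quant, n in counts.most_common():
    print(f"{quant}: {n} layers")
```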
The quant was sized to run fully offloaded in 24GB VRAM with some room left for both the vision tower and context. The layer quants were optimized for an essentially 100% success rate across a curated set of reasoning test prompts using greedy sampling and CPU expert offload. The minimum quant across layers is Q4_K_S, with the final transformer layer at Q6_K.
Comparison:
| Quant | size | PPL | Comment |
|---|---|---|---|
| Q4_K_M | 21.2e9 | 6.8 | Q4_K_M with default embedding and output |
| Q4_K_H | 21.4e9 | 6.7 | Hybrid quant with Q6_K embedding Q6_K output |
Usage:
Qwen3.5-35B-A3B is a vision-capable MoE RL model. Used together with its multimedia projector layers, it can process image and text inputs and generate text outputs. The mmproj file is made available in this repository.
Update 3/18/26: the original mmproj had BF16 tensors. It is still available, unmodified, renamed to *.mmproj.BF16.gguf. The new F16 mmproj is now the default to enable operation across the widest range of platforms.
Speculative decoding does not work with this model due to the attention scheme it uses.
The model can be run fully offloaded into 24GB VRAM, or with CPU expert layer offload via the config OT="-ot exps=CPU -ngl 99". Because the model is a 3B-active MoE, CPU expert offload still gives a good generation rate with very large context available.
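The offload config above can be dropped into a server launch command. A minimal sketch of composing such a command (the context size is an illustrative assumption; file names are the ones from this repository):

```python
# Compose a llama-server launch line with CPU expert offload, using the
# OT config from above. The expert tensors stay on CPU (-ot exps=CPU)
# while all layers otherwise go to the GPU (-ngl 99).
import shlex

OT = "-ot exps=CPU -ngl 99"
cmd = ["llama-server",
       "-m", "Qwen3.5-35B-A3B.Q4_K_H.gguf",
       "--mmproj", "Qwen3.5-35B-A3B.mmproj.gguf",
       "-c", "32768"] + shlex.split(OT)
print(" ".join(cmd))
```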
On a 9900k/4070 or 2x 4070 setup (1 RPC) approx performance is:
| CONFIG (no vision tower) | KV cache type | KV size | gen tps |
|---|---|---|---|
| 4070+ 9900k CPU exp offload | F16 | 480k | 29 |
| 4070+ 9900k CPU exp offload | Q8_0 | 832k | 29 |
| 2x4070 (RPC) | F16 | 32k + | 57 |
| 2x4070 (RPC) | Q8_0 | 64k + | 71 |
As of 2/26/26 there is a bug in llama.cpp which results in crashes when offloading the model to multiple GPUs, via RPC or otherwise. However, the model is fully stable with CPU expert offload and one local GPU.
The model appears to be trained to decide on its own whether to emit a think block. When it does emit one it falls into very heavy overthinking, but it does come up with accurate answers. Over a curated set of eval prompts the model did exceptionally well. To avoid the overthinking, inject think start and think stop tokens first thing after the assistant prompt:
THINK_START="<think>\n"
THINK_STOP="\n</think>\n\n"
If the model decides not to think on a given prompt it will emit this empty think block automatically. To force the model into a think block, inject a bootstrap think block following the assistant prompt:
"<think>\nHere's a thinking process to solve the problem:"
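The two injection modes above can be sketched as follows (a minimal sketch; the ChatML-style assistant opening tag is an assumption about the chat template, and the THINK_START/THINK_STOP strings are copied from above):

```python
# Sketch of the two think-control injections: an empty think block to skip
# thinking, or a bootstrap think block to force thinking.
THINK_START = "<think>\n"
THINK_STOP = "\n</think>\n\n"

def assistant_prefix(force_think: bool) -> str:
    prefix = "<|im_start|>assistant\n"  # assumed ChatML-style assistant opening
    if force_think:
        # bootstrap an open think block so the model continues reasoning inside it
        return prefix + "<think>\nHere's a thinking process to solve the problem:"
    # inject a complete empty think block so the model skips thinking
    return prefix + THINK_START + THINK_STOP

print(assistant_prefix(False))
print(assistant_prefix(True))
```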
The model was found to be highly capable on reasoning tasks when skipping the think block, with low overthinking and accurate direct deductions to final solutions. On some trick problems it does seem to get somewhat confused and transition into a think mode even though the response was started with THINK_START THINK_STOP; it will then end with a second THINK_STOP and distill the final answer.
The model was tested in vision mode on a couple of fairly tough bird ID images and did spectacularly well, with a very detailed think block unlike any model I have seen to date outside of Qwen3.5-27B.
The model was tested across a small set of code gen prompts and found to be quite intermittent in its ability to generate working code, with or without the think block enabled, failing to produce a working program about half the time.
The minimum llama.cpp version to run Qwen3.5-35B-A3B should be b8148, which corrects a graph error that causes crashes in both RPC and multiple local GPU setups. If the model is run over RPC it will crash due to an unresolved memory leak in RPC: https://github.com/ggml-org/llama.cpp/issues/19892; a temporary workaround is to set GGML_CUDA_DISABLE_GRAPHS=1 when launching the rpc server.
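The RPC workaround above amounts to setting one environment variable before the server starts. A minimal sketch (the rpc-server binary name and port are assumptions for illustration):

```python
# Launch environment for the rpc server with CUDA graphs disabled, per the
# GGML_CUDA_DISABLE_GRAPHS=1 workaround described above.
import os
import shlex

env = dict(os.environ, GGML_CUDA_DISABLE_GRAPHS="1")
cmd = shlex.split("rpc-server -p 50052")  # assumed binary/port, for illustration
print(env["GGML_CUDA_DISABLE_GRAPHS"], " ".join(cmd))
```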
Benchmarks:
A full set of both math and vision benchmarks for the model will eventually be given here: https://huggingface.co/spaces/steampunque/benchlm
Download the files below:
| Link | Type | Size/e9 B | Notes |
|---|---|---|---|
| Qwen3.5-35B-A3B.Q4_K_H.gguf | Q4_K_H | 21.4 | ~Q4_K_M size |
| Qwen3.5-35B-A3B.mmproj.gguf | F16 | 0.90 | multimedia projector |
| Qwen3.5-35B-A3B.mmproj.BF16.gguf | BF16 | 0.90 | multimedia projector |
A discussion thread about the hybrid layer quant approach can be found here on the llama.cpp git repository: