Mixed Precision GGUF layer quantization of Qwen3.5-35B-A3B by Qwen
Original model: https://huggingface.co/Qwen/Qwen3.5-35B-A3B
The hybrid quant employs different quantization levels on a per-layer basis to enable both high performance and small file size at the same time. The quants employed are all K-quants to avoid slow processing of IQ quants on CPUs or older GPUs. For this file the layer quants are as follows:
Q4_K_L : Q4_K_M + attn_o = q6_k
Q5_K_L : attn_v = q8_0 attn_o = q6_k ffn_d = q6_k
Q6_K_S : Q6_K
LAYER_TYPES='[
[0 ,"Q5_K_S"], [1 ,"Q4_K_M"], [2 ,"Q4_K_S"], [3 ,"Q4_K_M"], [4 ,"Q4_K_S"], [5 ,"Q4_K_M"], [6 ,"Q4_K_S"], [7 ,"Q4_K_M"],
[8 ,"Q4_K_S"], [9 ,"Q4_K_S"], [10,"Q4_K_S"], [11,"Q4_K_S"], [12,"Q4_K_S"], [13,"Q4_K_S"], [14,"Q4_K_S"], [15,"Q4_K_S"],
[16,"Q4_K_M"], [17,"Q4_K_S"], [18,"Q4_K_M"], [19,"Q4_K_S"], [20,"Q4_K_M"], [21,"Q4_K_S"], [22,"Q4_K_M"], [23,"Q4_K_S"],
[24,"Q4_K_M"], [25,"Q4_K_M"], [26,"Q4_K_M"], [27,"Q4_K_M"], [28,"Q4_K_M"], [29,"Q4_K_M"], [30,"Q4_K_M"], [31,"Q4_K_M"],
[32,"Q4_K_M"], [33,"Q4_K_M"], [34,"Q4_K_M"], [35,"Q4_K_L"], [36,"Q5_K_S"], [37,"Q5_K_M"], [38,"Q5_K_L"], [39,"Q6_K_S"]
]'
FLAGS="--token-embedding-type Q6_K --output-tensor-type Q6_K --layer-types-high"
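The bulk of the layers sit at Q4, with quality ramped up toward the first and final layers. The per-layer assignment can be summarized with a short script (a minimal sketch; the string is copied verbatim from the config above):

```python
# A minimal sketch that parses the LAYER_TYPES config above and summarizes
# the quant distribution across the 40 transformer layers.
import json
from collections import Counter

LAYER_TYPES = '''[
[0 ,"Q5_K_S"], [1 ,"Q4_K_M"], [2 ,"Q4_K_S"], [3 ,"Q4_K_M"], [4 ,"Q4_K_S"], [5 ,"Q4_K_M"], [6 ,"Q4_K_S"], [7 ,"Q4_K_M"],
[8 ,"Q4_K_S"], [9 ,"Q4_K_S"], [10,"Q4_K_S"], [11,"Q4_K_S"], [12,"Q4_K_S"], [13,"Q4_K_S"], [14,"Q4_K_S"], [15,"Q4_K_S"],
[16,"Q4_K_M"], [17,"Q4_K_S"], [18,"Q4_K_M"], [19,"Q4_K_S"], [20,"Q4_K_M"], [21,"Q4_K_S"], [22,"Q4_K_M"], [23,"Q4_K_S"],
[24,"Q4_K_M"], [25,"Q4_K_M"], [26,"Q4_K_M"], [27,"Q4_K_M"], [28,"Q4_K_M"], [29,"Q4_K_M"], [30,"Q4_K_M"], [31,"Q4_K_M"],
[32,"Q4_K_M"], [33,"Q4_K_M"], [34,"Q4_K_M"], [35,"Q4_K_L"], [36,"Q5_K_S"], [37,"Q5_K_M"], [38,"Q5_K_L"], [39,"Q6_K_S"]
]'''

layers = json.loads(LAYER_TYPES)
counts = Counter(quant for _, quant in layers)
for quant, n in counts.most_common():
    print(f"{quant}: {n} layers")
```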
The quant was sized to run fully offloaded in 24GB VRAM with some room left for both the vision tower and context. The layer quants were optimized for an essentially 100% success rate across a curated set of reasoning test prompts using greedy sampling and CPU expert offload. The minimum quant across layers is Q4_K_S, with the final transformer layer at Q6_K.
Comparison:
| Quant | size | PPL | Comment |
|---|---|---|---|
| Q4_K_M | 21.2e9 | 6.8 | Q4_K_M with default embedding and output |
| Q4_K_H | 21.4e9 | 6.7 | Hybrid quant with Q6_K embedding Q6_K output |
Usage:
Qwen3.5-35B-A3B is a vision-capable MoE RL model. Used together with its multimedia projector layers, it can process image and text inputs and generate text outputs. The mmproj file is made available in this repository.
Update 3/18/26: the original mmproj had BF16 tensors. It is still available, unmodified, renamed to *.mmproj.BF16.gguf. The new F16 mmproj is now the default to enable operation across the widest range of platforms.
Speculative decoding does not work with this model due to the attention scheme it uses.
The model can be run fully offloaded into 24GB VRAM, or with CPU expert layer offload via the config OT="-ot exps=CPU -ngl 99". Because the model is a 3B-active MoE, CPU expert offload still gives a good generation rate with very large context available.
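The offload config above can be dropped into a server launch command. A minimal sketch of composing such a command (the context size is an illustrative assumption; file names are the ones from this repository):

```python
# Compose a llama-server launch line with CPU expert offload, using the
# OT config from above. The expert tensors stay on CPU (-ot exps=CPU)
# while all layers otherwise go to the GPU (-ngl 99).
import shlex

OT = "-ot exps=CPU -ngl 99"
cmd = ["llama-server",
       "-m", "Qwen3.5-35B-A3B.Q4_K_H.gguf",
       "--mmproj", "Qwen3.5-35B-A3B.mmproj.gguf",
       "-c", "32768"] + shlex.split(OT)
print(" ".join(cmd))
```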
On a 9900k/4070 or 2x 4070 setup (1 RPC) approx performance is:
| CONFIG (no vision tower) | KV cache type | KV size | gen tps |
|---|---|---|---|
| 4070+ 9900k CPU exp offload | F16 | 480k | 29 |
| 4070+ 9900k CPU exp offload | Q8_0 | 832k | 29 |
| 2x4070 (RPC) | F16 | 32k + | 57 |
| 2x4070 (RPC) | Q8_0 | 64k + | 71 |
As of 2/26/26 there is a bug in llama.cpp which results in crashes when offloading the model to multiple GPUs, via RPC or otherwise. However, the model is fully stable with CPU expert offload and one local GPU.
The model appears to be trained to decide on its own whether to emit a think block. When it does emit one it falls into very heavy overthinking, but it does come up with accurate answers. Over a curated set of eval prompts the model did exceptionally well. To avoid the overthinking, inject think start and think stop tokens first thing after the assistant prompt:
THINK_START="<think>\n"
THINK_STOP="\n</think>\n\n"
If the model decides not to think on a given prompt it will emit this empty think block automatically. To force the model into a think block, inject a bootstrap think block following the assistant prompt:
"<think>\nHere's a thinking process to solve the problem:"
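The two injection modes above can be sketched as follows (a minimal sketch; the ChatML-style assistant opening tag is an assumption about the chat template, and the THINK_START/THINK_STOP strings are copied from above):

```python
# Sketch of the two think-control injections: an empty think block to skip
# thinking, or a bootstrap think block to force thinking.
THINK_START = "<think>\n"
THINK_STOP = "\n</think>\n\n"

def assistant_prefix(force_think: bool) -> str:
    prefix = "<|im_start|>assistant\n"  # assumed ChatML-style assistant opening
    if force_think:
        # bootstrap an open think block so the model continues reasoning inside it
        return prefix + "<think>\nHere's a thinking process to solve the problem:"
    # inject a complete empty think block so the model skips thinking
    return prefix + THINK_START + THINK_STOP

print(assistant_prefix(False))
print(assistant_prefix(True))
```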
The model was found to be highly capable on reasoning tasks when skipping the think block, with low overthinking and accurate direct deductions to final solutions. On some trick problems it does seem to get somewhat confused and transition into a think mode even though the response was started with THINK_START THINK_STOP; it will then end with a second THINK_STOP and distill the final answer.
The model was tested in vision mode on a couple of fairly tough bird ID images and did spectacularly well, with a very detailed think block unlike any model I have seen to date outside of Qwen3.5-27B.
The model was tested across a small set of code gen prompts and found to be quite intermittent in its ability to generate working code, with or without the think block enabled, failing to produce a working program about half the time.
The minimum llama.cpp version to run Qwen3.5-35B-A3B should be b8148, which corrects a graph error that causes crashes in both RPC and multiple local GPU setups. If the model is run over RPC it will crash due to an unresolved memory leak in RPC: https://github.com/ggml-org/llama.cpp/issues/19892; a temporary workaround is to set GGML_CUDA_DISABLE_GRAPHS=1 when launching the rpc server.
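The RPC workaround above amounts to setting one environment variable before the server starts. A minimal sketch (the rpc-server binary name and port are assumptions for illustration):

```python
# Launch environment for the rpc server with CUDA graphs disabled, per the
# GGML_CUDA_DISABLE_GRAPHS=1 workaround described above.
import os
import shlex

env = dict(os.environ, GGML_CUDA_DISABLE_GRAPHS="1")
cmd = shlex.split("rpc-server -p 50052")  # assumed binary/port, for illustration
print(env["GGML_CUDA_DISABLE_GRAPHS"], " ".join(cmd))
```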
Benchmarks:
A full set of both math and vision benchmarks for the model will eventually be given here: https://huggingface.co/spaces/steampunque/benchlm
Download the files below:
| Link | Type | Size/e9 B | Notes |
|---|---|---|---|
| Qwen3.5-35B-A3B.Q4_K_H.gguf | Q4_K_H | 21.4 | ~Q4_K_M size |
| Qwen3.5-35B-A3B.mmproj.gguf | F16 | 0.90 | multimedia projector |
| Qwen3.5-35B-A3B.mmproj.BF16.gguf | BF16 | 0.90 | multimedia projector |
A discussion thread about the hybrid layer quant approach can be found here on the llama.cpp git repository: