Mixed Precision GGUF layer quantization of Qwen3.5-4B by Qwen

Original model: https://huggingface.co/Qwen/Qwen3.5-4B

The mixed precision quant employs different quantization levels on a per-layer basis to achieve both high performance and small file size at the same time. The quants employed are all K-quants, avoiding the slow processing of IQ quants on CPUs and older GPUs. For this file the layer quants are as follows:

Q5_K_L : attn_v = q8_0, attn_o = q6_k, ffn_d = q6_k
Q6_K_S : baseline Q6_K
Q6_K_M : attn_v = q8_0, ffn_d = q8_0
Q6_K_L : attn_v = q8_0, attn_o = q8_0, ffn_d = q8_0

   LAYER_TYPES='[
   [0 ,"Q6_K_L"], [1 ,"Q6_K_M"], [2 ,"Q6_K_S"], [3 ,"Q5_K_L"], [4 ,"Q5_K_M"], [5 ,"Q5_K_M"], [6 ,"Q5_K_M"], [7 ,"Q5_K_M"],
   [8 ,"Q5_K_L"], [9 ,"Q5_K_M"], [10,"Q5_K_L"], [11,"Q5_K_M"], [12,"Q5_K_L"], [13,"Q5_K_M"], [14,"Q5_K_M"], [15,"Q5_K_M"],
   [16,"Q6_K_M"], [17,"Q6_K_S"], [18,"Q6_K_M"], [19,"Q6_K_S"], [20,"Q6_K_M"], [21,"Q6_K_M"], [22,"Q6_K_M"], [23,"Q6_K_M"],
   [24,"Q6_K_L"], [25,"Q6_K_M"], [26,"Q6_K_L"], [27,"Q6_K_M"], [28,"Q6_K_L"], [29,"Q6_K_L"], [30,"Q6_K_L"], [31,"Q6_K_L"]
   ]'
   FLAGS="--token-embedding-type Q6_K --output-tensor-type Q6_K --layer-types-high"

The layer quants were optimized for strong performance across a set of curated reasoning prompts with a minimum quant of Q5_K_M used across layers.
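As a sanity check, the LAYER_TYPES spec above can be parsed and tallied directly; a minimal Python sketch (the string is copied verbatim from this card and happens to parse as plain JSON):

```python
import json
from collections import Counter

# LAYER_TYPES string copied verbatim from this card; it is valid JSON.
LAYER_TYPES = '''[
[0 ,"Q6_K_L"], [1 ,"Q6_K_M"], [2 ,"Q6_K_S"], [3 ,"Q5_K_L"], [4 ,"Q5_K_M"], [5 ,"Q5_K_M"], [6 ,"Q5_K_M"], [7 ,"Q5_K_M"],
[8 ,"Q5_K_L"], [9 ,"Q5_K_M"], [10,"Q5_K_L"], [11,"Q5_K_M"], [12,"Q5_K_L"], [13,"Q5_K_M"], [14,"Q5_K_M"], [15,"Q5_K_M"],
[16,"Q6_K_M"], [17,"Q6_K_S"], [18,"Q6_K_M"], [19,"Q6_K_S"], [20,"Q6_K_M"], [21,"Q6_K_M"], [22,"Q6_K_M"], [23,"Q6_K_M"],
[24,"Q6_K_L"], [25,"Q6_K_M"], [26,"Q6_K_L"], [27,"Q6_K_M"], [28,"Q6_K_L"], [29,"Q6_K_L"], [30,"Q6_K_L"], [31,"Q6_K_L"]
]'''

layers = json.loads(LAYER_TYPES)
counts = Counter(quant for _, quant in layers)
for quant, n in sorted(counts.items()):
    print(f"{quant}: {n} layers")
```

This shows the distribution the optimization settled on: Q5_K variants concentrated in the early-middle layers, Q6_K_L concentrated at the first and last layers.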

Comparison:

| Quant | Size | PPL | Comment |
|-------|------|-----|---------|
| Q6_K | 3.5e9 | 9.7 | Q6_K with default embedding and output |
| Q6_K_H | 3.4e9 | 9.7 | Mixed precision quant with Q6_K embedding, Q6_K output |

Usage:

Qwen3.5-4B is a vision-capable dense RL edge model. Together with its multimedia projector layers it can process image and text inputs and generate text outputs, while being sized for applications on small, low-resource edge platforms. The mmproj file is available in this repository.

Update 3/18/26: the original mmproj had BF16 tensors. It is still available, unmodified, renamed to *.mmproj.BF16.gguf. A new F16 mmproj is now the default, to enable operation across the widest range of platforms.

Speculative decoding does not work with this model due to the attention scheme it uses. On a 4070 with all layers and context in VRAM and no vision tower loaded, approximate performance is:

| Quant | KV cache type | Context (KV) | Gen t/s |
|-------|---------------|--------------|---------|
| Q6_K_H | F16 | 240k | 84 |
| Q6_K_H | Q8_0 | 440k | 85 |
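The context figures above are consistent with simple KV-cache arithmetic: Q8_0 roughly halves the per-token KV cost versus F16, so roughly twice the context fits in the same VRAM. A back-of-envelope sketch (the layer/head dimensions below are hypothetical placeholders, not read from this model; the real values live in the GGUF metadata):

```python
# Per-token KV cache size: one K and one V vector per layer.
# HYPOTHETICAL dims for illustration only; read the real ones from GGUF metadata.
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elt: float) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elt  # K + V

f16 = kv_bytes_per_token(32, 8, 128, 2.0)     # F16: 2 bytes/element
q8  = kv_bytes_per_token(32, 8, 128, 1.0625)  # Q8_0: 34 bytes per 32-elt block

print(f"F16 : {f16 / 1024:.0f} KiB/token")
print(f"Q8_0: {q8 / 1024:.0f} KiB/token")
print(f"context ratio at fixed VRAM: ~{f16 / q8:.2f}x")
```

The ~1.88x ratio from Q8_0's per-block scale overhead is close to the observed 240k vs 440k context split.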

A long-context (needle in a haystack) test passed with fast prompt processing, making large contexts actually usable with this model.

The model appears to be trained to decide on its own whether to emit a think block. When it does emit one, it falls into very heavy overthinking but does arrive at accurate answers. Over a small set of eval prompts the model did extremely well. To avoid the overthinking, inject think-start and think-stop tokens immediately after the assistant prompt:

THINK_START="<think>\n"
THINK_STOP="\n</think>\n\n"

If the model decides not to think on a given prompt, it emits this empty think block automatically. To force the model into a think block, inject a bootstrap think block after the assistant prompt:

"<think>\nHere's a thinking process to solve the problem:"
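Both injections amount to pre-seeding the assistant turn before generation starts. A minimal sketch of the prompt assembly, assuming the ChatML-style template used by the Qwen family (the authoritative template is stored in the GGUF metadata):

```python
# Pre-seed the assistant turn to control think behaviour.
# ASSUMPTION: ChatML-style framing (<|im_start|>/<|im_end|>), as used by the
# Qwen family; check the chat template in the GGUF metadata for this model.
THINK_START = "<think>\n"
THINK_STOP = "\n</think>\n\n"
BOOTSTRAP = "<think>\nHere's a thinking process to solve the problem:"

def build_prompt(user_msg: str, mode: str = "skip") -> str:
    """mode: 'skip' injects an empty think block to suppress overthinking,
    'force' injects a bootstrap think block, 'auto' lets the model decide."""
    p = f"<|im_start|>user\n{user_msg}<|im_end|>\n<|im_start|>assistant\n"
    if mode == "skip":
        p += THINK_START + THINK_STOP
    elif mode == "force":
        p += BOOTSTRAP
    return p

print(build_prompt("What is 2+2?", mode="skip"))
```

Generation then continues from the injected text, so in "skip" mode the model goes straight to the final answer.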

The model was found to be capable on reasoning tasks when skipping the think block, with little to no overthinking: direct deductions to the final solutions. When thinking with greedy sampling, the model will fall into infinite repetition loops from time to time. This matches the behaviour of other Qwen3 thinkers, which have trouble with infinite repeats under greedy sampling, particularly at smaller model sizes (<10B params).

The model was tested in vision mode on a couple of fairly difficult bird-ID images and did OK, identifying 1 of the 2 correctly.

The model was tested across a small set of code generation prompts and was unable to generate working code on all of them.

The minimum llama.cpp version to run Qwen3.5-4B is b8148, due to the correction of a graph error in that release.

Benchmarks:

A full set of both math and vision benchmarks for the model will eventually be given here: https://huggingface.co/spaces/steampunque/benchlm

Download the files below:

| Link | Type | Size | Notes |
|------|------|------|-------|
| Qwen3.5-4B.Q6_K_H.gguf | Q6_K_H | 3.4e9 B | same size as Q6_K |
| Qwen3.5-4B.mmproj.gguf | F16 | 0.67e9 B | multimedia projector |
| Qwen3.5-4B.mmproj.BF16.gguf | BF16 | 0.68e9 B | multimedia projector |

A discussion thread about the hybrid layer quant approach can be found here on the llama.cpp git repository:

https://github.com/ggml-org/llama.cpp/discussions/13040
