Mixed Precision GGUF layer quantization of Qwen3.5-4B by Qwen
Original model: https://huggingface.co/Qwen/Qwen3.5-4B
The mixed-precision quant applies a different quantization level to each layer, achieving both high performance and small file size at the same time. All quants used are K-quants, avoiding the slow CPU and older-GPU processing of IQ quants. For this file the layer quants are as follows:
```
Q5_K_L : attn_v = q8_0  attn_o = q6_k  ffn_d = q6_k
Q6_K_S : Q6_K
Q6_K_M : attn_v = q8_0  ffn_d = q8_0
Q6_K_L : attn_v = q8_0  attn_o = q8_0  ffn_d = q8_0
```
```
LAYER_TYPES='[
[0 ,"Q6_K_L"], [1 ,"Q6_K_M"], [2 ,"Q6_K_S"], [3 ,"Q5_K_L"], [4 ,"Q5_K_M"], [5 ,"Q5_K_M"], [6 ,"Q5_K_M"], [7 ,"Q5_K_M"],
[8 ,"Q5_K_L"], [9 ,"Q5_K_M"], [10,"Q5_K_L"], [11,"Q5_K_M"], [12,"Q5_K_L"], [13,"Q5_K_M"], [14,"Q5_K_M"], [15,"Q5_K_M"],
[16,"Q6_K_M"], [17,"Q6_K_S"], [18,"Q6_K_M"], [19,"Q6_K_S"], [20,"Q6_K_M"], [21,"Q6_K_M"], [22,"Q6_K_M"], [23,"Q6_K_M"],
[24,"Q6_K_L"], [25,"Q6_K_M"], [26,"Q6_K_L"], [27,"Q6_K_M"], [28,"Q6_K_L"], [29,"Q6_K_L"], [30,"Q6_K_L"], [31,"Q6_K_L"]
]'
FLAGS="--token-embedding-type Q6_K --output-tensor-type Q6_K --layer-types-high"
```
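As a sketch only: the spec above would be applied with llama-quantize. The per-layer LAYER_TYPES handling and the --layer-types-high flag come from a patched llama.cpp build, not mainline, and how the patch consumes LAYER_TYPES (an environment variable here) is an assumption; the file names are illustrative.

```shell
# Sketch: producing the mixed-precision file. Assumes a llama.cpp build patched
# to read LAYER_TYPES and accept --layer-types-high (not in mainline llama.cpp);
# file names are illustrative.
LAYER_TYPES='[ [0 ,"Q6_K_L"], [1 ,"Q6_K_M"], ... ]'   # full 32-layer list as given above
FLAGS="--token-embedding-type Q6_K --output-tensor-type Q6_K --layer-types-high"
./llama-quantize $FLAGS Qwen3.5-4B.BF16.gguf Qwen3.5-4B.Q6_K_H.gguf Q6_K
```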
The layer quants were optimized for strong performance on a curated set of reasoning prompts, with Q5_K_M as the minimum quant used on any layer.
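As a quick sanity check on the mix, the quant level of each of the 32 layers can be tallied straight from the LAYER_TYPES string above (pure shell + coreutils; a small sketch, not part of the quantization flow):

```shell
# Tally layers per quant level from the 32-layer spec above.
# LAYER_TYPES is copied verbatim from the listing earlier in this card.
LAYER_TYPES='[
[0 ,"Q6_K_L"], [1 ,"Q6_K_M"], [2 ,"Q6_K_S"], [3 ,"Q5_K_L"], [4 ,"Q5_K_M"], [5 ,"Q5_K_M"], [6 ,"Q5_K_M"], [7 ,"Q5_K_M"],
[8 ,"Q5_K_L"], [9 ,"Q5_K_M"], [10,"Q5_K_L"], [11,"Q5_K_M"], [12,"Q5_K_L"], [13,"Q5_K_M"], [14,"Q5_K_M"], [15,"Q5_K_M"],
[16,"Q6_K_M"], [17,"Q6_K_S"], [18,"Q6_K_M"], [19,"Q6_K_S"], [20,"Q6_K_M"], [21,"Q6_K_M"], [22,"Q6_K_M"], [23,"Q6_K_M"],
[24,"Q6_K_L"], [25,"Q6_K_M"], [26,"Q6_K_L"], [27,"Q6_K_M"], [28,"Q6_K_L"], [29,"Q6_K_L"], [30,"Q6_K_L"], [31,"Q6_K_L"]
]'
# extract every quant tag, then count occurrences of each
printf '%s' "$LAYER_TYPES" | grep -o 'Q[56]_K_[SML]' | sort | uniq -c
```

The tally comes out to 9 layers at the Q5_K_M floor, 4 at Q5_K_L, 3 at Q6_K_S, 9 at Q6_K_M, and 7 at Q6_K_L, confirming that no layer falls below Q5_K_M.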
Comparison:
| Quant | Size (bytes) | PPL | Comment |
|---|---|---|---|
| Q6_K | 3.5e9 | 9.7 | Q6_K with default embedding and output |
| Q6_K_H | 3.4e9 | 9.7 | Mixed precision quant with Q6_K embedding Q6_K output |
Usage:
Qwen3.5-4B is a vision-capable dense RL edge model. Used together with its multimodal projector layers, it can process image and text inputs and generate text outputs, while being sized for small, low-resource edge platforms. The mmproj file is available in this repository.
Update 3/18/26: the original mmproj had BF16 tensors. It is still available, unmodified, renamed to *.mmproj.BF16.gguf. A new F16 mmproj is now the default, to work across the widest range of platforms.
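As a sketch of vision use, llama.cpp's llama-mtmd-cli can load the model together with the projector; the image name and prompt here are illustrative, not from this card:

```shell
# Sketch: image + text inference using the multimodal projector via
# llama.cpp's llama-mtmd-cli; image name and prompt are illustrative.
./llama-mtmd-cli -m Qwen3.5-4B.Q6_K_H.gguf \
    --mmproj Qwen3.5-4B.mmproj.gguf \
    --image bird.jpg \
    -p "Identify the bird in this image."
```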
Speculative decoding does not work with this model due to the attention scheme it uses. On a 4070 with all layers and context in VRAM and no vision tower, approximate performance is:
| Quant | KV cache type | KV tokens | gen t/s |
|---|---|---|---|
| Q6_K_H | F16 | 240k | 84 |
| Q6_K_H | Q8_0 | 440k | 85 |
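The Q8_0 KV-cache row above can be reproduced in spirit with standard llama-server options; this is a sketch, the context size is illustrative, and flash-attention flag syntax varies across llama.cpp versions:

```shell
# Sketch: serve with all layers offloaded and a Q8_0-quantized KV cache
# (the 440k-token row above). A quantized V cache requires flash attention;
# on older llama.cpp builds the flag is a bare -fa with no argument.
./llama-server -m Qwen3.5-4B.Q6_K_H.gguf \
    -ngl 99 -fa on -c 440000 \
    -ctk q8_0 -ctv q8_0
```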
A long-context (needle-in-a-haystack) test was run and passed, with fast prompt processing, making large contexts actually usable with this model.
The model appears to be trained to decide for itself whether to emit a think block. When it does, it falls into very heavy overthinking, but it does arrive at accurate answers. Over a small set of eval prompts the model did extremely well. To avoid the overthinking, inject think-start and think-stop tokens immediately after the assistant prompt:
```
THINK_START="<think>\n"
THINK_STOP="\n</think>\n\n"
```
If the model does not feel like thinking on a given prompt, it will do this automatically. To force the model into a think block, inject a bootstrap think block following the assistant prompt:
```
"<think>\nHere's a thinking process to solve the problem:"
```
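One way to inject the bootstrap is to pre-seed the assistant turn through llama-server's raw /completion endpoint. This sketch assumes a server on localhost:8080 and the Qwen ChatML template; the user prompt is illustrative:

```shell
# Sketch: force a think block by pre-seeding the assistant turn. Assumes
# llama-server on localhost:8080 and Qwen's ChatML template; the \n escapes
# are expanded by the JSON parser on the server side.
PROMPT='<|im_start|>user\nHow many primes are there below 100?<|im_end|>\n<|im_start|>assistant\n<think>\nHere'\''s a thinking process to solve the problem:'
curl -s http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d "{\"prompt\": \"$PROMPT\", \"n_predict\": 2048}"
```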
The model was found to be capable on reasoning tasks when the think block is skipped, with little to no overthinking, just direct deduction to a final solution. When thinking with greedy sampling, the model will fall into infinite repetition loops from time to time. This is similar to other Qwen3 thinkers, which have trouble with infinite repeats under greedy sampling, particularly at smaller model sizes (<10B params) and quants.
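A common workaround for such repeat loops is light stochastic sampling instead of greedy decoding; the values in this sketch are illustrative, not tuned recommendations from this card:

```shell
# Sketch: non-greedy sampling settings to reduce infinite repeat loops when
# thinking; sampling values are illustrative, not recommendations.
./llama-cli -m Qwen3.5-4B.Q6_K_H.gguf -ngl 99 \
    --temp 0.6 --top-p 0.95 --top-k 20 \
    -p "Solve: if 3x + 5 = 20, what is x?"
```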
The model was tested in vision mode on a couple of fairly tough bird ID images and did OK, identifying 1 of the 2 correctly.
The model was tested across a small set of code-gen prompts and was not able to generate working code on all of the test prompts.
The minimum llama.cpp version to run Qwen3.5-4B should be b8148, due to the correction of a graph error.
Benchmarks:
A full set of both math and vision benchmarks for the model will eventually be given here: https://huggingface.co/spaces/steampunque/benchlm
Download the file from below:
| Link | Type | Size (bytes) | Notes |
|---|---|---|---|
| Qwen3.5-4B.Q6_K_H.gguf | Q6_K_H | 3.4e9 | same size as Q6_K |
| Qwen3.5-4B.mmproj.gguf | F16 | 0.67e9 | multimodal projector |
| Qwen3.5-4B.mmproj.BF16.gguf | BF16 | 0.68e9 | multimodal projector |
A discussion thread about the hybrid layer quant approach can be found here on the llama.cpp git repository: