---
license: apache-2.0
base_model: Qwen/Qwen3.5-4B
base_model_relation: quantized
tags:
- Qwen3.5 4B
- GGUF
- quantized
- 6-bit
- mixed precision
---
## Mixed Precision GGUF layer quantization of Qwen3.5-4B by Qwen
Original model: https://huggingface.co/Qwen/Qwen3.5-4B
This mixed precision quant applies different quantization levels on a per-layer basis to achieve
both high performance and a small file size. All quants used are K-quants, avoiding the slow CPU
and older-GPU processing associated with IQ quants. For this file the layer quants are as follows:
```
Q5_K_L : attn_v = q8_0 attn_o = q6_k ffn_d = q6_k
Q6_K_S : Q6_K
Q6_K_M : attn_v = q8_0 ffn_d = q8_0
Q6_K_L : attn_v = q8_0 attn_o = q8_0 ffn_d = q8_0
LAYER_TYPES='[
[0 ,"Q6_K_L"], [1 ,"Q6_K_M"], [2 ,"Q6_K_S"], [3 ,"Q5_K_L"], [4 ,"Q5_K_M"], [5 ,"Q5_K_M"], [6 ,"Q5_K_M"], [7 ,"Q5_K_M"],
[8 ,"Q5_K_L"], [9 ,"Q5_K_M"], [10,"Q5_K_L"], [11,"Q5_K_M"], [12,"Q5_K_L"], [13,"Q5_K_M"], [14,"Q5_K_M"], [15,"Q5_K_M"],
[16,"Q6_K_M"], [17,"Q6_K_S"], [18,"Q6_K_M"], [19,"Q6_K_S"], [20,"Q6_K_M"], [21,"Q6_K_M"], [22,"Q6_K_M"], [23,"Q6_K_M"],
[24,"Q6_K_L"], [25,"Q6_K_M"], [26,"Q6_K_L"], [27,"Q6_K_M"], [28,"Q6_K_L"], [29,"Q6_K_L"], [30,"Q6_K_L"], [31,"Q6_K_L"]
]'
FLAGS="--token-embedding-type Q6_K --output-tensor-type Q6_K --layer-types-high"
```
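As a hedged sketch, these settings might be applied along the following lines. Per-layer quant types are not a mainline llama-quantize feature; the invocation assumes the patched build from the hybrid layer quant discussion linked at the bottom of this card, and how that build consumes LAYER_TYPES is inferred from the shell syntax above:
```
# Assumed invocation of a llama-quantize build carrying the hybrid layer
# quant patch; --token-embedding-type / --output-tensor-type are standard
# llama-quantize options, the layer-type handling is patch-specific and
# assumed here, not a mainline flag.
export LAYER_TYPES          # per-layer map exactly as listed above
llama-quantize --token-embedding-type Q6_K --output-tensor-type Q6_K \
  --layer-types-high \
  Qwen3.5-4B.BF16.gguf Qwen3.5-4B.Q6_K_H.gguf Q6_K
```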
The layer quants were optimized for strong performance on a set of curated reasoning prompts, with
Q5_K_M as the minimum quant used on any layer.
Comparison:

| Quant | Size/e9 B | PPL | Comment |
|--------|-----------|-----|---------|
| Q6_K   | 3.5       | 9.7 | Q6_K with default embedding and output |
| Q6_K_H | 3.4       | 9.7 | mixed precision quant with Q6_K embedding and Q6_K output |
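The PPL figures above can be reproduced with the stock llama-perplexity tool; the corpus file below is a placeholder, as the card does not state which text was used:
```
# Perplexity measurement; wiki.test.raw is a placeholder corpus.
llama-perplexity -m Qwen3.5-4B.Q6_K_H.gguf -f wiki.test.raw -ngl 99
```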
Usage:
Qwen3.5-4B is a vision-capable dense RL edge model. Used together with its multimedia projector layers it can process image and text inputs
and generate text outputs, and it is sized for applications on small/low-resource edge platforms. The mmproj file is made available in this repository.
Update 3/18/26: the original mmproj used BF16 tensors. That file is still available, unmodified, renamed to *.mmproj.BF16.gguf. A new F16 mmproj
is now the default so the projector works across the widest range of platforms.
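A minimal usage sketch with llama.cpp's multimodal CLI; the file names match this repository, while the image and prompt are illustrative:
```
# Image + text inference using the F16 projector from this repo.
llama-mtmd-cli -m Qwen3.5-4B.Q6_K_H.gguf \
  --mmproj Qwen3.5-4B.mmproj.gguf \
  --image bird.jpg \
  -p "Identify the bird species in this image."
```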
Speculative decoding does not work with this model due to the attention scheme it uses.
On a 4070, with all layers and context in VRAM and no vision tower loaded, approximate performance is:

| Quant | KV cache type | KV context | gen t/s |
|--------|---------------|------------|---------|
| Q6_K_H | F16  | 240k | 84 |
| Q6_K_H | Q8_0 | 440k | 85 |
A long context (needle in haystack) test passed with fast prompt processing, making large context actually usable with the model.
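For example, the Q8_0 row above might correspond to a launch along these lines; the exact context value and offload count are assumptions matching the test setup, and a quantized V cache may additionally require flash attention depending on the llama.cpp build:
```
# All layers on GPU, quantized KV cache, ~440k context (Q8_0 row above).
llama-server -m Qwen3.5-4B.Q6_K_H.gguf -ngl 99 -c 440000 \
  -ctk q8_0 -ctv q8_0
```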
The model appears to be trained to decide for itself whether to emit a think block. When it does think it falls into very
heavy overthinking, but it does arrive at accurate answers, and over a small set of eval prompts it did extremely well. To avoid
the overthinking, inject think start and think stop tokens immediately after the assistant prompt:
```
THINK_START="<think>\n"
THINK_STOP="\n</think>\n\n"
```
This empty think block is what the model emits on its own when it decides a given prompt does not need thinking. To force the model into
a think block, inject a bootstrap think opening following the assistant prompt:
```
"\nHere's a thinking process to solve the problem:"
```
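As a concrete sketch, here is one way the injection can be wired up against a running llama-server instance. The ChatML-style template and the `<think>` tags are assumptions based on other Qwen3 models; the endpoint and fields are the stock llama-server completion API:
```
# Suppress thinking by closing an empty think block right after the
# assistant tag (template details are an assumption, not from this card).
PROMPT=$'<|im_start|>user\nWhat is 17*23?<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n'
curl -s http://localhost:8080/completion \
  -d "$(jq -n --arg p "$PROMPT" '{prompt: $p, n_predict: 256}')"
```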
The model was found to be capable on reasoning tasks when the think block is skipped, with little to no overthinking, just
direct deductions to final solutions. When thinking with greedy sampling the model will go into infinite repetition loops
from time to time. This is similar behaviour to other Qwen3 thinkers, which have trouble with infinite repeats under
greedy sampling, particularly at smaller model sizes (<10B params).
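When the loops do appear, light sampling randomness plus a repeat penalty usually breaks them; the values below are generic starting points, not settings tuned for this model:
```
# Non-greedy sampling to reduce infinite repetition during think blocks.
llama-cli -m Qwen3.5-4B.Q6_K_H.gguf -ngl 99 \
  --temp 0.6 --top-p 0.95 --repeat-penalty 1.1 \
  -p "Prove that the sum of two odd numbers is even."
```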
The model was tested in vision mode on a couple of fairly tough bird ID images and did OK, identifying 1 of the 2 correctly.
The model was tested across a small set of code gen prompts and was not able to generate working code for all of the test prompts.
The minimum llama.cpp version to run Qwen3.5-4B is b8148, which includes the fix for a graph error.
Benchmarks:
A full set of both math and vision benchmarks for the model will eventually be posted here: https://huggingface.co/spaces/steampunque/benchlm
## Download the files below:
| Link | Type | Size/e9 B | Notes |
|------|------|-----------|-------|
| [Qwen3.5-4B.Q6_K_H.gguf](https://huggingface.co/steampunque/Qwen3.5-4B-MP-GGUF/resolve/main/Qwen3.5-4B.Q6_K_H.gguf) | Q6_K_H | 3.4 | same size as Q6_K |
| [Qwen3.5-4B.mmproj.gguf](https://huggingface.co/steampunque/Qwen3.5-4B-MP-GGUF/resolve/main/Qwen3.5-4B.mmproj.gguf) | F16 | 0.67 | multimedia projector (default) |
| [Qwen3.5-4B.mmproj.BF16.gguf](https://huggingface.co/steampunque/Qwen3.5-4B-MP-GGUF/resolve/main/Qwen3.5-4B.mmproj.BF16.gguf) | BF16 | 0.68 | multimedia projector (original BF16) |
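The files can be pulled directly with huggingface-cli (any HTTP client against the resolve URLs above works as well):
```
# Fetch the main quant and the default F16 projector.
huggingface-cli download steampunque/Qwen3.5-4B-MP-GGUF \
  Qwen3.5-4B.Q6_K_H.gguf Qwen3.5-4B.mmproj.gguf --local-dir .
```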
A discussion thread about the hybrid layer quant approach can be found here on the llama.cpp git repository:
https://github.com/ggml-org/llama.cpp/discussions/13040