# Qwen3.5-397B-A17B — MLX VL 4-bit
First clean MLX Vision-Language conversion of Qwen3.5-397B-A17B. Full multimodal — text + image understanding. Runs on Apple Silicon with 256GB+ unified memory.
Existing MLX community conversions of this model stripped the vision encoder by using `mlx_lm.convert`. This conversion uses `mlx_vlm.convert`, preserving the full vision tower (420M params, 27 blocks) alongside the quantized language model.
## Key Specs
| Spec | Value |
|---|---|
| Architecture | Qwen3.5 MoE — 60 layers, 4096 hidden, 512 experts, 17B active/token |
| Vision encoder | 27-block ViT, 420M params, fp16 (unquantized) |
| Quantization | 4-bit language model, 8-bit MoE routing gates, fp16 vision encoder |
| Average bits/weight | 4.513 |
| Total size | 209 GB (46 safetensor shards) |
| Generation speed | 33–36 tok/s (M3 Ultra 512GB) |
| Prompt processing | ~104 tok/s (with image) |
| Peak RAM | 224.4 GB |
| Context length | 262,144 tokens |
| Languages | 201 |
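As a quick sanity check, the size figures in the table are mutually consistent: 397B total parameters at an average of 4.513 bits per weight works out to roughly the reported shard total (in GiB) and peak RAM (in decimal GB). A back-of-envelope sketch:

```python
# Sanity check: total params x average bits/weight -> on-disk / in-memory size.
total_params = 397e9          # 397B total parameters (MoE, all experts resident)
avg_bits_per_weight = 4.513   # from the table above

total_bytes = total_params * avg_bits_per_weight / 8
print(f"{total_bytes / 1e9:.1f} GB  ({total_bytes / 2**30:.1f} GiB)")
# ~224 GB decimal / ~208.6 GiB binary — matching peak RAM and the 209 GB shard total
```

This suggests the 209 GB figure is binary gibibytes while the 224.4 GB peak RAM is decimal gigabytes; both fall out of the same bits-per-weight average.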
## Usage

Requires `mlx-vlm >= 0.3.12`:

```shell
pip install mlx-vlm
```
### Text generation

```python
from mlx_vlm import load, generate

model, processor = load("RockTalk/Qwen3.5-397B-A17B-mlx-vlm-4bit")

messages = [{"role": "user", "content": [{"type": "text", "text": "Explain quantum entanglement simply."}]}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
output = generate(model, processor, prompt, max_tokens=300)
print(output)
```
### Image understanding

```python
from mlx_vlm import load, generate

model, processor = load("RockTalk/Qwen3.5-397B-A17B-mlx-vlm-4bit")

messages = [{"role": "user", "content": [
    {"type": "image", "image": "photo.jpg"},
    {"type": "text", "text": "Describe this image in detail."},
]}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
output = generate(model, processor, prompt, image="photo.jpg", max_tokens=300)
print(output)
```
### Thinking mode

Remove `enable_thinking=False` to enable the model's chain-of-thought reasoning; the reasoning tokens are emitted inside `<think>` tags before the final answer.
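With thinking enabled, the raw output contains the reasoning block inline. A minimal post-processing sketch for separating it from the answer (assumes at most one `<think>…</think>` pair at the start of the output, which is the typical shape of Qwen-style thinking output; the helper name is our own):

```python
def split_thinking(raw: str) -> tuple[str, str]:
    """Split model output into (reasoning, answer).

    Assumes at most one <think>...</think> block; anything outside it
    is treated as the final answer.
    """
    start, end = "<think>", "</think>"
    if start in raw and end in raw:
        head, _, rest = raw.partition(start)
        thinking, _, answer = rest.partition(end)
        return thinking.strip(), (head + answer).strip()
    return "", raw.strip()

reasoning, answer = split_thinking("<think>2+2 is 4</think>The answer is 4.")
```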
## Conversion Details

- Source model: `Qwen/Qwen3.5-397B-A17B` (bf16, 807 GB)
- Conversion tool: `mlx_vlm.convert` v0.3.12
- Conversion date: March 17, 2026

```shell
python3 -m mlx_vlm.convert \
  --hf-path Qwen/Qwen3.5-397B-A17B \
  --mlx-path ./Qwen3.5-397B-A17B-VL-4bit-MLX \
  --quantize --q-bits 4
```
### Critical: bfloat16 scales, not float16

Do **not** pass `--dtype float16`. The model's native dtype is bfloat16. Forcing float16 causes the quantization scales to overflow in the MoE layers, producing NaN logits and garbage output (all `!` characters — token ID 0). Let mlx_vlm use the model's native bfloat16 dtype for the scales.

This was the single hardest bug in this conversion, and it is why no clean VL version existed before. The text-only mlx-community conversions work because they were converted with bfloat16 scales. Attempting the same conversion with `--dtype float16` produces a model that loads and passes config checks but outputs nothing useful.
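The failure mode is easy to reproduce in isolation: float16 tops out at 65504, while bfloat16 shares float32's 8-bit exponent and reaches ~3.4e38, so a scale that is perfectly legal in bfloat16 overflows to `inf` the moment it is cast down. A sketch using NumPy (which has no bfloat16 type, so float32 stands in for the bf16 side):

```python
import numpy as np

# bfloat16 has the same exponent range as float32, so a quantization scale
# derived from large-magnitude MoE weights can exceed float16's max (65504).
bf16_like_scale = np.float32(1e5)        # representable in bfloat16

f16_scale = np.float16(bf16_like_scale)  # what a forced --dtype float16 cast does
print(f16_scale)                         # inf
print(np.float16(1.0) / f16_scale)       # 0.0 -> dequantized weights collapse
```

Once scales are `inf`, dequantized weights collapse to zero and attention/logit arithmetic quickly produces NaNs downstream, which matches the all-`!` (token ID 0) output described above.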
### TokenizersBackend patch

Qwen 3.5 ships with `"tokenizer_class": "TokenizersBackend"`, which requires `transformers >= 5.0`. This has been patched to `"PreTrainedTokenizerFast"` in the included `tokenizer_config.json` so it works with current versions.
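The patch itself is a one-line JSON edit. It is already applied in this repo's `tokenizer_config.json`; the sketch below (our own helper, not part of mlx_vlm) is only needed if you re-convert from the upstream bf16 weights yourself:

```python
import json
from pathlib import Path

def patch_tokenizer_class(config_path: Path) -> bool:
    """Rewrite TokenizersBackend -> PreTrainedTokenizerFast in tokenizer_config.json.

    Returns True if a change was made, False if the file was already patched.
    """
    cfg = json.loads(config_path.read_text())
    if cfg.get("tokenizer_class") == "TokenizersBackend":
        cfg["tokenizer_class"] = "PreTrainedTokenizerFast"
        config_path.write_text(json.dumps(cfg, indent=2, ensure_ascii=False))
        return True
    return False
```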
### What's preserved

- 333 vision weights — full vision tower + `vl_connector` in fp16
- 2,632 language weights — 4-bit quantized, with 8-bit MoE routing gates
- All config files, chat template, processor configs, video preprocessor config
## Why existing conversions miss the vision encoder

The original `Qwen/Qwen3.5-397B-A17B` **is** multimodal (tagged image-text-to-text). But the mlx-community conversions used `mlx_lm.convert`, which only handles text models — it silently drops all `model.visual.*` weights. Using `mlx_vlm.convert` instead invokes the qwen3_5 model handler, which:

- Remaps `model.visual.*` → `vision_tower.*`
- Remaps `model.language_model.*` → `language_model.model.*`
- Skips vision weights during quantization (keeps them in fp16)
- Quantizes MoE gates at 8-bit for routing precision
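The first three steps above can be sketched as pure functions over checkpoint key names (illustrative only — the real logic lives inside mlx_vlm's qwen3_5 handler, and these helper names are our own):

```python
def remap_key(key: str) -> str:
    """Map upstream HF checkpoint keys to the mlx_vlm layout (illustrative)."""
    if key.startswith("model.visual."):
        return key.replace("model.visual.", "vision_tower.", 1)
    if key.startswith("model.language_model."):
        return key.replace("model.language_model.", "language_model.model.", 1)
    return key

def should_quantize(key: str) -> bool:
    """Vision-tower weights stay fp16; language weights are eligible for 4-bit."""
    return not key.startswith("vision_tower.")
```

A text-only converter that has no `model.visual.*` mapping at all simply skips those keys — which is exactly how the earlier conversions lost the vision encoder.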
## Hardware Requirements
| Config | RAM | Notes |
|---|---|---|
| M3/M4 Ultra 512GB | 224 GB | Comfortable, room for other apps |
| M3/M4 Ultra 256GB | 224 GB | Tight but works |
| M2 Ultra 192GB | — | Not enough RAM |
## License
Apache 2.0 — same as the base model.
## Acknowledgments
- Qwen team for the base model
- mlx-vlm for the conversion framework
- RepublicOfKorokke for proving the 35B VL conversion pipeline
Converted by RockTalk on Apple Silicon (M3 Ultra 512GB).