Qwen3.5-397B-A17B — MLX VL 4-bit

First clean MLX Vision-Language conversion of Qwen3.5-397B-A17B. Full multimodal — text + image understanding. Runs on Apple Silicon with 256GB+ unified memory.

Existing MLX community conversions of this model stripped the vision encoder by using mlx_lm.convert. This conversion uses mlx_vlm.convert, preserving the full vision tower (420M params, 27 blocks) alongside the quantized language model.

Key Specs

| Spec | Value |
|------|-------|
| Architecture | Qwen3.5 MoE — 60 layers, 4096 hidden dim, 512 experts, 17B active params/token |
| Vision encoder | 27-block ViT, 420M params, fp16 (unquantized) |
| Quantization | 4-bit language model, 8-bit MoE routing gates, fp16 vision encoder |
| Average bits/weight | 4.513 |
| Total size | 209 GB (46 safetensors shards) |
| Generation speed | 33–36 tok/s (M3 Ultra 512GB) |
| Prompt processing | ~104 tok/s (with image) |
| Peak RAM | 224.4 GB |
| Context length | 262,144 tokens |
| Languages | 201 |
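As a quick sanity check, the reported average bits/weight and the 397B total parameter count roughly reproduce the on-disk size (treating "209 GB" as GiB; the fp16 vision tower is a small fraction of the total):

```python
# Back-of-envelope check relating the spec-table numbers: total params
# times average bits per weight should approximate the on-disk size.
params = 397e9        # total parameters (397B)
avg_bits = 4.513      # reported average bits per weight
size_gib = params * avg_bits / 8 / 2**30
print(round(size_gib))  # -> 209
```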

Usage

Requires mlx-vlm >= 0.3.12:

pip install "mlx-vlm>=0.3.12"

Text generation

from mlx_vlm import load, generate

model, processor = load("RockTalk/Qwen3.5-397B-A17B-mlx-vlm-4bit")

messages = [{"role": "user", "content": [{"type": "text", "text": "Explain quantum entanglement simply."}]}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
output = generate(model, processor, prompt, max_tokens=300)
print(output)

Image understanding

from mlx_vlm import load, generate

model, processor = load("RockTalk/Qwen3.5-397B-A17B-mlx-vlm-4bit")

messages = [{"role": "user", "content": [
    {"type": "image", "image": "photo.jpg"},
    {"type": "text", "text": "Describe this image in detail."}
]}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
output = generate(model, processor, prompt, image="photo.jpg", max_tokens=300)
print(output)

Thinking mode

Remove enable_thinking=False to enable the model's chain-of-thought reasoning: the reasoning tokens are emitted inside <think> tags before the final answer.
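With thinking enabled, you may want to separate the reasoning from the answer. A minimal post-processing sketch, assuming the output wraps its chain of thought in `<think>...</think>` (the sample string below is made up for illustration):

```python
import re

# Split a thinking-mode response into its chain of thought and final answer.
raw = "<think>The user wants a one-line answer.</think>Entanglement correlates two particles."
match = re.match(r"<think>(.*?)</think>(.*)", raw, flags=re.DOTALL)
if match:
    thoughts, answer = match.groups()
    print(answer.strip())  # -> Entanglement correlates two particles.
```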

Conversion Details

  • Source model: Qwen/Qwen3.5-397B-A17B (bf16, 807 GB)
  • Conversion tool: mlx_vlm.convert v0.3.12
  • Conversion date: March 17, 2026

Conversion command:

python3 -m mlx_vlm.convert \
    --hf-path Qwen/Qwen3.5-397B-A17B \
    --mlx-path ./Qwen3.5-397B-A17B-VL-4bit-MLX \
    --quantize --q-bits 4

Critical: bfloat16 scales, not float16

Do NOT pass --dtype float16. The model's native dtype is bfloat16. Forcing float16 causes the quantization scales to overflow in the MoE layers, producing NaN logits and garbage output (a stream of ! characters, token ID 0). Let mlx_vlm keep the model's native bfloat16 for the scales.

This was the single hardest bug in this conversion, and the reason no clean VL version existed before. The text-only mlx-community conversions work because they were made with bfloat16 scales. Attempting the same conversion with --dtype float16 produces a model that loads and passes config checks but outputs nothing useful.
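The failure mode comes down to exponent range: float16 has a 5-bit exponent (max representable value ~65504), while bfloat16 shares float32's 8-bit exponent (max ~3.4e38). A small numpy sketch with a hypothetical scale value:

```python
import numpy as np

# A per-group quantization scale larger than float16's max (~65504)
# silently overflows to inf, which then propagates as NaN downstream.
scale = np.float32(1.2e5)   # hypothetical large MoE scale value
as_f16 = np.float16(scale)  # overflows float16 range
print(as_f16)               # -> inf
print(scale)                # fits comfortably in bfloat16/float32 range
```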

TokenizersBackend patch

Qwen 3.5 ships with "tokenizer_class": "TokenizersBackend", which requires transformers >= 5.0. This has been patched to "PreTrainedTokenizerFast" in the included tokenizer_config.json so the model works with current transformers releases.
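The patch itself is a one-line JSON edit. A sketch (the shipped tokenizer_config.json already contains the fix; this is demonstrated on a throwaway file rather than the real config):

```python
import json
import os
import tempfile

# Illustrative one-off patch of the tokenizer_class field.
cfg = {"tokenizer_class": "TokenizersBackend"}  # as shipped upstream
if cfg.get("tokenizer_class") == "TokenizersBackend":
    cfg["tokenizer_class"] = "PreTrainedTokenizerFast"

path = os.path.join(tempfile.mkdtemp(), "tokenizer_config.json")
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
```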

What's preserved

  • 333 vision weights — full vision tower + vl_connector in fp16
  • 2,632 language weights — 4-bit quantized with 8-bit MoE routing gates
  • All config files, chat template, processor configs, video preprocessor config

Why existing conversions miss the vision encoder

The original Qwen/Qwen3.5-397B-A17B IS multimodal (tagged image-text-to-text). But mlx-community conversions used mlx_lm.convert which only handles text models — it silently drops all model.visual.* weights. Using mlx_vlm.convert instead invokes the qwen3_5 model handler which:

  1. Remaps model.visual.* → vision_tower.*
  2. Remaps model.language_model.* → language_model.model.*
  3. Skips vision weights during quantization (keeps them in fp16)
  4. Quantizes MoE gates at 8-bit for routing precision
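The remapping and quantization policy above can be sketched as follows. This is illustrative, not mlx_vlm's actual internals; the function names and the ".gate." substring match for MoE routing gates are assumptions:

```python
# Sketch of the key remapping (steps 1-2) and per-weight quantization
# policy (steps 3-4) described above. Hypothetical helper names.
def remap_key(key: str) -> str:
    if key.startswith("model.visual."):
        return "vision_tower." + key[len("model.visual."):]
    if key.startswith("model.language_model."):
        return "language_model.model." + key[len("model.language_model."):]
    return key

def quant_bits(key: str):
    if key.startswith("vision_tower."):
        return None  # vision weights stay fp16
    if ".gate." in key:
        return 8     # MoE routing gates quantized at 8-bit
    return 4         # everything else at 4-bit

print(remap_key("model.visual.blocks.0.attn.qkv.weight"))
# -> vision_tower.blocks.0.attn.qkv.weight
```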

Hardware Requirements

| Config | RAM | Notes |
|--------|-----|-------|
| M3/M4 Ultra 512GB | 224 GB | Comfortable, room for other apps |
| M3/M4 Ultra 256GB | 224 GB | Tight but works |
| M2 Ultra 192GB | — | Not enough RAM |

License

Apache 2.0 — same as the base model.

Acknowledgments

Converted by RockTalk on Apple Silicon (M3 Ultra 512GB).
