Qwen3.5-397B-A17B — MLX VL 4-bit

First clean MLX Vision-Language conversion of Qwen3.5-397B-A17B. Full multimodal — text + image understanding. Runs on Apple Silicon with 256GB+ unified memory.

Existing MLX community conversions of this model stripped the vision encoder by using mlx_lm.convert. This conversion uses mlx_vlm.convert, preserving the full vision tower (420M params, 27 blocks) alongside the quantized language model.

Key Specs

| Spec | Value |
|------|-------|
| Architecture | Qwen3.5 MoE — 60 layers, 4096 hidden dim, 512 experts, 17B active params/token |
| Vision encoder | 27-block ViT, 420M params, fp16 (unquantized) |
| Quantization | 4-bit language model, 8-bit MoE routing gates, fp16 vision encoder |
| Average bits/weight | 4.513 |
| Total size | 209 GB (46 safetensors shards) |
| Generation speed | 33–36 tok/s (M3 Ultra 512GB) |
| Prompt processing | ~104 tok/s (with image) |
| Peak RAM | 224.4 GB |
| Context length | 262,144 tokens |
| Languages | 201 |
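As a quick sanity check, the reported average bits/weight and the 397B total parameter count roughly reproduce the on-disk size (treating "209 GB" as GiB; the fp16 vision tower is a small fraction of the total):

```python
# Back-of-envelope check relating the spec-table numbers: total params
# times average bits per weight should approximate the on-disk size.
params = 397e9        # total parameters (397B)
avg_bits = 4.513      # reported average bits per weight
size_gib = params * avg_bits / 8 / 2**30
print(round(size_gib))  # -> 209
```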

Usage

Requires mlx-vlm >= 0.3.12:

pip install "mlx-vlm>=0.3.12"

Text generation

from mlx_vlm import load, generate

model, processor = load("RockTalk/Qwen3.5-397B-A17B-mlx-vlm-4bit")

messages = [{"role": "user", "content": [{"type": "text", "text": "Explain quantum entanglement simply."}]}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
output = generate(model, processor, prompt, max_tokens=300)
print(output)

Image understanding

from mlx_vlm import load, generate

model, processor = load("RockTalk/Qwen3.5-397B-A17B-mlx-vlm-4bit")

messages = [{"role": "user", "content": [
    {"type": "image", "image": "photo.jpg"},
    {"type": "text", "text": "Describe this image in detail."}
]}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
output = generate(model, processor, prompt, image="photo.jpg", max_tokens=300)
print(output)

Thinking mode

Remove enable_thinking=False to enable the model's chain-of-thought reasoning: the reasoning tokens are emitted inside <think> tags before the final answer.
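With thinking enabled, you may want to separate the reasoning from the answer. A minimal post-processing sketch, assuming the output wraps its chain of thought in `<think>...</think>` (the sample string below is made up for illustration):

```python
import re

# Split a thinking-mode response into its chain of thought and final answer.
raw = "<think>The user wants a one-line answer.</think>Entanglement correlates two particles."
match = re.match(r"<think>(.*?)</think>(.*)", raw, flags=re.DOTALL)
if match:
    thoughts, answer = match.groups()
    print(answer.strip())  # -> Entanglement correlates two particles.
```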

Conversion Details

  • Source model: Qwen/Qwen3.5-397B-A17B (bf16, 807 GB)
  • Conversion tool: mlx_vlm.convert v0.3.12
  • Conversion date: March 17, 2026

Conversion command:

python3 -m mlx_vlm.convert \
    --hf-path Qwen/Qwen3.5-397B-A17B \
    --mlx-path ./Qwen3.5-397B-A17B-VL-4bit-MLX \
    --quantize --q-bits 4

Critical: bfloat16 scales, not float16

Do NOT pass --dtype float16. The model's native dtype is bfloat16. Forcing float16 causes the quantization scales to overflow in the MoE layers, producing NaN logits and garbage output (a stream of ! characters, token ID 0). Let mlx_vlm keep the model's native bfloat16 for the scales.

This was the single hardest bug in this conversion, and the reason no clean VL version existed before. The text-only mlx-community conversions work because they were made with bfloat16 scales. Attempting the same conversion with --dtype float16 produces a model that loads and passes config checks but outputs nothing useful.
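The failure mode comes down to exponent range: float16 has a 5-bit exponent (max representable value ~65504), while bfloat16 shares float32's 8-bit exponent (max ~3.4e38). A small numpy sketch with a hypothetical scale value:

```python
import numpy as np

# A per-group quantization scale larger than float16's max (~65504)
# silently overflows to inf, which then propagates as NaN downstream.
scale = np.float32(1.2e5)   # hypothetical large MoE scale value
as_f16 = np.float16(scale)  # overflows float16 range
print(as_f16)               # -> inf
print(scale)                # fits comfortably in bfloat16/float32 range
```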

TokenizersBackend patch

Qwen 3.5 ships with "tokenizer_class": "TokenizersBackend", which requires transformers >= 5.0. This has been patched to "PreTrainedTokenizerFast" in the included tokenizer_config.json so the model works with current transformers releases.
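The patch itself is a one-line JSON edit. A sketch (the shipped tokenizer_config.json already contains the fix; this is demonstrated on a throwaway file rather than the real config):

```python
import json
import os
import tempfile

# Illustrative one-off patch of the tokenizer_class field.
cfg = {"tokenizer_class": "TokenizersBackend"}  # as shipped upstream
if cfg.get("tokenizer_class") == "TokenizersBackend":
    cfg["tokenizer_class"] = "PreTrainedTokenizerFast"

path = os.path.join(tempfile.mkdtemp(), "tokenizer_config.json")
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
```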

What's preserved

  • 333 vision weights — full vision tower + vl_connector in fp16
  • 2,632 language weights — 4-bit quantized with 8-bit MoE routing gates
  • All config files, chat template, processor configs, video preprocessor config

Why existing conversions miss the vision encoder

The original Qwen/Qwen3.5-397B-A17B IS multimodal (tagged image-text-to-text). But mlx-community conversions used mlx_lm.convert which only handles text models — it silently drops all model.visual.* weights. Using mlx_vlm.convert instead invokes the qwen3_5 model handler which:

  1. Remaps model.visual.* → vision_tower.*
  2. Remaps model.language_model.* → language_model.model.*
  3. Skips vision weights during quantization (keeps them in fp16)
  4. Quantizes MoE gates at 8-bit for routing precision
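The remapping and quantization policy above can be sketched as follows. This is illustrative, not mlx_vlm's actual internals; the function names and the ".gate." substring match for MoE routing gates are assumptions:

```python
# Sketch of the key remapping (steps 1-2) and per-weight quantization
# policy (steps 3-4) described above. Hypothetical helper names.
def remap_key(key: str) -> str:
    if key.startswith("model.visual."):
        return "vision_tower." + key[len("model.visual."):]
    if key.startswith("model.language_model."):
        return "language_model.model." + key[len("model.language_model."):]
    return key

def quant_bits(key: str):
    if key.startswith("vision_tower."):
        return None  # vision weights stay fp16
    if ".gate." in key:
        return 8     # MoE routing gates quantized at 8-bit
    return 4         # everything else at 4-bit

print(remap_key("model.visual.blocks.0.attn.qkv.weight"))
# -> vision_tower.blocks.0.attn.qkv.weight
```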

Hardware Requirements

| Config | RAM | Notes |
|--------|-----|-------|
| M3/M4 Ultra 512GB | 224 GB | Comfortable, room for other apps |
| M3/M4 Ultra 256GB | 224 GB | Tight but works |
| M2 Ultra 192GB | — | Not enough RAM |

License

Apache 2.0 — same as the base model.

Acknowledgments

Converted by RockTalk on Apple Silicon (M3 Ultra 512GB).
