Edit: with vLLM, use --language-model-only (I have not fully figured this one out yet).

Qwen3.5-397B-A17B — REAP 28% Pruned, NVFP4

A personal experiment in aggressive MoE pruning. The goal: fit Qwen3.5-397B on 2× 96GB Blackwell GPUs with usable KV cache (~90K tokens), without losing quality.

What this is

28% of experts removed using REAP (Router-weighted Expert Activation Pruning) with a saliency × activation-count ordering, then quantized to NVFP4 using llm-compressor. Final size: ~164GB.

Pruning is heterogeneous — each layer retains a different number of experts based on global importance ranking. Early layers (which carry more redundancy) are pruned more aggressively; late layers are barely touched. This maximizes quality retention for a given size budget.
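A minimal sketch of the global-ordering idea (with made-up scores and helper names, not the actual REAP implementation): score every expert in every layer by saliency × activation count, rank them all in one global list, and drop the bottom 28%. Because the cut is global rather than per-layer, each layer naturally ends up keeping a different number of experts.

```python
import random

random.seed(0)

NUM_LAYERS, EXPERTS_PER_LAYER, PRUNE_FRAC = 4, 8, 0.28

# Hypothetical importance scores: saliency * activation count per expert.
scores = {
    (layer, expert): random.random() * random.randint(1, 1000)
    for layer in range(NUM_LAYERS)
    for expert in range(EXPERTS_PER_LAYER)
}

# Rank all experts globally and drop the lowest-scoring 28%.
ranked = sorted(scores, key=scores.get)
n_prune = int(len(ranked) * PRUNE_FRAC)
pruned = set(ranked[:n_prune])

# Each layer retains a different number of experts (heterogeneous pruning).
kept_per_layer = [
    sum((layer, e) not in pruned for e in range(EXPERTS_PER_LAYER))
    for layer in range(NUM_LAYERS)
]
print(kept_per_layer)
```

With real saliency data the effect described above falls out automatically: redundant early layers accumulate low scores and lose more experts than late layers.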

Benchmark results (non-thinking, lm-eval-harness)

Benchmark | This model | Nvidia NVFP4 (full, ~240GB)
IFEval | 92.55 (avg) | 91.20
MMLU Redux (generative) | 90.94 | 91.24
GSM8K CoT (Llama) | 96.74 | 96.80

28% fewer experts, 30%+ smaller on disk, and benchmark scores within noise of the full model.

Requirements

  • vLLM ≥ 0.16.1 (nightly dev builds from the cu130 index work)
  • Transformers ≥ 5.3

vLLM patches required

This model uses variable expert counts per layer (not a fixed number), which stock vLLM doesn't support yet. Two files need patching — see the patches/ directory for detailed instructions:

  1. qwen3_next.py — read per-layer expert count instead of assuming a single global value
  2. qwen3_5.py — infer expert count from tensor shape during weight loading
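The second patch amounts to something like the following sketch (simplified, with hypothetical names; fused MoE weights stack experts along dim 0): trust the checkpoint tensor's shape over the single global value in the config.

```python
def infer_num_experts(weight_shape, config_num_experts):
    """Prefer the expert count baked into the fused weight tensor.

    For a stacked MoE weight of shape (num_experts, out_dim, in_dim),
    dim 0 is this layer's expert count, which heterogeneous pruning
    makes layer-dependent.
    """
    tensor_experts = weight_shape[0]
    if tensor_experts != config_num_experts:
        # Pruned layer: the checkpoint, not the global config, is authoritative.
        return tensor_experts
    return config_num_experts

# Example: config says 512 experts, but this layer's tensor holds only 368.
print(infer_num_experts((368, 2048, 4096), 512))  # prints 368
```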

Patches were tested on vLLM 0.16.1rc1.dev188 (nightly cu130).

Details

  • Base model: Qwen3.5-397B-A17B
  • Pruning: REAP with 1,188 curated calibration samples, saliency × count global ordering
  • Quantization: NVFP4 (FP4 E2M1, group size 16, duo scaling)
  • Target hardware: 2× RTX PRO 6000
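For intuition on the quantization format, here is a toy fake-quantizer for one 16-value group (illustration only: it uses a single float per-group scale, whereas the real NVFP4 scheme stores FP8 group scales plus a tensor-level scale, which is what "duo scaling" refers to). E2M1 has 1 sign, 2 exponent, and 1 mantissa bit, so only eight magnitudes are representable.

```python
# Representable FP4 E2M1 magnitudes (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_group(values):
    """Fake-quantize one group of 16 values to FP4 E2M1 with a shared scale."""
    # Scale so the group's largest magnitude maps to the top grid point, 6.
    scale = max(abs(v) for v in values) / 6.0 or 1.0
    out = []
    for v in values:
        # Snap |v| / scale to the nearest E2M1 grid point, restore sign and scale.
        q = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(q * scale * (1.0 if v >= 0 else -1.0))
    return out

group = [0.01 * i for i in range(-8, 8)]  # one 16-value group
deq = quantize_group(group)
print(deq)
```

Each group of 16 weights shares one scale, so storage is roughly 4 bits per weight plus a small per-group overhead, which is how the model lands near ~164GB.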