**Edit:** with vLLM, use `--language-model-only`. I have not fully figured this one out yet.
# Qwen3.5-397B-A17B — REAP 28% Pruned, NVFP4
A personal experiment in aggressive MoE pruning. The goal: fit Qwen3.5-397B on 2× 96GB Blackwell GPUs with usable KV cache (~90K tokens), without losing quality.
## What this is
28% of experts removed using REAP (Router-weighted Expert Activation Pruning) with a saliency × activation-count ordering, then quantized to NVFP4 using llm-compressor. Final size: ~164GB.
Pruning is heterogeneous — each layer retains a different number of experts based on global importance ranking. Early layers (which carry more redundancy) are pruned more aggressively; late layers are barely touched. This maximizes quality retention for a given size budget.
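The global-ranking scheme above can be sketched in a few lines. The scores below are synthetic stand-ins for the saliency × activation-count metric, and the layer/expert counts are hypothetical; this is an illustration of the idea, not the actual REAP implementation:

```python
# Sketch: heterogeneous expert pruning via one global importance ranking.
# Scores are synthetic; the scale factor makes early layers score lower,
# mimicking their higher redundancy.
import random

NUM_LAYERS = 48          # hypothetical layer count
EXPERTS_PER_LAYER = 128  # hypothetical expert count
PRUNE_FRACTION = 0.28

random.seed(0)
scores = {
    (layer, e): random.random() * (0.5 + 0.5 * layer / NUM_LAYERS)
    for layer in range(NUM_LAYERS)
    for e in range(EXPERTS_PER_LAYER)
}

# Rank every expert across all layers, then drop the global bottom 28%.
ranked = sorted(scores, key=scores.get)
num_pruned = int(len(ranked) * PRUNE_FRACTION)
pruned = set(ranked[:num_pruned])

# Each layer ends up retaining its own expert count.
kept_per_layer = [
    sum((layer, e) not in pruned for e in range(EXPERTS_PER_LAYER))
    for layer in range(NUM_LAYERS)
]
print(kept_per_layer[:4], kept_per_layer[-4:])
```

With a global ordering, no layer is forced to a fixed count: early (low-scoring) layers lose many experts while late layers survive nearly intact.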
## Benchmark results (non-thinking, lm-eval-harness)
| Benchmark | This model | Nvidia NVFP4 (full, ~240GB) |
|---|---|---|
| IFEval | 92.55 (avg) | 91.20 |
| MMLU Redux (generative) | 90.94 | 91.24 |
| GSM8K CoT (Llama) | 96.74 | 96.80 |
28% fewer experts, 30%+ smaller on disk, and benchmark scores within noise of the full model.
## Requirements
- vLLM ≥ 0.16.1 (nightly dev builds from the cu130 index work)
- Transformers ≥ 5.3
## vLLM patches required
This model uses variable expert counts per layer (not a fixed number), which stock vLLM doesn't support yet. Two files need patching — see the patches/ directory for detailed instructions:
- `qwen3_next.py` — read the per-layer expert count instead of assuming a single global value
- `qwen3_5.py` — infer the expert count from tensor shape during weight loading
Patches were tested on vLLM 0.16.1rc1.dev188 (nightly cu130).
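The shape-inference trick behind the second patch can be sketched as follows. Tensor names and shapes here are illustrative, not the actual vLLM loader code:

```python
# Sketch: infer each layer's expert count from its checkpoint tensor shape
# instead of trusting a single config value. Stacked expert weights are
# assumed saved as [num_experts, out_dim, in_dim], so the leading dimension
# is the surviving expert count for that layer.

def infer_num_experts(shape: tuple) -> int:
    return shape[0]

# Hypothetical per-layer shapes after heterogeneous pruning:
checkpoint_shapes = {
    "layers.0.mlp.experts.w13": (92, 1024, 512),    # pruned hard
    "layers.47.mlp.experts.w13": (126, 1024, 512),  # barely touched
}

expert_counts = {
    name: infer_num_experts(shape)
    for name, shape in checkpoint_shapes.items()
}
print(expert_counts)
```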
## Details
- Base model: Qwen3.5-397B-A17B
- Pruning: REAP with 1,188 curated calibration samples, saliency × count global ordering
- Quantization: NVFP4 (FP4 E2M1, group size 16, duo scaling)
- Target hardware: 2× RTX PRO 6000
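As a rough illustration of the quantization step: FP4 E2M1 can represent only the magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6} and their negatives, so each group of 16 weights shares a scale that maps the group onto that grid. The sketch below uses simple per-group absmax scaling and omits the second-level ("duo") scale used by the real NVFP4 recipe in llm-compressor:

```python
# Sketch: FP4 (E2M1) group quantization, group size 16, absmax scaling.
# Simplified relative to NVFP4: no second-level (duo) scale.
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GRID = sorted(FP4_MAGNITUDES + [-m for m in FP4_MAGNITUDES])
GROUP = 16

def quantize_group(values):
    # Map the group's absmax onto the largest FP4 magnitude (6.0),
    # then snap each scaled value to the nearest grid point.
    scale = max(abs(v) for v in values) / 6.0 or 1.0
    dequant = [scale * min(GRID, key=lambda g: abs(v / scale - g))
               for v in values]
    return dequant, scale

# Synthetic weights for one group of 16:
weights = [(-1) ** i * (i * 0.13) for i in range(GROUP)]
dequant, scale = quantize_group(weights)
err = max(abs(a - b) for a, b in zip(weights, dequant))
print(f"scale={scale:.4f}  max_abs_err={err:.4f}")
```

Because the E2M1 grid is coarse near its extremes, the per-group scale (and, in NVFP4, the extra scale level) is what keeps the quantization error tolerable.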