**Edit:** with vLLM, use `--language-model-only`. I have not fully figured this one out yet.
# Qwen3.5-397B-A17B — REAP 28% Pruned, NVFP4
A personal experiment in aggressive MoE pruning. The goal: fit Qwen3.5-397B on 2× 96GB Blackwell GPUs with usable KV cache (~90K tokens), without losing quality.
## What this is
28% of experts removed using REAP (Router-weighted Expert Activation Pruning) with a saliency × activation-count ordering, then quantized to NVFP4 using llm-compressor. Final size: ~164GB.
Pruning is heterogeneous — each layer retains a different number of experts based on global importance ranking. Early layers (which carry more redundancy) are pruned more aggressively; late layers are barely touched. This maximizes quality retention for a given size budget.
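The global-ranking scheme above can be sketched in a few lines. The scores below are synthetic stand-ins for the saliency × activation-count metric, and the layer/expert counts are hypothetical; this is an illustration of the idea, not the actual REAP implementation:

```python
# Sketch: heterogeneous expert pruning via one global importance ranking.
# Scores are synthetic; the scale factor makes early layers score lower,
# mimicking their higher redundancy.
import random

NUM_LAYERS = 48          # hypothetical layer count
EXPERTS_PER_LAYER = 128  # hypothetical expert count
PRUNE_FRACTION = 0.28

random.seed(0)
scores = {
    (layer, e): random.random() * (0.5 + 0.5 * layer / NUM_LAYERS)
    for layer in range(NUM_LAYERS)
    for e in range(EXPERTS_PER_LAYER)
}

# Rank every expert across all layers, then drop the global bottom 28%.
ranked = sorted(scores, key=scores.get)
num_pruned = int(len(ranked) * PRUNE_FRACTION)
pruned = set(ranked[:num_pruned])

# Each layer ends up retaining its own expert count.
kept_per_layer = [
    sum((layer, e) not in pruned for e in range(EXPERTS_PER_LAYER))
    for layer in range(NUM_LAYERS)
]
print(kept_per_layer[:4], kept_per_layer[-4:])
```

With a global ordering, no layer is forced to a fixed count: early (low-scoring) layers lose many experts while late layers survive nearly intact.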
## Benchmark results (non-thinking, lm-eval-harness)
| Benchmark | This model | Nvidia NVFP4 (full, ~240GB) |
|---|---|---|
| IFEval | 92.55 (avg) | 91.20 |
| MMLU Redux (generative) | 90.94 | 91.24 |
| GSM8K CoT (Llama) | 96.74 | 96.80 |
28% fewer experts, 30%+ smaller on disk, and benchmark scores within noise of the full model.
## Requirements
- vLLM ≥ 0.16.1 (nightly dev builds from the cu130 index work)
- Transformers ≥ 5.3
## vLLM patches required
This model uses variable expert counts per layer (not a fixed number), which stock vLLM doesn't support yet. Two files need patching — see the patches/ directory for detailed instructions:
- `qwen3_next.py` — read the per-layer expert count instead of assuming a single global value
- `qwen3_5.py` — infer the expert count from tensor shape during weight loading
Patches were tested on vLLM 0.16.1rc1.dev188 (nightly cu130).
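The shape-inference trick behind the second patch can be sketched as follows. Tensor names and shapes here are illustrative, not the actual vLLM loader code:

```python
# Sketch: infer each layer's expert count from its checkpoint tensor shape
# instead of trusting a single config value. Stacked expert weights are
# assumed saved as [num_experts, out_dim, in_dim], so the leading dimension
# is the surviving expert count for that layer.

def infer_num_experts(shape: tuple) -> int:
    return shape[0]

# Hypothetical per-layer shapes after heterogeneous pruning:
checkpoint_shapes = {
    "layers.0.mlp.experts.w13": (92, 1024, 512),    # pruned hard
    "layers.47.mlp.experts.w13": (126, 1024, 512),  # barely touched
}

expert_counts = {
    name: infer_num_experts(shape)
    for name, shape in checkpoint_shapes.items()
}
print(expert_counts)
```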
## Details
- Base model: Qwen3.5-397B-A17B
- Pruning: REAP with 1,188 curated calibration samples, saliency × count global ordering
- Quantization: NVFP4 (FP4 E2M1, group size 16, duo scaling)
- Target hardware: 2× RTX PRO 6000
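As a rough illustration of the quantization step: FP4 E2M1 can represent only the magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6} and their negatives, so each group of 16 weights shares a scale that maps the group onto that grid. The sketch below uses simple per-group absmax scaling and omits the second-level ("duo") scale used by the real NVFP4 recipe in llm-compressor:

```python
# Sketch: FP4 (E2M1) group quantization, group size 16, absmax scaling.
# Simplified relative to NVFP4: no second-level (duo) scale.
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GRID = sorted(FP4_MAGNITUDES + [-m for m in FP4_MAGNITUDES])
GROUP = 16

def quantize_group(values):
    # Map the group's absmax onto the largest FP4 magnitude (6.0),
    # then snap each scaled value to the nearest grid point.
    scale = max(abs(v) for v in values) / 6.0 or 1.0
    dequant = [scale * min(GRID, key=lambda g: abs(v / scale - g))
               for v in values]
    return dequant, scale

# Synthetic weights for one group of 16:
weights = [(-1) ** i * (i * 0.13) for i in range(GROUP)]
dequant, scale = quantize_group(weights)
err = max(abs(a - b) for a, b in zip(weights, dequant))
print(f"scale={scale:.4f}  max_abs_err={err:.4f}")
```

Because the E2M1 grid is coarse near its extremes, the per-group scale (and, in NVFP4, the extra scale level) is what keeps the quantization error tolerable.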