L3.3-Ignition-v0.1-70B (NVFP4 quant)

This repo contains L3.3-Ignition-v0.1-70B quantized with NVFP4, a 4-bit quantization format designed for maximum performance on Nvidia Hopper and Blackwell hardware while retaining 8-bit-like accuracy.

ℹ️ This model is limited to Hopper and Blackwell GPUs and will not work on RTX 3000- and RTX 4000-series GPUs. On unsupported hardware, please use the NVFP4A16 model instead, or enable slow emulation with export VLLM_USE_NVFP4_CT_EMULATIONS=1.

This model requires ~39.8GiB of VRAM. Make sure to set an appropriate context size with --max-model-len in vLLM, and/or quantize the KV cache, and/or spread the model across multiple GPUs, for example with tensor parallelism.
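For illustration, here is how those knobs map onto vLLM's offline Python API. This is a minimal sketch; the context length, KV cache dtype and GPU split below are assumptions, and the running script further down shows the recommended serving setup.

from vllm import LLM, SamplingParams

# Illustrative sketch only: the values below are assumptions, not a tested setup.
llm = LLM(
    model="mratsim/L3.3-Ignition-v0.1-70B-NVFP4",
    max_model_len=65536,          # cap the context window (CLI: --max-model-len)
    kv_cache_dtype="fp8",         # quantize the KV cache (CLI: --kv-cache-dtype fp8)
    tensor_parallel_size=2,       # split the weights across 2 GPUs (CLI: --tensor-parallel-size)
    gpu_memory_utilization=0.90,  # fraction of each GPU's VRAM that vLLM may claim
)

params = SamplingParams(temperature=1.0, min_p=0.03, max_tokens=256)
print(llm.generate(["Write the opening of a heist story."], params)[0].outputs[0].text)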

NVFP4 writeups:

📥 Usage & Running Instructions

The model was tested with vLLM on 1x or 2x RTX Pro 6000; below is a script suitable for such a configuration with a 131072-token context length.

Hardware

As of October 2025, this quantized model can only be run on architectures with hardware FP4 support (Blackwell or later). Cheaper GPUs with 24GB of VRAM (e.g. the RTX 5080 Super) that could run this model in pairs are expected in Q1 2026.

You may still run this model with emulation, albeit slowly, by setting export VLLM_USE_NVFP4_CT_EMULATIONS=1; otherwise use the NVFP4A16 alternative, mratsim/Wayfarer-Large-70B-NVFP4A16.

Recommendations

It is however recommended to use at most 65K of context to avoid significant degradation (https://fiction.live/stories/Fiction-liveBench-Sept-29-2025/oQdzQvKHw8JyXbN87).

This model is recommended with min-p sampling. min-p is available through both the older Text Completions API and the Chat Completions API (as well as the newer Responses API), but most LLM frontends only allow modifying min-p when using Text Completions. You can however use --override-generation-config "${SAMPLER_OVERRIDE}" (as in the script below) to override the default sampler, which vLLM builds by merging generation_config.json with its own defaults.
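If your frontend does not expose min-p, you can also set it per request against vLLM's OpenAI-compatible server, since min_p and repetition_penalty are accepted as vLLM extensions to the OpenAI schema. This is a sketch assuming the server from the running script below, listening on the default port 8000.

from openai import OpenAI

# Sketch assuming the vLLM server below is listening on localhost:8000
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="L3.3-Ignition-v0.1-70B",
    messages=[{"role": "user", "content": "Continue the scene from the tavern."}],
    temperature=1.0,
    # min_p and repetition_penalty are vLLM extensions, passed via extra_body
    extra_body={"min_p": 0.03, "repetition_penalty": 1.03},
)
print(response.choices[0].message.content)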

Running script

# Model configuration (Mandatory)
MODEL="mratsim/L3.3-Ignition-v0.1-70B-NVFP4"
MODELNAME="L3.3-Ignition-v0.1-70B"
GPU_UTIL=0.45
NUM_GPUS=2

# Sampling configuration (Optional, if departing from `generation_config.json`)
SAMPLER_OVERRIDE='{"temperature": 1, "min_p": 0.03, "repetition_penalty": 1.03}'

# Prevent vLLM from using 100% CPU when idle (Very Recommended)
export VLLM_SLEEP_WHEN_IDLE=1

# Use FlashInfer backend (fastest, recommended, "instant" context reprocessing)
export VLLM_ATTENTION_BACKEND=FLASHINFER

vllm serve "${MODEL}" \
  --served-model-name "${MODELNAME}" \
  --tensor-parallel-size "${NUM_GPUS}" \
  --gpu-memory-utilization ${GPU_UTIL} \
  --override-generation-config "${SAMPLER_OVERRIDE}"

ℹ️ The FlashInfer backend may fail with an error similar to "Failed to allocate memory for batch_prefill_tmp_v with size XYZ and alignment 16 in AlignedAllocator".

A workaround is to run the following sed replacement inside the vLLM installation to increase the workspace buffer size:

sed -i 's/FLASHINFER_WORKSPACE_BUFFER_SIZE = 256 \* 1024 \* 1024/FLASHINFER_WORKSPACE_BUFFER_SIZE = 512 \* 1024 \* 1024/g' vllm/v1/attention/backends/flashinfer.py

This will be fixed by PR https://github.com/vllm-project/vllm/pull/25344

🔬 Quantization method

The llmcompressor library was used with the following recipe:

default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: NVFP4

and calibrated on 64 samples of Gryphe/Opus-WritingPrompts at a sequence length of 8192.

NVFP4 quantization requires very few calibration samples; llmcompressor uses 20 in its examples. By comparison, 512 samples are recommended for GPTQ and 64 for AWQ (https://minjiazhang.github.io/courses/fall24-resource/slides/awq.pdf).
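For reference, a minimal llmcompressor sketch of such a run could look like the following (assuming a recent llmcompressor release). The source model path is a placeholder and the dataset may need to be mapped to a plain-text column first, so treat this as an outline rather than the exact script used.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "path/to/L3.3-Ignition-v0.1-70B"  # placeholder: unquantized source model

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Same recipe as above: NVFP4 on all Linear layers, lm_head excluded
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

# 64 calibration samples from Gryphe/Opus-WritingPrompts; depending on the dataset
# schema, rows may need to be mapped to a single "text" field first
ds = load_dataset("Gryphe/Opus-WritingPrompts", split="train").shuffle(seed=42)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=8192,
    num_calibration_samples=64,
)

model.save_pretrained("L3.3-Ignition-v0.1-70B-NVFP4", save_compressed=True)
tokenizer.save_pretrained("L3.3-Ignition-v0.1-70B-NVFP4")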
