---
license: llama3.3
base_model:
- invisietch/L3.3-Ignition-v0.1-70B
pipeline_tag: text-generation
tags:
- text adventure
- roleplay
- rpg
- creative writing
- nvfp4
- vllm
- conversational
- nvfp4a16
---
# L3.3-Ignition-v0.1-70B (NVFP4A16 quant)
This repo contains L3.3-Ignition-v0.1-70B quantized with NVFP4A16, a 4-bit weight compression suitable for maximum performance on a wide range of hardware with 8-bit-like accuracy.
> ℹ️ Unlike the NVFP4 format (4-bit weights + 4-bit activations), NVFP4A16 keeps activations in 16-bit, so it is not limited to Blackwell GPUs and will be supported efficiently in vLLM on RTX 3000- and RTX 4000-series GPUs.
- Original model: [invisietch/L3.3-Ignition-v0.1-70B](https://huggingface.co/invisietch/L3.3-Ignition-v0.1-70B)
- Hopper and Blackwell optimized model: [mratsim/L3.3-Ignition-v0.1-70B-NVFP4](https://huggingface.co/mratsim/L3.3-Ignition-v0.1-70B-NVFP4)
This model requires ~39.8 GiB of VRAM.
Make sure to set an appropriate context size with `--max-model-len` in vLLM, and/or quantize the KV cache, and/or use multiple GPUs, for example with tensor parallelism.
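As an illustration, here is a minimal offline-inference sketch in Python; the same knobs exist as `vllm serve` flags (`--max-model-len`, `--kv-cache-dtype`, `--tensor-parallel-size`). The 65536-token context, fp8 KV cache and 2-GPU split below are illustrative assumptions, not requirements:
```python
from vllm import LLM, SamplingParams

# Cap the context and quantize the KV cache so the cache fits comfortably
# next to the ~39.8 GiB of weights (values below are illustrative).
llm = LLM(
    model="mratsim/L3.3-Ignition-v0.1-70B-NVFP4A16",
    max_model_len=65536,          # see the context-length recommendation below
    kv_cache_dtype="fp8",         # quantized KV cache
    tensor_parallel_size=2,       # or 1 on a single large GPU
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=1.0, min_p=0.03, max_tokens=256)
outputs = llm.generate(["You awaken in a torch-lit dungeon."], params)
print(outputs[0].outputs[0].text)
```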
NVFP4 writeups:
- https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/
- https://arxiv.org/pdf/2509.25149
## 📥 Usage & Running Instructions
The model was tested with vLLM on 1x or 2x RTX Pro 6000; below is a script suitable for such a configuration with a 131072-token context length.
### Recommendations
It is however recommended to use only a 65K context to avoid significant quality degradation (see https://fiction.live/stories/Fiction-liveBench-Sept-29-2025/oQdzQvKHw8JyXbN87).
This model is recommended with "min-p" sampling. Min-p is available through
both the older Text Completions API and the Chat Completions API (as well as the newer Responses API),
however most LLM frontends only expose min-p when using Text Completions.
You can also use `--override-generation-config "${SAMPLER_OVERRIDE}"` to override the server-side sampler defaults (a merge of `generation_config.json` and vLLM defaults), as done in the running script below; a per-request example is shown after that script.
### Running script
```bash
# Model configuration (Mandatory)
MODEL="mratsim/L3.3-Ignition-v0.1-70B-NVFP4A16"
MODELNAME="L3.3-Ignition-v0.1-70B"
GPU_UTIL=0.45
NUM_GPUS=2
# Sampling configuration (Optional, if departing from `generation_config.json`)
SAMPLER_OVERRIDE='{"temperature": 1, "min_p": 0.03, "repetition_penalty": 1.03}'
# Prevent vLLM from using 100% CPU when idle (Very Recommended)
export VLLM_SLEEP_WHEN_IDLE=1
# Use FlashInfer backend (fastest, recommended, "instant" context reprocessing)
export VLLM_ATTENTION_BACKEND=FLASHINFER
vllm serve "${MODEL}" \
--served-model-name "${MODELNAME}" \
--tensor-parallel-size "${NUM_GPUS}" \
--gpu-memory-utilization ${GPU_UTIL} \
--override-generation-config "${SAMPLER_OVERRIDE}"
```
> ℹ️ The FlashInfer backend may fail with an error similar to
> `Failed to allocate memory for batch_prefill_tmp_v with size XYZ and alignment 16 in AlignedAllocator`.
>
> A workaround is to run a sed replacement within the vLLM install to increase the workspace buffer size:
> ```bash
> sed -i 's/FLASHINFER_WORKSPACE_BUFFER_SIZE = 256 \* 1024 \* 1024/FLASHINFER_WORKSPACE_BUFFER_SIZE = 512 \* 1024 \* 1024/g' vllm/v1/attention/backends/flashinfer.py
> ```
> This will be fixed by PR https://github.com/vllm-project/vllm/pull/25344
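Once the server above is running, min-p can also be set per request through vLLM's OpenAI-compatible endpoints, for example with the `openai` Python client via `extra_body`. The base URL (default port 8000) and served model name below are assumptions matching the script above:
```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; port and model name match the serve script above
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="L3.3-Ignition-v0.1-70B",
    prompt="You awaken in a torch-lit dungeon.",
    max_tokens=256,
    temperature=1.0,
    # vLLM-specific sampling parameters are passed through extra_body
    extra_body={"min_p": 0.03, "repetition_penalty": 1.03},
)
print(completion.choices[0].text)
```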
## 🔬 Quantization method
The llmcompressor library was used with the following recipe:
```yaml
default_stage:
default_modifiers:
QuantizationModifier:
targets: [Linear]
ignore: [lm_head]
scheme: NVFP4A16
```
NVFP4A16 doesn't require any calibration dataset.
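For reference, a minimal sketch of how such a recipe can be applied with llm-compressor; this is an assumption based on the library's standard `oneshot` workflow (exact imports and arguments may differ between llmcompressor versions), not the exact script used for this repo:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "invisietch/L3.3-Ignition-v0.1-70B"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Same recipe as the YAML above: NVFP4A16 on all Linear layers except lm_head
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4A16", ignore=["lm_head"])

# Weight-only quantization: no calibration dataset is needed
oneshot(model=model, recipe=recipe)

SAVE_DIR = "L3.3-Ignition-v0.1-70B-NVFP4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```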