---
|
|
license: llama3.3 |
|
|
base_model: |
|
|
- invisietch/L3.3-Ignition-v0.1-70B |
|
|
pipeline_tag: text-generation |
|
|
tags: |
|
|
- text adventure |
|
|
- roleplay |
|
|
- rpg |
|
|
- creative writing |
|
|
- nvfp4 |
|
|
- vllm |
|
|
- conversational |
|
|
- nvfp4a16 |
|
|
--- |
|
|
# L3.3-Ignition-v0.1-70B (NVFP4A16 quant) |
|
|
|
|
|
This repo contains L3.3-Ignition-v0.1-70B quantized with NVFP4A16, a 4-bit weight compression format that delivers high performance on all hardware with accuracy comparable to 8-bit quantization.
|
|
|
|
|
> ℹ️ Unlike the NVFP4 format (4-bit weights + 4-bit activations), NVFP4A16 is not limited to Blackwell GPUs and will be supported efficiently by vLLM on RTX 3000- and RTX 4000-series GPUs.
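
You can verify the scheme by inspecting the compressed-tensors `quantization_config` embedded in this repo's `config.json`, for instance with the sketch below (assuming `huggingface-cli` and `jq` are installed):

```bash
# Sketch: fetch only config.json and print the quantization section.
huggingface-cli download mratsim/L3.3-Ignition-v0.1-70B-NVFP4A16 config.json --local-dir .
jq '.quantization_config' config.json
```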
|
|
|
|
|
- Original Model:
  - [invisietch/L3.3-Ignition-v0.1-70B](https://huggingface.co/invisietch/L3.3-Ignition-v0.1-70B)
- Hopper and Blackwell optimized model:
  - [mratsim/L3.3-Ignition-v0.1-70B-NVFP4](https://huggingface.co/mratsim/L3.3-Ignition-v0.1-70B-NVFP4)
|
|
|
|
|
This model requires ~39.8 GiB of VRAM.
Make sure to set an appropriate context size with `--max-model-len` in vLLM, and/or quantize the KV cache, and/or use multiple GPUs, for example with tensor parallelism.
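
A minimal single-GPU launch that trades context length and KV-cache precision for VRAM could look like the following sketch (flag values are illustrative, not a tuned configuration):

```bash
# Sketch: cap the context and quantize the KV cache to fit in less VRAM.
vllm serve mratsim/L3.3-Ignition-v0.1-70B-NVFP4A16 \
  --max-model-len 65536 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.90
```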
|
|
|
|
|
NVFP4 writeups: |
|
|
- https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/ |
|
|
- https://arxiv.org/pdf/2509.25149 |
|
|
|
|
|
## 📥 Usage & Running Instructions |
|
|
|
|
|
The model was tested with vLLM on 1x or 2x RTX Pro 6000 GPUs. Below is a launch script suitable for such a configuration with a 131072-token context length.
|
|
|
|
|
### Recommendations |
|
|
|
|
|
It is however recommended to limit the context to 65K tokens to avoid significant degradation in long-context comprehension (see https://fiction.live/stories/Fiction-liveBench-Sept-29-2025/oQdzQvKHw8JyXbN87).
|
|
|
|
|
This model is recommended with min-p sampling. Min-p is available through both the older Text Completions API and the Chat Completions API (and the newer Responses API), however most LLM frontends only expose min-p when using Text Completions.
You can however pass `--override-generation-config "${SAMPLER_OVERRIDE}"` to vLLM to override the sampler defaults (a merge of `generation_config.json` and vLLM defaults) server-side, as done in the script below.
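
For reference, min-p can also be passed per request to vLLM's OpenAI-compatible Text Completions endpoint. A minimal sketch (the endpoint and served model name match the script below; the prompt is a placeholder):

```bash
# Sketch: per-request min-p sampling via the Text Completions API.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "L3.3-Ignition-v0.1-70B",
    "prompt": "Once upon a time",
    "max_tokens": 200,
    "temperature": 1,
    "min_p": 0.03,
    "repetition_penalty": 1.03
  }'
```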
|
|
|
|
|
### Running script |
|
|
|
|
|
```bash |
|
|
# Model configuration (Mandatory) |
|
|
MODEL="mratsim/L3.3-Ignition-v0.1-70B-NVFP4A16" |
|
|
MODELNAME="L3.3-Ignition-v0.1-70B" |
|
|
GPU_UTIL=0.45 |
|
|
NUM_GPUS=2 |
|
|
|
|
|
# Sampling configuration (Optional, if departing from `generation_config.json`) |
|
|
SAMPLER_OVERRIDE='{"temperature": 1, "min_p": 0.03, "repetition_penalty": 1.03}' |
|
|
|
|
|
# Prevent vLLM from using 100% CPU when idle (Very Recommended) |
|
|
export VLLM_SLEEP_WHEN_IDLE=1 |
|
|
|
|
|
# Use FlashInfer backend (fastest, recommended, "instant" context reprocessing) |
|
|
export VLLM_ATTENTION_BACKEND=FLASHINFER |
|
|
|
|
|
vllm serve "${MODEL}" \ |
|
|
--served-model-name "${MODELNAME}" \ |
|
|
--tensor-parallel-size "${NUM_GPUS}" \ |
|
|
--gpu-memory-utilization ${GPU_UTIL} \ |
|
|
--override-generation-config "${SAMPLER_OVERRIDE}" |
|
|
``` |
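
Once the server is up, a quick sanity check that the model is served under the expected name (assuming the default port):

```bash
# List the models exposed by the OpenAI-compatible server.
curl http://localhost:8000/v1/models
```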
|
|
|
|
|
> ℹ️ The FlashInfer backend may fail with an error similar to |
|
|
> `Failed to allocate memory for batch_prefill_tmp_v with size XYZ and alignment 16 in AlignedAllocator`. |
|
|
> |
|
|
> A workaround is to run a sed replacement inside the vLLM installation to increase the workspace buffer size:
|
|
> ```bash |
|
|
> sed -i 's/FLASHINFER_WORKSPACE_BUFFER_SIZE = 256 \* 1024 \* 1024/FLASHINFER_WORKSPACE_BUFFER_SIZE = 512 \* 1024 \* 1024/g' vllm/v1/attention/backends/flashinfer.py |
|
|
> ``` |
|
|
> This will be fixed by PR https://github.com/vllm-project/vllm/pull/25344 |
|
|
|
|
|
## 🔬 Quantization method |
|
|
|
|
|
The llmcompressor library was used with the following recipe: |
|
|
|
|
|
```yaml |
|
|
default_stage: |
|
|
default_modifiers: |
|
|
QuantizationModifier: |
|
|
targets: [Linear] |
|
|
ignore: [lm_head] |
|
|
scheme: NVFP4A16 |
|
|
``` |
|
|
|
|
|
Since only the weights are quantized (activations remain in 16-bit), NVFP4A16 doesn't require any calibration dataset.