Mention NVFP4A16 alternative

README.md CHANGED

@@ -16,10 +16,16 @@ tags:
 ---
 # L3.3-Ignition-v0.1-70B (NVFP4 quant)
 
-This repo contains L3.3-Ignition-v0.1-70B quantized with NVFP4, a 4-bit compression suitable for max performance on Nvidia Blackwell hardware
+This repo contains L3.3-Ignition-v0.1-70B quantized with NVFP4, a 4-bit compression suitable for max performance on Nvidia Hopper and Blackwell hardware, with 8-bit-like accuracy.
 
-
+> ℹ️ This model is limited to Hopper and Blackwell GPUs and will not work with RTX 3000- and RTX 4000-series GPUs.
+> On those GPUs, please use the NVFP4A16 model instead, or enable slow emulation with `export VLLM_USE_NVFP4_CT_EMULATIONS=1`.
+
+- Original Model:
 - [invisietch/L3.3-Ignition-v0.1-70B](https://huggingface.co/invisietch/L3.3-Ignition-v0.1-70B)
+- Fallback model for RTX 3000- and 4000-series GPUs:
+- [mratsim/L3.3-Ignition-v0.1-70B-NVFP4A16](https://huggingface.co/mratsim/L3.3-Ignition-v0.1-70B-NVFP4A16)
+
 
 This model requires ~39.8GiB of VRAM.
 Make sure to set an appropriate context size with `--max-model-len` in vLLM, and/or quantize the KV cache, and/or use multiple GPUs with, for example, tensor parallelism.
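
For context, a minimal sketch of how the memory levers mentioned above (context size, KV-cache quantization, tensor parallelism) might be combined in a single vLLM launch. This is not part of the README's own script; the flag values (65K context, 0.45 GPU utilization, 2 GPUs) are illustrative and borrowed from figures that appear elsewhere in this diff.

```bash
# Illustrative only: cap the context to limit KV-cache memory, quantize the
# KV cache to FP8, and split the ~39.8GiB of weights across two GPUs.
vllm serve mratsim/L3.3-Ignition-v0.1-70B-NVFP4 \
  --max-model-len 65536 \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.45
```
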
@@ -37,6 +43,9 @@ The model was tested with vLLM + 1x or 2x RTX Pro 6000, here is a script suitabl
 As of October 2025, this quantized model can only be run on architectures with hardware FP4 support (Blackwell or later).
 Cheaper GPUs with 24GB of VRAM (RTX 5080 Super) that can run this model in pairs are expected in Q1 2026.
 
+You may still run this model with emulation, albeit slowly, by setting `export VLLM_USE_NVFP4_CT_EMULATIONS=1`,
+or use the alternative [mratsim/L3.3-Ignition-v0.1-70B-NVFP4A16](https://huggingface.co/mratsim/L3.3-Ignition-v0.1-70B-NVFP4A16).
+
 ### Recommendations
 
 It is however recommended to use only 65K context to avoid significant degradation (https://fiction.live/stories/Fiction-liveBench-Sept-29-2025/oQdzQvKHw8JyXbN87).
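
As a sketch of the emulation fallback described in the added lines above, assuming the standard `vllm serve` entry point (this command is not part of the original script):

```bash
# Slow path for GPUs without hardware FP4 support: emulate the NVFP4 kernels.
export VLLM_USE_NVFP4_CT_EMULATIONS=1
vllm serve mratsim/L3.3-Ignition-v0.1-70B-NVFP4 --max-model-len 65536
```
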
@@ -50,7 +59,7 @@ You can however use `--override-generation-config "${SAMPLER_JSONCONFIG}"` to ov
 
 ```bash
 # Model configuration (Mandatory)
-MODEL="mratsim/L3.3-Ignition-v0.1-70B"
+MODEL="mratsim/L3.3-Ignition-v0.1-70B-NVFP4"
 MODELNAME="L3.3-Ignition-v0.1-70B"
 GPU_UTIL=0.45
 NUM_GPUS=2
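
The rest of the launch script lies outside this hunk. As a hedged sketch, the variables above might feed `vllm serve` roughly as follows; the `SAMPLER_JSONCONFIG` value and the `--served-model-name` flag are assumptions for illustration, not taken from the script.

```bash
# Hypothetical continuation of the script: wire the variables into vLLM.
SAMPLER_JSONCONFIG='{"temperature": 0.8, "min_p": 0.05}'  # example values, not from the script

vllm serve "${MODEL}" \
  --served-model-name "${MODELNAME}" \
  --gpu-memory-utilization "${GPU_UTIL}" \
  --tensor-parallel-size "${NUM_GPUS}" \
  --override-generation-config "${SAMPLER_JSONCONFIG}"
```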