Mention NVFP4A16 alternative

README.md CHANGED

@@ -16,10 +16,16 @@ tags:
 ---
 # L3.3-Ignition-v0.1-70B (NVFP4 quant)
 
-This repo contains L3.3-Ignition-v0.1-70B quantized with NVFP4, a 4-bit compression suitable for max performance on Nvidia Blackwell hardware
+This repo contains L3.3-Ignition-v0.1-70B quantized with NVFP4, a 4-bit compression suitable for max performance on Nvidia Hopper and Blackwell hardware, with 8-bit-like accuracy.
 
-
+> ℹ️ This model is limited to Hopper and Blackwell GPUs and will not work with RTX 3000- and RTX 4000-series GPUs.
+> On those GPUs, please use the NVFP4A16 model instead, or enable slow emulation with `export VLLM_USE_NVFP4_CT_EMULATIONS=1`.
+
+- Original Model:
 - [invisietch/L3.3-Ignition-v0.1-70B](https://huggingface.co/invisietch/L3.3-Ignition-v0.1-70B)
+- Fallback model for RTX 3000- and 4000-series GPUs:
+- [mratsim/L3.3-Ignition-v0.1-70B-NVFP4A16](https://huggingface.co/mratsim/L3.3-Ignition-v0.1-70B-NVFP4A16)
+
 
 This model requires ~39.8GiB of VRAM.
 Make sure to set an appropriate context size with `--max-model-len` in vLLM, and/or quantize the KV cache, and/or use multiple GPUs with, for example, tensor parallelism.
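
For context, a minimal sketch of how the memory levers mentioned above (context size, KV-cache quantization, tensor parallelism) might be combined in a single vLLM launch. This is not part of the README's own script; the flag values (65K context, 0.45 GPU utilization, 2 GPUs) are illustrative and borrowed from figures that appear elsewhere in this diff.

```bash
# Illustrative only: cap the context to limit KV-cache memory, quantize the
# KV cache to FP8, and split the ~39.8GiB of weights across two GPUs.
vllm serve mratsim/L3.3-Ignition-v0.1-70B-NVFP4 \
  --max-model-len 65536 \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.45
```
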
@@ -37,6 +43,9 @@ The model was tested with vLLM + 1x or 2x RTX Pro 6000, here is a script suitabl
 As of October 2025, this quantized model can only be run on architectures with hardware FP4 support (Blackwell or later).
 Cheaper GPUs with 24GB of VRAM (RTX 5080 Super) that can run this model in pairs are expected in Q1 2026.
 
+You may still run this model with emulation, albeit slowly, by setting `export VLLM_USE_NVFP4_CT_EMULATIONS=1`,
+or use the alternative [mratsim/L3.3-Ignition-v0.1-70B-NVFP4A16](https://huggingface.co/mratsim/L3.3-Ignition-v0.1-70B-NVFP4A16).
+
 ### Recommendations
 
 It is however recommended to use only 65K context to avoid significant degradation (https://fiction.live/stories/Fiction-liveBench-Sept-29-2025/oQdzQvKHw8JyXbN87).
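
As a sketch of the emulation fallback described in the added lines above, assuming the standard `vllm serve` entry point (this command is not part of the original script):

```bash
# Slow path for GPUs without hardware FP4 support: emulate the NVFP4 kernels.
export VLLM_USE_NVFP4_CT_EMULATIONS=1
vllm serve mratsim/L3.3-Ignition-v0.1-70B-NVFP4 --max-model-len 65536
```
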
@@ -50,7 +59,7 @@ You can however use `--override-generation-config "${SAMPLER_JSONCONFIG}"` to ov
 
 ```bash
 # Model configuration (Mandatory)
-MODEL="mratsim/L3.3-Ignition-v0.1-70B"
+MODEL="mratsim/L3.3-Ignition-v0.1-70B-NVFP4"
 MODELNAME="L3.3-Ignition-v0.1-70B"
 GPU_UTIL=0.45
 NUM_GPUS=2
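
The rest of the launch script lies outside this hunk. As a hedged sketch, the variables above might feed `vllm serve` roughly as follows; the `SAMPLER_JSONCONFIG` value and the `--served-model-name` flag are assumptions for illustration, not taken from the script.

```bash
# Hypothetical continuation of the script: wire the variables into vLLM.
SAMPLER_JSONCONFIG='{"temperature": 0.8, "min_p": 0.05}'  # example values, not from the script

vllm serve "${MODEL}" \
  --served-model-name "${MODELNAME}" \
  --gpu-memory-utilization "${GPU_UTIL}" \
  --tensor-parallel-size "${NUM_GPUS}" \
  --override-generation-config "${SAMPLER_JSONCONFIG}"
```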