mratsim committed
Commit 24c4796 · verified · 1 Parent(s): 2f0381e

Mention NVFP4A16 alternative

Files changed (1): README.md (+12 -3)
README.md CHANGED
````diff
@@ -16,10 +16,16 @@ tags:
 ---
 # L3.3-Ignition-v0.1-70B (NVFP4 quant)
 
-This repo contains L3.3-Ignition-v0.1-70B quantized with NVFP4, a 4-bit compression suitable for max performance on Nvidia Blackwell hardware (2x RTX 5090, RTX Pro 6000, B200, B300, ...) with 8-bit-like accuracy.
+This repo contains L3.3-Ignition-v0.1-70B quantized with NVFP4, a 4-bit compression suitable for maximum performance on Nvidia Hopper and Blackwell hardware with 8-bit-like accuracy.
 
-Original Model:
+> ℹ️ This model is limited to Hopper and Blackwell GPUs and will not work on RTX 3000- and RTX 4000-series GPUs.
+> On those GPUs, please use the NVFP4A16 model instead, or enable slow emulation with `export VLLM_USE_NVFP4_CT_EMULATIONS=1`.
+
+- Original Model:
 - [invisietch/L3.3-Ignition-v0.1-70B](https://huggingface.co/invisietch/L3.3-Ignition-v0.1-70B)
+- Fallback model for RTX 3000- and 4000-series GPUs:
+  - [mratsim/L3.3-Ignition-v0.1-70B-NVFP4A16](https://huggingface.co/mratsim/L3.3-Ignition-v0.1-70B-NVFP4A16)
+
 
 This model requires ~39.8GiB of VRAM.
 Make sure to set an appropriate context size `--max-model-len` in vLLM, and/or quantize the KV cache, and/or use multiple GPUs with, for example, tensor parallelism.
@@ -37,6 +43,9 @@ The model was tested with vLLM + 1x or 2x RTX Pro 6000, here is a script suitable
 As of October 2025, this quantized model can only be run on architectures with hardware FP4 support (Blackwell or later).
 Cheaper GPUs with 24GB of VRAM (RTX 5080 Super) that can run this model in pairs are expected in Q1 2026.
 
+You may still run this model with emulation, albeit slowly, by setting `export VLLM_USE_NVFP4_CT_EMULATIONS=1`;
+otherwise use the alternative [mratsim/L3.3-Ignition-v0.1-70B-NVFP4A16](https://huggingface.co/mratsim/L3.3-Ignition-v0.1-70B-NVFP4A16).
+
 ### Recommendations
 
 It is, however, recommended to use only 65K context to avoid significant degradation (https://fiction.live/stories/Fiction-liveBench-Sept-29-2025/oQdzQvKHw8JyXbN87)
@@ -50,7 +59,7 @@ You can however use `--override-generation-config "${SAMPLER_JSONCONFIG}"` to override
 
 ```bash
 # Model configuration (Mandatory)
-MODEL="mratsim/L3.3-Ignition-v0.1-70B"
+MODEL="mratsim/L3.3-Ignition-v0.1-70B-NVFP4"
 MODELNAME="L3.3-Ignition-v0.1-70B"
 GPU_UTIL=0.45
 NUM_GPUS=2
````
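
For context on the VRAM and context-length guidance above (`--max-model-len`, KV-cache quantization, tensor parallelism), here is a minimal sketch of a vLLM launch built around the configuration block shown in the last hunk. It is a sketch only, not the repository's full launch script (which this commit does not reproduce); the 65K context length and FP8 KV cache are assumptions taken from the recommendations quoted above.

```bash
#!/usr/bin/env bash
# Minimal sketch, not the repository's actual launch script.
# Variables mirror the configuration block shown in the diff above.
MODEL="mratsim/L3.3-Ignition-v0.1-70B-NVFP4"
MODELNAME="L3.3-Ignition-v0.1-70B"
GPU_UTIL=0.45   # fraction of each GPU's VRAM that vLLM may allocate
NUM_GPUS=2      # tensor-parallel degree (model split across both GPUs)

# ~65K context per the recommendation above; an FP8 KV cache further
# reduces VRAM pressure at long context.
vllm serve "${MODEL}" \
  --served-model-name "${MODELNAME}" \
  --gpu-memory-utilization "${GPU_UTIL}" \
  --tensor-parallel-size "${NUM_GPUS}" \
  --max-model-len 65536 \
  --kv-cache-dtype fp8
```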
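
The hardware note points at `VLLM_USE_NVFP4_CT_EMULATIONS` for GPUs without hardware FP4 support. A sketch of how that fallback would be enabled before launching, assuming the slowdown is acceptable:

```bash
# Fallback for GPUs without NVFP4 hardware support (e.g. RTX 3000/4000 series):
# either switch to the NVFP4A16 model linked above, or enable slow emulation.
export VLLM_USE_NVFP4_CT_EMULATIONS=1
vllm serve "mratsim/L3.3-Ignition-v0.1-70B-NVFP4"
```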
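
The last hunk's context also mentions `--override-generation-config "${SAMPLER_JSONCONFIG}"`. The repository's recommended sampler values are not part of this diff, so the JSON below is purely illustrative of the mechanism: the flag takes a JSON string whose fields override the model's `generation_config.json` defaults.

```bash
# Illustrative only: the repo's recommended sampler settings are not shown in this commit.
SAMPLER_JSONCONFIG='{"temperature": 0.8, "top_p": 0.95, "repetition_penalty": 1.05}'

vllm serve "mratsim/L3.3-Ignition-v0.1-70B-NVFP4" \
  --override-generation-config "${SAMPLER_JSONCONFIG}"
```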