# Strawberrylemonade-L3-70B-v1.1 (NVFP4A16 quant)
This repo contains Strawberrylemonade-L3-70B-v1.1 quantized with NVFP4A16, a 4-bit compression format suitable for maximum performance on all hardware with 8-bit-like accuracy.

ℹ️ Unlike the NVFP4 format (4-bit weights + 4-bit activations), NVFP4A16 (4-bit weights + 16-bit activations) is not limited to Blackwell GPUs and will be supported efficiently in vLLM on RTX 3000- and RTX 4000-series GPUs.
Original Model:
- sophosympatheia/Strawberrylemonade-L3-70B-v1.1

Hopper and Blackwell optimized model:
- mratsim/Strawberrylemonade-L3-70B-v1.1-NVFP4
This model requires ~39.8 GiB of VRAM.
Make sure to set an appropriate context size (`--max-model-len`) in vLLM, and/or quantize the KV cache, and/or use multiple GPUs with, for example, tensor parallelism.
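For example, context length, KV-cache quantization, and tensor parallelism are all controlled with standard vLLM flags. A minimal sketch; the values below are illustrative and should be adapted to your GPUs:

```bash
# Minimal sketch: cap the context length, quantize the KV cache to FP8,
# and split the model across two GPUs with tensor parallelism.
# Values are illustrative; adjust them to your hardware.
vllm serve mratsim/Strawberrylemonade-L3-70B-v1.1-NVFP4A16 \
  --max-model-len 65536 \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 2
```

A more complete launch script is given in the running instructions below.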
NVFP4 writeups:
- https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/
- https://arxiv.org/pdf/2509.25149
## 📥 Usage & Running Instructions
The model was tested with vLLM and 1x or 2x RTX Pro 6000; here is a script suitable for such a configuration with a 131072-token context length.
### Hardware
As of October 2025, this quantized model can only be run on architectures with hardware FP4 support (Blackwell or later). Cheaper GPUs with 24GB of VRAM (RTX 5080 Super) that can run this model in pairs are expected in Q1 2026.
### Recommendations
It is however recommended to use only 65K of context to avoid significant degradation (https://fiction.live/stories/Fiction-liveBench-Sept-29-2025/oQdzQvKHw8JyXbN87).

This model is recommended with "min-p" sampling. This sampling is available through
both the older Text Completions API and the Chat Completions API (and the newer Responses API);
however, most LLM frontends only support modifying min-p when using Text Completions.
You can nonetheless use `--override-generation-config "${SAMPLER_OVERRIDE}"` to override the sampler (which is a merge of `generation_config.json` and vLLM defaults), as done in the running script below.
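If your frontend cannot set min-p, you can also pass it directly in a raw Text Completions request to vLLM's OpenAI-compatible server. A minimal sketch, assuming the server from the script below is running on the default port 8000 and serving the model under the name configured there:

```bash
# Sketch: Text Completions request with min-p sampling.
# Assumes the vLLM server below is running on the default port 8000.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Strawberrylemonade-L3-70B-v1.1",
        "prompt": "Once upon a time",
        "max_tokens": 128,
        "temperature": 1,
        "min_p": 0.03,
        "repetition_penalty": 1.03
      }'
```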
### Running script
```bash
# Model configuration (Mandatory)
MODEL="mratsim/Strawberrylemonade-L3-70B-v1.1-NVFP4A16"
MODELNAME="Strawberrylemonade-L3-70B-v1.1"
GPU_UTIL=0.45
NUM_GPUS=2

# Sampling configuration (Optional, if departing from `generation_config.json`)
SAMPLER_OVERRIDE='{"temperature": 1, "min_p": 0.03, "repetition_penalty": 1.03}'

# Prevent vLLM from using 100% CPU when idle (Very Recommended)
export VLLM_SLEEP_WHEN_IDLE=1

# Use FlashInfer backend (fastest, recommended, "instant" context reprocessing)
export VLLM_ATTENTION_BACKEND=FLASHINFER

vllm serve "${MODEL}" \
  --served-model-name "${MODELNAME}" \
  --tensor-parallel-size "${NUM_GPUS}" \
  --gpu-memory-utilization ${GPU_UTIL} \
  --override-generation-config "${SAMPLER_OVERRIDE}"
```
ℹ️ The FlashInfer backend may fail with an error similar to
`Failed to allocate memory for batch_prefill_tmp_v with size XYZ and alignment 16 in AlignedAllocator`.
A workaround is running a sed replacement within the vLLM install to increase the workspace buffer size:

```bash
sed -i 's/FLASHINFER_WORKSPACE_BUFFER_SIZE = 256 \* 1024 \* 1024/FLASHINFER_WORKSPACE_BUFFER_SIZE = 512 \* 1024 \* 1024/g' vllm/v1/attention/backends/flashinfer.py
```

This will be fixed by PR https://github.com/vllm-project/vllm/pull/25344.
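Once the server is up, a quick way to confirm it is serving the model under the expected name (assuming the default port 8000):

```bash
# List the models exposed by the OpenAI-compatible endpoint
curl http://localhost:8000/v1/models
```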
## 🔬 Quantization method
The llmcompressor library was used with the following recipe:
```yaml
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: NVFP4A16
```
NVFP4A16 doesn't require any calibration dataset.
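If you want to double-check the scheme stored in this repo, the `config.json` written by llmcompressor embeds the compression metadata. A small sketch, assuming the model has been downloaded to a local directory named after the repo:

```bash
# Print the quantization metadata embedded in the checkpoint's config.json.
# The path is illustrative; point it at your local copy of the model.
python3 -c "import json; print(json.dumps(json.load(open('Strawberrylemonade-L3-70B-v1.1-NVFP4A16/config.json'))['quantization_config'], indent=2))"
```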