Strawberrylemonade-L3-70B-v1.1 (NVFP4A16 quant)

This repo contains Strawberrylemonade-L3-70B-v1.1 quantized with NVFP4A16, a 4-bit weight-only compression that targets maximum performance on all hardware while retaining 8-bit-like accuracy.

ℹ️ Unlike the NVFP4 format (4-bit weights + 4-bit activations), NVFP4A16 is not limited to Blackwell GPUs and will be supported efficiently in vLLM on RTX 3000- and RTX 4000-series GPUs.

Original Model:

This model requires ~39.8 GiB of VRAM. Make sure to set an appropriate context size with --max-model-len in vLLM, and/or quantize the KV cache, and/or split the model across multiple GPUs, for example with tensor parallelism.
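For illustration, here is a minimal launch sketch that combines those three levers; the 65536 context length, fp8 KV cache and 2-GPU tensor parallelism are example values, not requirements:

# Example only: reduce context, quantize the KV cache to fp8, and split across 2 GPUs
vllm serve mratsim/Strawberrylemonade-L3-70B-v1.1-NVFP4A16 \
  --max-model-len 65536 \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 2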

NVFP4 writeups:

📥 Usage & Running Instructions

The model was tested with vLLM on 1x and 2x RTX Pro 6000; below is a script suitable for such a configuration with a 131072-token context length.

Hardware

As of October 2025, this quantized model can only be run on architectures with hardware FP4 support (Blackwell or later). Cheaper GPUs with 24GB of VRAM (e.g. the RTX 5080 Super), which could run this model in pairs, are expected in Q1 2026.

Recommendations

It is however recommended to use only 65K of context to avoid significant quality degradation at longer contexts (https://fiction.live/stories/Fiction-liveBench-Sept-29-2025/oQdzQvKHw8JyXbN87).

This model is recommended with "min-p" sampling. Min-p is available through both the older Text Completions API and the Chat Completions API (and the newer Responses API), however most LLM frontends only expose min-p when using Text Completions. You can also use --override-generation-config "${SAMPLER_OVERRIDE}" to override the server-side sampler defaults (which are a merge of generation_config.json and vLLM defaults), as done in the script below.
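For example, min-p can also be set per request through vLLM's OpenAI-compatible Text Completions endpoint; the prompt, port and sampling values below are placeholders:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Strawberrylemonade-L3-70B-v1.1",
    "prompt": "Once upon a time",
    "max_tokens": 256,
    "temperature": 1,
    "min_p": 0.03,
    "repetition_penalty": 1.03
  }'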

Running script

# Model configuration (Mandatory)
MODEL="mratsim/Strawberrylemonade-L3-70B-v1.1-NVFP4A16"
MODELNAME="Strawberrylemonade-L3-70B-v1.1"
GPU_UTIL=0.45
NUM_GPUS=2

# Sampling configuration (Optional, if departing from `generation_config.json`)
SAMPLER_OVERRIDE='{"temperature": 1, "min_p": 0.03, "repetition_penalty": 1.03}'

# Prevent vLLM from using 100% CPU when idle (Very Recommended)
export VLLM_SLEEP_WHEN_IDLE=1

# Use FlashInfer backend (fastest, recommended, "instant" context reprocessing)
export VLLM_ATTENTION_BACKEND=FLASHINFER

vllm serve "${MODEL}" \
  --served-model-name "${MODELNAME}" \
  --tensor-parallel-size "${NUM_GPUS}" \
  --gpu-memory-utilization ${GPU_UTIL} \
  --override-generation-config "${SAMPLER_OVERRIDE}"
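Once the server is up, a quick smoke test against the OpenAI-compatible endpoints (a sketch assuming the default port 8000):

# List the served model name
curl -s http://localhost:8000/v1/models

# Send a short chat completion request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Strawberrylemonade-L3-70B-v1.1",
    "messages": [{"role": "user", "content": "Write a two-sentence story about a lighthouse."}],
    "max_tokens": 128
  }'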

ℹ️ The FlashInfer backend may fail with an error similar to "Failed to allocate memory for batch_prefill_tmp_v with size XYZ and alignment 16 in AlignedAllocator".

A workaround is to run a sed replacement inside the vLLM installation to increase the FlashInfer workspace buffer size:

sed -i 's/FLASHINFER_WORKSPACE_BUFFER_SIZE = 256 \* 1024 \* 1024/FLASHINFER_WORKSPACE_BUFFER_SIZE = 512 \* 1024 \* 1024/g' vllm/v1/attention/backends/flashinfer.py
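The flashinfer.py path is relative to the vLLM package directory; here is a sketch for locating it in the active environment and applying the same replacement (assumes vLLM is importable):

# Locate the installed vllm package and patch the workspace buffer size in place
VLLM_DIR=$(python -c "import os, vllm; print(os.path.dirname(vllm.__file__))")
sed -i 's/FLASHINFER_WORKSPACE_BUFFER_SIZE = 256 \* 1024 \* 1024/FLASHINFER_WORKSPACE_BUFFER_SIZE = 512 \* 1024 \* 1024/g' \
  "${VLLM_DIR}/v1/attention/backends/flashinfer.py"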

This will be fixed by PR https://github.com/vllm-project/vllm/pull/25344

🔬 Quantization method

The llmcompressor library was used with the following recipe:

default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: NVFP4A16

NVFP4A16 is a weight-only scheme and doesn't require any calibration dataset.
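To double-check the scheme baked into the published checkpoint, the quantization_config section of the repo's config.json can be inspected directly (sketch; assumes curl and jq are available):

curl -s https://huggingface.co/mratsim/Strawberrylemonade-L3-70B-v1.1-NVFP4A16/resolve/main/config.json \
  | jq '.quantization_config'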
