# Strawberrylemonade-L3-70B-v1.1 (NVFP4A16 quant)
This repo contains Strawberrylemonade-L3-70B-v1.1 quantized with NVFP4A16, a 4-bit compression format suitable for maximum performance on all hardware with 8-bit-like accuracy.

ℹ️ Unlike the NVFP4 format (4-bit weights + 4-bit activations), NVFP4A16 (4-bit weights + 16-bit activations) is not limited to Blackwell GPUs and will be supported efficiently in vLLM on RTX 3000- and RTX 4000-series GPUs.
Original Model:
- sophosympatheia/Strawberrylemonade-L3-70B-v1.1

Hopper and Blackwell optimized model:
- mratsim/Strawberrylemonade-L3-70B-v1.1-NVFP4
This model requires ~39.8 GiB of VRAM.
Make sure to set an appropriate context size (`--max-model-len`) in vLLM, and/or quantize the KV cache, and/or use multiple GPUs with, for example, tensor parallelism.
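For example, context length, KV-cache quantization, and tensor parallelism are all controlled with standard vLLM flags. A minimal sketch; the values below are illustrative and should be adapted to your GPUs:

```bash
# Minimal sketch: cap the context length, quantize the KV cache to FP8,
# and split the model across two GPUs with tensor parallelism.
# Values are illustrative; adjust them to your hardware.
vllm serve mratsim/Strawberrylemonade-L3-70B-v1.1-NVFP4A16 \
  --max-model-len 65536 \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 2
```

A more complete launch script is given in the running instructions below.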
NVFP4 writeups:
- https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/
- https://arxiv.org/pdf/2509.25149
## 📥 Usage & Running Instructions
The model was tested with vLLM and 1x or 2x RTX Pro 6000; here is a script suitable for such a configuration with a 131072-token context length.
### Hardware
As of October 2025, this quantized model can only be run on architectures with hardware FP4 support (Blackwell or later). Cheaper GPUs with 24GB of VRAM (RTX 5080 Super) that can run this model in pairs are expected in Q1 2026.
### Recommendations
It is however recommended to use only 65K of context to avoid significant degradation (https://fiction.live/stories/Fiction-liveBench-Sept-29-2025/oQdzQvKHw8JyXbN87).

This model is recommended with "min-p" sampling. This sampling is available through
both the older Text Completions API and the Chat Completions API (and the newer Responses API);
however, most LLM frontends only support modifying min-p when using Text Completions.
You can nonetheless use `--override-generation-config "${SAMPLER_OVERRIDE}"` to override the sampler (which is a merge of `generation_config.json` and vLLM defaults), as done in the running script below.
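If your frontend cannot set min-p, you can also pass it directly in a raw Text Completions request to vLLM's OpenAI-compatible server. A minimal sketch, assuming the server from the script below is running on the default port 8000 and serving the model under the name configured there:

```bash
# Sketch: Text Completions request with min-p sampling.
# Assumes the vLLM server below is running on the default port 8000.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Strawberrylemonade-L3-70B-v1.1",
        "prompt": "Once upon a time",
        "max_tokens": 128,
        "temperature": 1,
        "min_p": 0.03,
        "repetition_penalty": 1.03
      }'
```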
### Running script
```bash
# Model configuration (Mandatory)
MODEL="mratsim/Strawberrylemonade-L3-70B-v1.1-NVFP4A16"
MODELNAME="Strawberrylemonade-L3-70B-v1.1"
GPU_UTIL=0.45
NUM_GPUS=2

# Sampling configuration (Optional, if departing from `generation_config.json`)
SAMPLER_OVERRIDE='{"temperature": 1, "min_p": 0.03, "repetition_penalty": 1.03}'

# Prevent vLLM from using 100% CPU when idle (Very Recommended)
export VLLM_SLEEP_WHEN_IDLE=1

# Use FlashInfer backend (fastest, recommended, "instant" context reprocessing)
export VLLM_ATTENTION_BACKEND=FLASHINFER

vllm serve "${MODEL}" \
  --served-model-name "${MODELNAME}" \
  --tensor-parallel-size "${NUM_GPUS}" \
  --gpu-memory-utilization ${GPU_UTIL} \
  --override-generation-config "${SAMPLER_OVERRIDE}"
```
ℹ️ The FlashInfer backend may fail with an error similar to
`Failed to allocate memory for batch_prefill_tmp_v with size XYZ and alignment 16 in AlignedAllocator`.
A workaround is running a sed replacement within the vLLM install to increase the workspace buffer size:

```bash
sed -i 's/FLASHINFER_WORKSPACE_BUFFER_SIZE = 256 \* 1024 \* 1024/FLASHINFER_WORKSPACE_BUFFER_SIZE = 512 \* 1024 \* 1024/g' vllm/v1/attention/backends/flashinfer.py
```

This will be fixed by PR https://github.com/vllm-project/vllm/pull/25344.
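Once the server is up, a quick way to confirm it is serving the model under the expected name (assuming the default port 8000):

```bash
# List the models exposed by the OpenAI-compatible endpoint
curl http://localhost:8000/v1/models
```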
## 🔬 Quantization method
The llmcompressor library was used with the following recipe:
```yaml
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: NVFP4A16
```
NVFP4A16 doesn't require any calibration dataset.
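If you want to double-check the scheme stored in this repo, the `config.json` written by llmcompressor embeds the compression metadata. A small sketch, assuming the model has been downloaded to a local directory named after the repo:

```bash
# Print the quantization metadata embedded in the checkpoint's config.json.
# The path is illustrative; point it at your local copy of the model.
python3 -c "import json; print(json.dumps(json.load(open('Strawberrylemonade-L3-70B-v1.1-NVFP4A16/config.json'))['quantization_config'], indent=2))"
```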