Model Overview

  • Model Architecture: ApertusForCausalLM
    • Input: Text
    • Output: Text
  • Model Optimizations:
    • Weight quantization: INT4
  • Release Date: 9/22/2025
  • Version: 1.0
  • Model Developers: Red Hat

Quantized version of swiss-ai/Apertus-70B-Instruct-2509.

Model Optimizations

This model was obtained by quantizing the weights of swiss-ai/Apertus-70B-Instruct-2509 to the INT4 data type. This optimization reduces the number of bits per weight from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights of the linear operators within transformers blocks are quantized; activations remain at their original 16-bit precision (w4a16).
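
For intuition, here is a minimal sketch of group-wise symmetric INT4 weight quantization, the idea behind the w4a16 scheme. The group size of 128 and the rounding choices are illustrative assumptions, not the exact recipe used for this checkpoint.

import torch

def quantize_w4a16(weight: torch.Tensor, group_size: int = 128):
    """Round a 16-bit weight matrix to symmetric INT4, one scale per group of columns.

    Illustrative sketch: assumes in_features is divisible by group_size,
    and group_size=128 is an assumption rather than the recipe used here.
    """
    out_features, in_features = weight.shape
    w = weight.float().reshape(out_features, in_features // group_size, group_size)

    # One scale per group, chosen so the largest magnitude maps to the INT4 extreme (+/-7).
    scales = (w.abs().amax(dim=-1, keepdim=True) / 7.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scales), -8, 7).to(torch.int8)  # 4-bit values stored in int8

    # At inference, the kernel dequantizes these weights and multiplies them
    # against activations kept at 16-bit precision (hence "w4a16").
    w_deq = (q.float() * scales).reshape(out_features, in_features)
    return q, scales, w_deq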

Deployment

Use with vLLM

  1. Initialize vLLM server:
vllm serve RedHatAI/Apertus-70B-Instruct-2509-quantized.w4a16
  2. Send requests to the server:
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/Apertus-70B-Instruct-2509-quantized.w4a16"

messages = [
    {"role": "user", "content": "Give me a short introduction to large language model."},
]

outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
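
vLLM also supports offline (serverless) inference through its Python API. The sketch below is an alternative to the server workflow above; tensor_parallel_size=2 and the sampling settings are assumptions about the deployment, not values prescribed by this model card.

from vllm import LLM, SamplingParams

model_id = "RedHatAI/Apertus-70B-Instruct-2509-quantized.w4a16"

# tensor_parallel_size=2 is an assumption; set it to the number of available GPUs.
llm = LLM(model=model_id, tensor_parallel_size=2)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
messages = [
    {"role": "user", "content": "Give me a short introduction to large language models."},
]

# chat() applies the model's chat template before generation.
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)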

Creation

This model was created with llm-compressor by running the code snippet below.

Model Creation Code
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model_stub = "swiss-ai/Apertus-70B-Instruct-2509"
model_name = model_stub.split("/")[-1]

model = AutoModelForCausalLM.from_pretrained(model_stub, dtype="auto")

tokenizer = AutoTokenizer.from_pretrained(model_stub)

# Configure the quantization algorithm and scheme
recipe = QuantizationModifier(
    ignore=["lm_head"],
    targets="Linear",
    scheme="FP8_dynamic",
)

# Apply quantization
oneshot(
    model=model,
    recipe=recipe,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w4a16"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")

Evaluation

The model was evaluated on OpenLLM Leaderboard V1, using the following command:

Evaluation Commands

OpenLLM Leaderboard V1:

lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Apertus-70B-Instruct-2509-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=2,gpu_memory_utilization=0.2,enable_chunked_prefill=True \
  --tasks openllm \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config

Accuracy

| Category   | Metric                            | swiss-ai/Apertus-70B-Instruct-2509 | RedHatAI/Apertus-70B-Instruct-2509-quantized.w4a16 | Recovery (%) |
|------------|-----------------------------------|------------------------------------|----------------------------------------------------|--------------|
| OpenLLM V1 | ARC-Challenge (Acc-Norm, 25-shot) | 70.82                              | 70.65                                              | 99.8         |
| OpenLLM V1 | GSM8K (Strict-Match, 5-shot)      | 73.69                              | 73.45                                              | 99.7         |
| OpenLLM V1 | HellaSwag (Acc-Norm, 10-shot)     | 86.23                              | 85.67                                              | 99.4         |
| OpenLLM V1 | MMLU (Acc, 5-shot)                | 69.21                              | 68.25                                              | 98.6         |
| OpenLLM V1 | TruthfulQA (MC2, 0-shot)          | 60.31                              | 60.55                                              | 100.4        |
| OpenLLM V1 | Winogrande (Acc, 5-shot)          | 80.74                              | 80.03                                              | 99.1         |
| OpenLLM V1 | Average Score                     | 73.50                              | 73.10                                              | 99.5         |
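
Recovery is simply the quantized score expressed as a percentage of the baseline score. A small helper, shown with the ARC-Challenge row as a worked example:

def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score, rounded to one decimal."""
    return round(100 * quantized / baseline, 1)

print(recovery(70.82, 70.65))  # 99.8, matching the ARC-Challenge row above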