# DeepSeek-R1-0528-Qwen3-8B-KV
Enterprise-grade OCP FP8 quantized DeepSeek-R1-0528-Qwen3-8B for AMD ROCm, end-to-end KV-cache in FP8 with Quark
## Introduction
DeepSeek-R1-0528-Qwen3-8B-KV is a full-pipeline, OCP-compliant FP8_e4m3 quantization of deepseek-ai/DeepSeek-R1-0528-Qwen3-8B, built with AMD Quark and optimized for AMD Instinct GPUs. It delivers roughly 1.8× memory savings and a throughput boost over FP16, with only a marginal perplexity increase (10.88 → 11.00 on WikiText2).
## Quantization Strategy

- Quantizer: AMD Quark v0.9+
- Numeric Format: OCP FP8_e4m3, symmetric, per-tensor
- Scope: all `Linear` layers (excluding `lm_head`), activations, and the KV cache
- Group Size: 128 (block-aligned)
- Calibration: 128 Pile samples (default)
- Metadata: quantization scales embedded in the config JSON and SafeTensors shards (see the sketch below)
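The embedded scales can be checked directly from a downloaded snapshot. A minimal inspection sketch; the local path and the `*_scale` tensor-name pattern are assumptions about Quark's export layout, not guarantees:

```python
# Inspect the quantization metadata of a local snapshot of this repo.
# The path and the "*_scale" tensor-name pattern are illustrative assumptions.
import json
from pathlib import Path

from safetensors import safe_open

ckpt_dir = Path("DeepSeek-R1-0528-Qwen3-8B-KV")  # local snapshot directory

# The quantization recipe is typically recorded in config.json.
config = json.loads((ckpt_dir / "config.json").read_text())
print(config.get("quantization_config", "no quantization_config found"))

# Scan each shard for tensors whose names look like quantization scales.
for shard in sorted(ckpt_dir.glob("*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        scale_keys = [k for k in f.keys() if k.endswith("_scale")]
        print(f"{shard.name}: {len(scale_keys)} scale tensors, e.g. {scale_keys[:3]}")
```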
 
## Performance Snapshot
| Metric | FP16 Baseline | FP8_e4m3 Quantized | 
|---|---|---|
| WikiText2 Perplexity | 10.88 | 11.00 |
| Memory Footprint | 1.0× | 0.56× | 
## Quick Start
### Serve with vLLM
Override the model's maximum context length if needed:

```bash
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
```

Serve:

```bash
HIP_VISIBLE_DEVICES=0 vllm serve EliovpAI/DeepSeek-R1-0528-Qwen3-8B-KV \
  --kv-cache-dtype fp8 \
  --num-scheduler-steps 10
  # ... other arguments
```
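Once the server is up, you can query it through the OpenAI-compatible API. A minimal sketch, assuming vLLM's default `http://localhost:8000/v1` endpoint and the `openai` Python client:

```python
# Minimal client sketch: sends one chat request to the vLLM server started above.
# The base URL and port assume vLLM defaults; adjust if you changed them.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="EliovpAI/DeepSeek-R1-0528-Qwen3-8B-KV",
    messages=[{"role": "user", "content": "Explain FP8 quantization in one paragraph."}],
    max_tokens=256,
    temperature=0.6,
)
print(response.choices[0].message.content)
```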
### Benchmark
```bash
python3 /vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --model EliovpAI/DeepSeek-R1-0528-Qwen3-8B-KV \
  --dataset-name sharegpt \
  --dataset-path /vllm/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 32 \
  --random-range-ratio 1.0 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --sharegpt-output-len 256
```
## Evaluation
We benchmarked on WikiText2 using vLLM's /v1/completions PPL metric:

- FP16 (DeepSeek-R1-0528-Qwen3-8B) → 10.88 PPL
- FP8_e4m3 (this model) → 11.00 PPL
 
The ~0.12-point PPL delta buys substantial memory and speed gains, with virtually imperceptible quality loss in most benchmarks.
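If you want to reproduce a similar measurement against the running server, the sketch below derives perplexity from prompt log-probabilities returned by `/v1/completions`. It assumes the server honors `echo` together with `logprobs` for prompt scoring (support varies across vLLM versions), so treat it as an illustration rather than the exact evaluation script used for the numbers above:

```python
# Illustrative perplexity probe against a running vLLM server.
# Assumes /v1/completions returns prompt logprobs when echo=True is set,
# which depends on the vLLM version; not the exact script used above.
import math
import requests

URL = "http://localhost:8000/v1/completions"
text = "The quick brown fox jumps over the lazy dog."

resp = requests.post(URL, json={
    "model": "EliovpAI/DeepSeek-R1-0528-Qwen3-8B-KV",
    "prompt": text,
    "max_tokens": 1,   # generate a single throwaway token; only prompt logprobs are used
    "echo": True,      # ask the server to echo prompt tokens with their logprobs
    "logprobs": 1,
    "temperature": 0.0,
}).json()

token_logprobs = resp["choices"][0]["logprobs"]["token_logprobs"]
# Drop the generated token at the end; the first prompt token has no logprob (None).
prompt_lps = [lp for lp in token_logprobs[:-1] if lp is not None]
ppl = math.exp(-sum(prompt_lps) / len(prompt_lps))
print(f"perplexity over {len(prompt_lps)} tokens: {ppl:.2f}")
```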
## License
This model reuses the DeepSeek-R1-0528-Qwen3-8B license.