# DeepSeek-R1-0528-Qwen3-8B-KV
Enterprise-grade OCP FP8 quantized DeepSeek-R1-0528-Qwen3-8B for AMD ROCm, end-to-end KV-cache in FP8 with Quark
## Introduction
DeepSeek-R1-0528-Qwen3-8B-KV is a full-pipeline, OCP-compliant FP8_e4m3 quantization of deepseek-ai/DeepSeek-R1-0528-Qwen3-8B, built with AMD Quark and optimized for AMD Instinct GPUs. It delivers roughly 1.8× memory savings and a throughput boost over FP16, with only a marginal perplexity increase (10.88 → 11.00 on WikiText2).
## Quantization Strategy

- Quantizer: AMD Quark v0.9+
- Numeric Format: OCP FP8_e4m3, symmetric, per-tensor
- Scope: all `Linear` layers (excluding `lm_head`), activations, and the KV cache
- Group Size: 128 (block-aligned)
- Calibration: 128 Pile samples (default)
- Metadata: quantization scales embedded in the config JSON and SafeTensors shards (see the sketch below)
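The embedded scales can be checked directly from a downloaded snapshot. A minimal inspection sketch; the local path and the `*_scale` tensor-name pattern are assumptions about Quark's export layout, not guarantees:

```python
# Inspect the quantization metadata of a local snapshot of this repo.
# The path and the "*_scale" tensor-name pattern are illustrative assumptions.
import json
from pathlib import Path

from safetensors import safe_open

ckpt_dir = Path("DeepSeek-R1-0528-Qwen3-8B-KV")  # local snapshot directory

# The quantization recipe is typically recorded in config.json.
config = json.loads((ckpt_dir / "config.json").read_text())
print(config.get("quantization_config", "no quantization_config found"))

# Scan each shard for tensors whose names look like quantization scales.
for shard in sorted(ckpt_dir.glob("*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        scale_keys = [k for k in f.keys() if k.endswith("_scale")]
        print(f"{shard.name}: {len(scale_keys)} scale tensors, e.g. {scale_keys[:3]}")
```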
 
## Performance Snapshot
| Metric | FP16 Baseline | FP8_e4m3 Quantized | 
|---|---|---|
| WikiText2 Perplexity | 10.88 | 11.00 |
| Memory Footprint | 1.0× | 0.56× | 
## Quick Start
### Serve with vLLM
Override the model's maximum context length if needed:

```bash
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
```

Serve:

```bash
HIP_VISIBLE_DEVICES=0 vllm serve EliovpAI/DeepSeek-R1-0528-Qwen3-8B-KV \
  --kv-cache-dtype fp8 \
  --num-scheduler-steps 10
  # ... other arguments
```
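Once the server is up, you can query it through the OpenAI-compatible API. A minimal sketch, assuming vLLM's default `http://localhost:8000/v1` endpoint and the `openai` Python client:

```python
# Minimal client sketch: sends one chat request to the vLLM server started above.
# The base URL and port assume vLLM defaults; adjust if you changed them.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="EliovpAI/DeepSeek-R1-0528-Qwen3-8B-KV",
    messages=[{"role": "user", "content": "Explain FP8 quantization in one paragraph."}],
    max_tokens=256,
    temperature=0.6,
)
print(response.choices[0].message.content)
```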
### Benchmark
```bash
python3 /vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --model EliovpAI/DeepSeek-R1-0528-Qwen3-8B-KV \
  --dataset-name sharegpt \
  --dataset-path /vllm/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 32 \
  --random-range-ratio 1.0 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --sharegpt-output-len 256
```
## Evaluation
We benchmarked on WikiText2 using vLLM's /v1/completions PPL metric:

- FP16 (DeepSeek-R1-0528-Qwen3-8B) → 10.88 PPL
- FP8_e4m3 (this model) → 11.00 PPL
 
The ~0.12-point PPL delta buys substantial memory and speed gains, with virtually imperceptible quality loss in most benchmarks.
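If you want to reproduce a similar measurement against the running server, the sketch below derives perplexity from prompt log-probabilities returned by `/v1/completions`. It assumes the server honors `echo` together with `logprobs` for prompt scoring (support varies across vLLM versions), so treat it as an illustration rather than the exact evaluation script used for the numbers above:

```python
# Illustrative perplexity probe against a running vLLM server.
# Assumes /v1/completions returns prompt logprobs when echo=True is set,
# which depends on the vLLM version; not the exact script used above.
import math
import requests

URL = "http://localhost:8000/v1/completions"
text = "The quick brown fox jumps over the lazy dog."

resp = requests.post(URL, json={
    "model": "EliovpAI/DeepSeek-R1-0528-Qwen3-8B-KV",
    "prompt": text,
    "max_tokens": 1,   # generate a single throwaway token; only prompt logprobs are used
    "echo": True,      # ask the server to echo prompt tokens with their logprobs
    "logprobs": 1,
    "temperature": 0.0,
}).json()

token_logprobs = resp["choices"][0]["logprobs"]["token_logprobs"]
# Drop the generated token at the end; the first prompt token has no logprob (None).
prompt_lps = [lp for lp in token_logprobs[:-1] if lp is not None]
ppl = math.exp(-sum(prompt_lps) / len(prompt_lps))
print(f"perplexity over {len(prompt_lps)} tokens: {ppl:.2f}")
```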
## License
This model reuses the DeepSeek-R1-0528-Qwen3-8B license.