AntAngelMed-eagle3

Model Overview

AntAngelMed-eagle3 is a high-performance draft model designed for inference acceleration. It leverages EAGLE3 speculative sampling to balance inference speed with generation stability.

The model is trained on high-quality medical datasets, significantly boosting inference throughput while maintaining accuracy, making it well suited for high-load production environments.

Key Features

  • Speculative Sampling Optimization: built on EAGLE3, achieving a high verification pass rate with a speculative length of 4
  • Outstanding Throughput: the FP8 quantization + EAGLE3 combination improves throughput by up to ~90%
  • Production-Grade Optimization: 3267 tokens/s output throughput on a single NVIDIA H200

Performance

Speculative Sampling Efficiency

Average acceptance length with a speculative length of 4:

| Benchmark | Average Acceptance Length |
|---|---|
| HumanEval | 2.816 |
| GSM8K | 3.24 |
| Math-500 | 3.326 |
| Med_MCPA | 2.600 |
| Health_Bench | 2.446 |
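As a rough intuition for what these numbers mean, the sketch below estimates an upper bound on speedup from acceptance length. This is a simplified back-of-envelope model, not the measurement methodology used above: it assumes the reported acceptance length is the number of tokens emitted per target-model verification pass, and the `draft_cost_ratio` (cost of one draft step relative to one target pass) is an illustrative guess.

```python
# Simplified speculative-decoding speedup estimate (assumptions noted above).
def estimated_speedup(acceptance_length: float,
                      draft_steps: int = 3,
                      draft_cost_ratio: float = 0.05) -> float:
    # Autoregressive decoding emits 1 token per target pass; speculative
    # decoding emits `acceptance_length` tokens per target pass, plus the
    # cost of the draft steps that produced the candidate tokens.
    return acceptance_length / (1.0 + draft_steps * draft_cost_ratio)

# Acceptance lengths from the table above.
for name, a in [("HumanEval", 2.816), ("GSM8K", 3.24), ("Math-500", 3.326),
                ("Med_MCPA", 2.600), ("Health_Bench", 2.446)]:
    print(f"{name}: ~{estimated_speedup(a):.2f}x upper bound")
```

Real-world gains are lower than this bound (scheduling, batching, and verification overhead), which is consistent with the measured throughput improvements in the next section.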

Throughput Improvement

Throughput improvement of FP8 quantization + EAGLE3 over the FP8-only baseline at a concurrency of 16:

| Benchmark | Throughput Improvement |
|---|---|
| HumanEval | +67.3% |
| GSM8K | +58.6% |
| Math-500 | +89.8% |
| Med_MCPA | +46% |
| Health_Bench | +45.3% |

Ultimate Inference Performance

  • Hardware Environment: NVIDIA H200 single GPU

*Figure: Throughput performance comparison and accuracy metrics under equal compute on 1×H200*

Technical Specifications

  • Model Architecture: LlamaForCausalLMEagle3
  • Number of Layers: 1 layer (Draft Model)
  • Hidden Size: 4096
  • Attention Heads: 32 (KV heads: 8)
  • Intermediate Size: 14336
  • Vocabulary Size: 157,184
  • Max Position Embeddings: 32,768
  • Data Type: bfloat16
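The specifications above are enough for a back-of-envelope estimate of the draft model's KV-cache footprint. The sketch below assumes `head_dim = hidden_size / attention_heads = 4096 / 32 = 128` (not stated explicitly above) and bfloat16 keys/values at 2 bytes each.

```python
# KV-cache size estimate for this 1-layer GQA draft model.
def kv_cache_bytes_per_token(num_layers: int = 1,
                             num_kv_heads: int = 8,
                             head_dim: int = 128,
                             bytes_per_elem: int = 2) -> int:
    # Factor of 2 covers the separate key and value caches.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

per_token = kv_cache_bytes_per_token()    # 4096 bytes per token
full_context = per_token * 32_768         # at max position embeddings
print(f"{per_token} B/token, {full_context / 2**20:.0f} MiB at 32k context")
```

With only one layer and 8 KV heads, the draft model's cache is a small fraction of the target model's, which is what makes running it alongside the target cheap.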

Quick Start

Requirements

  • NVIDIA H200-class compute (or comparable GPU)
  • CUDA 12.0+
  • PyTorch 2.0+

Installation

```shell
pip install sglang==0.5.6
```

Then apply the changes from PR https://github.com/sgl-project/sglang/pull/15119.

Inference with SGLang

```shell
python3 -m sglang.launch_server \
    --model-path MedAIBase/AntAngelMed-FP8 \
    --host 0.0.0.0 --port 30012 \
    --trust-remote-code \
    --attention-backend fa3 \
    --mem-fraction-static 0.9 \
    --tp-size 1 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path MedAIBase/AntAngelMed-eagle3 \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4
```
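Once the server is up, it can be queried through SGLang's OpenAI-compatible HTTP endpoint. The snippet below is a minimal client sketch; the prompt is illustrative, and the model name passed in the payload is an assumption (query `GET /v1/models` on the running server to confirm the registered name).

```python
import requests

def build_chat_request(prompt: str, base_url: str = "http://localhost:30012"):
    """Build the URL and JSON payload for a chat completion request."""
    url = f"{base_url}/v1/chat/completions"
    payload = {
        "model": "MedAIBase/AntAngelMed-FP8",  # assumed registered name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return url, payload

if __name__ == "__main__":
    url, payload = build_chat_request("List two common symptoms of anemia.")
    try:
        resp = requests.post(url, json=payload, timeout=10)
        print(resp.json()["choices"][0]["message"]["content"])
    except requests.exceptions.RequestException:
        print("server not reachable at", url)
```

Speculative decoding is transparent to the client: requests are identical to those for a plain FP8 deployment, with the draft model used internally during generation.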

Training Data

  • Data Quality: Rigorously filtered and cleaned to ensure high-quality training data

Use Cases

  • High-concurrency inference services
  • Real-time dialogue systems
  • Code generation and completion
  • Mathematical reasoning and computation
  • Production environments requiring low-latency responses

Open Source Contribution

We actively contribute back to the open-source community. Related optimizations have been submitted upstream to SGLang (see PR https://github.com/sgl-project/sglang/pull/15119).

Limitations and Notes

  • This is a draft model; it must be paired with a target model to perform speculative sampling
  • FP8 quantization is recommended for optimal performance
  • Performance may vary across different hardware platforms
  • Medical domain applications must comply with relevant regulations; model outputs are for reference only

License

This code repository is licensed under the MIT License.
