This EAGLE-3 draft model was trained with SpecForge; you can find the exact training parameters here: my GitHub.

I trained this model because, when I was testing the NVIDIA one, it didn't perform as well as I expected, and I noticed that the original EAGLE-3 author recommends using SpecForge for draft-model training in the README. So here it is.

Training Settings

  • Dataset: the built-in dataset pipeline provided by SpecForge; in this case, I used this ShareGPT one
  • All other settings are the same as in the examples/run_llama4_eagle3_online.sh script provided by SpecForge; only ttt-length was reduced from 7 to 6.
  • This model was trained on 8 NVIDIA H200 GPUs for 7 days (7 epochs).
  • As a side note, ttt-length=6 results in an OOM error when running the second consecutive epoch, so I simply restarted (and resumed) after every epoch; each epoch takes a full 24 hours anyway (see the sketch below).
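
A minimal sketch of the per-epoch restart workaround mentioned above: it just wraps SpecForge's example script in a loop. How the script is told to train a single epoch and resume from the latest checkpoint is an assumption here; adapt it to whatever examples/run_llama4_eagle3_online.sh exposes in your SpecForge checkout.

  # Hypothetical wrapper: run one epoch per invocation to avoid the ttt-length=6
  # OOM that shows up on the second consecutive epoch. Assumes the example script
  # has been edited to use --ttt-length 6, train exactly one epoch per run, and
  # resume from the latest checkpoint on restart.
  for epoch in 1 2 3 4 5 6 7; do
    bash examples/run_llama4_eagle3_online.sh
  done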

Inference Settings & Benchmarks

  • Inference Framework: SGLang version 0.4.9.post3

  • Inference Backend: FlashAttention-3 (as of now, FlashInfer is still not available for Llama 4)

  • Hardware: 8 × NVIDIA H200 GPUs

  • Workflow:

    • First, I searched for the best speculative-decoding settings (tuned on the NVIDIA model; my model uses the same parameters), using the benchmark script provided by SGLang:

      python scripts/playground/bench_speculative.py \
        --model-path ../cache/meta-llama/Llama-4-Maverick-17B-128E \
        --speculative-draft-model-path nvidia/Llama-4-Maverick-17B-128E-Eagle3 \
        --steps <different step sizes, e.g., '3 4 5'> \
        --topk <different top k, e.g., '8 10 12'> \
        --num_draft_tokens <different draft-token counts, e.g., '12 24 36'> \
        --batch-size <different batch sizes, e.g., '1 2 4'> \
        --trust-remote-code
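
    • For reference, here is a concrete instantiation of the sweep; the values below are inferred from the result rows shown next, so treat them as illustrative rather than the exact grid that was run:

      # Hypothetical sweep covering the (steps, topk, num_draft_tokens, batch_size)
      # combinations that show up in the results below.
      python scripts/playground/bench_speculative.py \
        --model-path ../cache/meta-llama/Llama-4-Maverick-17B-128E \
        --speculative-draft-model-path nvidia/Llama-4-Maverick-17B-128E-Eagle3 \
        --steps '2 3 4 5' \
        --topk '8 12' \
        --num_draft_tokens '10 24 36' \
        --batch-size '1 2 4' \
        --trust-remote-code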
      
    • These are the grid-search results:

      ...
      {"batch_size": 1, "steps": 3, "topk": 8, "num_draft_tokens": 10, "acc_length": 1.042, "step_time": 0.00984, "speed": 105.927, "completion_tokens": 512.0}
      {"batch_size": 1, "steps": 4, "topk": 8, "num_draft_tokens": 10, "acc_length": 1.042, "step_time": 0.01037, "speed": 100.455, "completion_tokens": 512.0}
      {"batch_size": 1, "steps": 5, "topk": 8, "num_draft_tokens": 10, "acc_length": 1.042, "step_time": 0.01084, "speed": 96.094, "completion_tokens": 512.0}
      {"batch_size": 2, "steps": 3, "topk": 8, "num_draft_tokens": 10, "acc_length": 1.044, "step_time": 0.01112, "speed": 93.896, "completion_tokens": 512.0}
      {"batch_size": 2, "steps": 4, "topk": 8, "num_draft_tokens": 10, "acc_length": 1.044, "step_time": 0.01168, "speed": 89.335, "completion_tokens": 512.0}
      {"batch_size": 2, "steps": 5, "topk": 8, "num_draft_tokens": 10, "acc_length": 1.044, "step_time": 0.01217, "speed": 85.767, "completion_tokens": 512.0}
      {"batch_size": 4, "steps": 3, "topk": 8, "num_draft_tokens": 10, "acc_length": 1.041, "step_time": 0.01341, "speed": 77.611, "completion_tokens": 512.0}
      {"batch_size": 4, "steps": 4, "topk": 8, "num_draft_tokens": 10, "acc_length": 1.041, "step_time": 0.01406, "speed": 74.038, "completion_tokens": 512.0}
      {"batch_size": 4, "steps": 5, "topk": 8, "num_draft_tokens": 10, "acc_length": 1.041, "step_time": 0.01463, "speed": 71.149, "completion_tokens": 512.0}
      {"batch_size": 1, "steps": 2, "topk": 12, "num_draft_tokens": 24, "acc_length": 1.053, "step_time": 0.01064, "speed": 98.92, "completion_tokens": 512.0}
      {"batch_size": 1, "steps": 3, "topk": 12, "num_draft_tokens": 24, "acc_length": 1.052, "step_time": 0.0108, "speed": 97.34, "completion_tokens": 512.0}
      {"batch_size": 1, "steps": 3, "topk": 12, "num_draft_tokens": 36, "acc_length": 1.052, "step_time": 0.01172, "speed": 89.79, "completion_tokens": 512.0}
      ...
      

      From the results above,

      • batch_size=1
      • steps=3
      • topk=8
      • num_draft_tokens=10

      seem to yield the best results (a mechanical way to rank the rows is sketched below).
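
      If you collect the printed JSON lines into a file (results.jsonl is a hypothetical name here), jq can pick the fastest configuration mechanically:

      # Select the row with the highest decode speed (tok/s).
      jq -s 'max_by(.speed)' results.jsonl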

  • Benchmark numbers:

    Item         Base Model    Eagle3 (NVDA)   Eagle3 (Mine)
    Throughput   44.8 tok/s    71.2 tok/s      105.93 tok/s
    Mean TTFT    161.49 ms     51.74 ms        46.81 ms
    Mean TPOT    5.16 ms       4.15 ms         2.48 ms

    As the table above shows, it came as a surprise to me that my model was considerably faster than the NVIDIA one; at the same time, I am concerned that I might have done something wrong that biased the benchmark itself. Even if the results are valid, I have no idea why my model turns out to be faster. (P.S. If you know the reason, please don't hesitate to reach out to me.) (I forgot to screenshot the terminal output when the inference runs finished, so you'll just have to trust the table above.)
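
    The metrics in the table (output throughput, mean TTFT, mean TPOT) are the ones SGLang's serving benchmark reports, so a run along the following lines against a launched server should produce comparable numbers. This is a sketch rather than the exact command used; the dataset choice and prompt count are assumptions:

      # Benchmark a running SGLang server; adjust --num-prompts, --request-rate,
      # and the dataset to match the workload you care about.
      python3 -m sglang.bench_serving \
        --backend sglang \
        --dataset-name random \
        --num-prompts 100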

  • Using this model

    • First, install the inference framework of your choice (either vLLM or SGLang should be fine, but as of 19-Aug-2025, vLLM still doesn't support Llama 4 with EAGLE-3 very well, and SpecForge is meant for SGLang anyway).
    • Set the --speculative-draft-model-path flag in your SGLang launch command to seanmamasde/llama4-maverick-17B-128E-eagle3-sglang, and optionally add --speculative-num-steps 3 --speculative-eagle-topk 8 --speculative-num-draft-tokens 10 for best results (a full launch sketch follows this list).
    • You're good to go!
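
    A minimal launch sketch that puts the pieces together. The speculative flags are the ones recommended above; the base-model path, --tp 8, and --attention-backend fa3 (FlashAttention-3, per the inference settings above) are assumptions for an 8 × H200 node, so adjust them to your setup:

      # Launch SGLang with EAGLE-3 speculative decoding using this draft model.
      python3 -m sglang.launch_server \
        --model-path meta-llama/Llama-4-Maverick-17B-128E \
        --speculative-algorithm EAGLE3 \
        --speculative-draft-model-path seanmamasde/llama4-maverick-17B-128E-eagle3-sglang \
        --speculative-num-steps 3 \
        --speculative-eagle-topk 8 \
        --speculative-num-draft-tokens 10 \
        --attention-backend fa3 \
        --tp 8 \
        --trust-remote-code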