This EAGLE-3 draft model was trained with SpecForge; you can find the exact training parameters here: my GitHub.

I trained this model because, when I was testing the NVIDIA one, it didn't perform as well as I expected, and I noticed that the original EAGLE-3 author recommends using SpecForge for draft-model training in the README. So here it is.

Training Settings

  • Dataset: the built-in dataset pipeline provided by SpecForge; in this case, I used this ShareGPT one
  • All other settings are the same as in the examples/run_llama4_eagle3_online.sh script provided by SpecForge; only ttt-length was reduced from 7 to 6.
  • This model was trained on 8 NVIDIA H200 GPUs for 7 days (7 epochs).
  • As a side note, ttt-length=6 results in an OOM error when running the second consecutive epoch, so I simply restarted (and resumed) after every epoch; each epoch takes a full 24 hours anyway (see the sketch below).
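
A minimal sketch of the per-epoch restart workaround mentioned above: it just wraps SpecForge's example script in a loop. How the script is told to train a single epoch and resume from the latest checkpoint is an assumption here; adapt it to whatever examples/run_llama4_eagle3_online.sh exposes in your SpecForge checkout.

  # Hypothetical wrapper: run one epoch per invocation to avoid the ttt-length=6
  # OOM that shows up on the second consecutive epoch. Assumes the example script
  # has been edited to use --ttt-length 6, train exactly one epoch per run, and
  # resume from the latest checkpoint on restart.
  for epoch in 1 2 3 4 5 6 7; do
    bash examples/run_llama4_eagle3_online.sh
  done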

Inference Settings & Benchmarks

  • Inference Framework: SGLang version 0.4.9.post3

  • Inference Backend: FlashAttention-3 (as of now, FlashInfer is still not available for Llama 4)

  • Hardware: 8 × NVIDIA H200 GPUs

  • Workflow:

    • First, I searched for the best speculative-decoding settings (tuned on the NVIDIA model; my model uses the same parameters), using the benchmark script provided by SGLang:

      python scripts/playground/bench_speculative.py \
        --model-path ../cache/meta-llama/Llama-4-Maverick-17B-128E \
        --speculative-draft-model-path nvidia/Llama-4-Maverick-17B-128E-Eagle3 \
        --steps <different step sizes, e.g., '3 4 5'> \
        --topk <different top k, e.g., '8 10 12'> \
        --num_draft_tokens <different draft-token counts, e.g., '12 24 36'> \
        --batch-size <different batch sizes, e.g., '1 2 4'> \
        --trust-remote-code
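
    • For reference, here is a concrete instantiation of the sweep; the values below are inferred from the result rows shown next, so treat them as illustrative rather than the exact grid that was run:

      # Hypothetical sweep covering the (steps, topk, num_draft_tokens, batch_size)
      # combinations that show up in the results below.
      python scripts/playground/bench_speculative.py \
        --model-path ../cache/meta-llama/Llama-4-Maverick-17B-128E \
        --speculative-draft-model-path nvidia/Llama-4-Maverick-17B-128E-Eagle3 \
        --steps '2 3 4 5' \
        --topk '8 12' \
        --num_draft_tokens '10 24 36' \
        --batch-size '1 2 4' \
        --trust-remote-code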
      
    • These are the grid-search results:

      ...
      {"batch_size": 1, "steps": 3, "topk": 8, "num_draft_tokens": 10, "acc_length": 1.042, "step_time": 0.00984, "speed": 105.927, "completion_tokens": 512.0}
      {"batch_size": 1, "steps": 4, "topk": 8, "num_draft_tokens": 10, "acc_length": 1.042, "step_time": 0.01037, "speed": 100.455, "completion_tokens": 512.0}
      {"batch_size": 1, "steps": 5, "topk": 8, "num_draft_tokens": 10, "acc_length": 1.042, "step_time": 0.01084, "speed": 96.094, "completion_tokens": 512.0}
      {"batch_size": 2, "steps": 3, "topk": 8, "num_draft_tokens": 10, "acc_length": 1.044, "step_time": 0.01112, "speed": 93.896, "completion_tokens": 512.0}
      {"batch_size": 2, "steps": 4, "topk": 8, "num_draft_tokens": 10, "acc_length": 1.044, "step_time": 0.01168, "speed": 89.335, "completion_tokens": 512.0}
      {"batch_size": 2, "steps": 5, "topk": 8, "num_draft_tokens": 10, "acc_length": 1.044, "step_time": 0.01217, "speed": 85.767, "completion_tokens": 512.0}
      {"batch_size": 4, "steps": 3, "topk": 8, "num_draft_tokens": 10, "acc_length": 1.041, "step_time": 0.01341, "speed": 77.611, "completion_tokens": 512.0}
      {"batch_size": 4, "steps": 4, "topk": 8, "num_draft_tokens": 10, "acc_length": 1.041, "step_time": 0.01406, "speed": 74.038, "completion_tokens": 512.0}
      {"batch_size": 4, "steps": 5, "topk": 8, "num_draft_tokens": 10, "acc_length": 1.041, "step_time": 0.01463, "speed": 71.149, "completion_tokens": 512.0}
      {"batch_size": 1, "steps": 2, "topk": 12, "num_draft_tokens": 24, "acc_length": 1.053, "step_time": 0.01064, "speed": 98.92, "completion_tokens": 512.0}
      {"batch_size": 1, "steps": 3, "topk": 12, "num_draft_tokens": 24, "acc_length": 1.052, "step_time": 0.0108, "speed": 97.34, "completion_tokens": 512.0}
      {"batch_size": 1, "steps": 3, "topk": 12, "num_draft_tokens": 36, "acc_length": 1.052, "step_time": 0.01172, "speed": 89.79, "completion_tokens": 512.0}
      ...
      

      From the results above,

      • batch_size=1
      • steps=3
      • topk=8
      • num_draft_tokens=10

      seem to yield the best results (a mechanical way to rank the rows is sketched below).
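
      If you collect the printed JSON lines into a file (results.jsonl is a hypothetical name here), jq can pick the fastest configuration mechanically:

      # Select the row with the highest decode speed (tok/s).
      jq -s 'max_by(.speed)' results.jsonl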

  • Benchmark numbers:

    Item         Base Model    Eagle3 (NVDA)   Eagle3 (Mine)
    Throughput   44.8 tok/s    71.2 tok/s      105.93 tok/s
    Mean TTFT    161.49 ms     51.74 ms        46.81 ms
    Mean TPOT    5.16 ms       4.15 ms         2.48 ms

    As the table above shows, it came as a surprise to me that my model was considerably faster than the NVIDIA one; at the same time, I am concerned that I might have done something wrong that biased the benchmark itself. Even if the results are valid, I have no idea why my model turns out to be faster. (P.S. If you know the reason, please don't hesitate to reach out to me.) (I forgot to screenshot the terminal output when the inference runs finished, so you'll just have to trust the table above.)
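
    The metrics in the table (output throughput, mean TTFT, mean TPOT) are the ones SGLang's serving benchmark reports, so a run along the following lines against a launched server should produce comparable numbers. This is a sketch rather than the exact command used; the dataset choice and prompt count are assumptions:

      # Benchmark a running SGLang server; adjust --num-prompts, --request-rate,
      # and the dataset to match the workload you care about.
      python3 -m sglang.bench_serving \
        --backend sglang \
        --dataset-name random \
        --num-prompts 100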

  • Using this model

    • First, install the inference framework of your choice (either vLLM or SGLang should be fine, but as of 19-Aug-2025, vLLM still doesn't support Llama 4 with EAGLE-3 very well, and SpecForge is meant for SGLang anyway).
    • Set the --speculative-draft-model-path flag in your SGLang launch command to seanmamasde/llama4-maverick-17B-128E-eagle3-sglang, and optionally add --speculative-num-steps 3 --speculative-eagle-topk 8 --speculative-num-draft-tokens 10 for best results (a full launch sketch follows this list).
    • You're good to go!
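
    A minimal launch sketch that puts the pieces together. The speculative flags are the ones recommended above; the base-model path, --tp 8, and --attention-backend fa3 (FlashAttention-3, per the inference settings above) are assumptions for an 8 × H200 node, so adjust them to your setup:

      # Launch SGLang with EAGLE-3 speculative decoding using this draft model.
      python3 -m sglang.launch_server \
        --model-path meta-llama/Llama-4-Maverick-17B-128E \
        --speculative-algorithm EAGLE3 \
        --speculative-draft-model-path seanmamasde/llama4-maverick-17B-128E-eagle3-sglang \
        --speculative-num-steps 3 \
        --speculative-eagle-topk 8 \
        --speculative-num-draft-tokens 10 \
        --attention-backend fa3 \
        --tp 8 \
        --trust-remote-code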