This repository contains only the AttnGate weights for the Qwen2.5-32B-Instruct model.
SeerAttention introduces learnable AttnGate modules to accelerate the computationally intensive prefill stage of long-context large language models (LLMs) via dynamic block-level sparsity. The AttnGates are trained in a parameter-efficient self-distillation framework, where they learn to mimic the 2D max-pooled attention patterns of the original frozen model, preserving its integrity while avoiding costly retraining. During inference, these gates generate block-sparse binary masks by applying a threshold or TopK selection to their learned soft scores, enabling efficient computation through a custom block-sparse FlashAttention kernel.
## Original GitHub Repo
https://github.com/microsoft/SeerAttention
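The two block-level ideas described above — the 2D max-pooled attention pattern used as the self-distillation target, and the thresholded binary mask used at inference — can be sketched as follows. This is a toy NumPy illustration under assumed names and a small block size, not the actual SeerAttention implementation (which uses a custom block-sparse FlashAttention kernel):

```python
import numpy as np

BLOCK = 4  # toy block size for illustration; real kernels use larger blocks

def maxpool2d_blocks(attn, block=BLOCK):
    """2D max-pool a dense attention map into per-block scores
    (the distillation target the AttnGates learn to mimic)."""
    n = attn.shape[0] // block
    return attn[:n * block, :n * block].reshape(n, block, n, block).max(axis=(1, 3))

def block_mask(scores, threshold):
    """Binarize soft block scores with a threshold, as done at inference."""
    return scores >= threshold

rng = np.random.default_rng(0)
attn = rng.random((16, 16))               # stand-in for an attention map
scores = maxpool2d_blocks(attn)           # (4, 4) block-level scores
mask = block_mask(scores, threshold=0.9)  # binary block-sparse mask
density = mask.mean()                     # fraction of blocks kept
```

The `density` computed here corresponds to the density values reported in the tables below: the fraction of attention blocks actually evaluated by the sparse kernel.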
## Evaluation Results
### PG19 PPL
Perplexity on PG19 at different attention densities (rows) and context lengths in tokens (columns).
| Density | 8192 | 16384 | 32768 |
|---|---|---|---|
| 0.1 | 8.11 | 7.76 | 7.72 |
| 0.2 | 7.85 | 7.62 | 7.62 |
| 0.3 | 7.77 | 7.58 | 7.59 |
| 0.4 | 7.75 | 7.57 | 7.58 |
| 0.5 | 7.73 | 7.56 | 7.57 |
| 1.0 | 7.72 | 7.55 | 7.57 |
### LongBench
| Task | 0-4k (Dense / Sparse) | 4-8k (Dense / Sparse) | 8k+ (Dense / Sparse) |
|---|---|---|---|
| hotpotqa | 74.73 / 75.73 | 66.92 / 67.28 | 66.05 / 65.59 |
| trec | 68.00 / 68.00 | 79.00 / 78.00 | 80.00 / 80.00 |
| 2wikimqa | 71.01 / 71.01 | 61.59 / 61.26 | 49.36 / 49.59 |
| multi_news | 23.60 / 23.37 | 21.09 / 21.12 | 20.55 / 20.55 |
| lcc | 58.20 / 58.84 | 52.76 / 50.60 | 53.98 / 54.57 |
| qasper | 50.23 / 50.25 | 38.80 / 38.72 | 38.48 / 39.22 |
| passage_count | 31.00 / 31.00 | 18.00 / 18.00 | 16.00 / 20.00 |
| passage_retrieval_en | 100.0 / 100.0 | 100.0 / 99.00 | 99.00 / 99.00 |
| triviaqa | 84.68 / 84.68 | 88.79 / 89.42 | 86.37 / 85.43 |
| samsum | 41.16 / 41.26 | 41.13 / 41.65 | 46.88 / 46.36 |
| gov_report | 29.90 / 30.09 | 30.70 / 30.91 | 29.35 / 29.46 |
| repobench-p | 42.98 / 42.90 | 32.73 / 33.25 | 36.82 / 35.37 |
| multifieldqa_en | 56.26 / 56.51 | 46.73 / 45.86 | 50.99 / 50.99 |
| average score | 56.29 / 56.43 | 52.17 / 51.93 | 51.83 / 52.01 |
| average density (sparse runs) | 0.895 | 0.682 | 0.409 |
### LongBenchV2 CoT Benchmark
All SeerAttention models run with threshold = 5e-4.
For the R1-distilled models, we remove the two-pass generation setup (think + summary) and instead directly ask the models to output the answer after thinking. The maximum generation length is set to 10240.
| Model | Overall | Easy | Hard | Short | Medium | Long |
|---|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct | 30.4 | 31.2 | 29.9 | 37.8 | 24.7 | 29.6 |
| SeerAttention-Llama-3.1-8B | 31.6 | 33.3 | 30.5 | 33.9 | 31.6 | 27.8 |
| Qwen2.5-14B-Instruct | 34.8 | 37.5 | 33.1 | 44.4 | 32.1 | 24.1 |
| SeerAttention-Qwen2.5-14B | 32.8 | 38.0 | 29.6 | 45.0 | 30.2 | 17.6 |
| Qwen2.5-32B-Instruct | 36.4 | 42.2 | 32.8 | 47.8 | 29.8 | 30.6 |
| SeerAttention-Qwen2.5-32B | 36.4 | 41.1 | 33.4 | 49.4 | 29.8 | 27.8 |
| DeepSeek-R1-Distill-Qwen-14B | 34.2 | 43.2 | 28.6 | 45.0 | 27.9 | 28.7 |
| SeerAttention-DeepSeek-R1-Distill-Qwen-14B | 31.6 | 35.9 | 28.9 | 41.7 | 26.0 | 25.9 |
| DeepSeek-R1-Distill-Qwen-32B | 37.2 | 42.7 | 33.8 | 47.2 | 35.8 | 23.1 |
| SeerAttention-DeepSeek-R1-Distill-Qwen-32B | 37.0 | 42.2 | 33.8 | 49.4 | 31.6 | 26.9 |
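The runs above use a fixed threshold (5e-4) rather than TopK. As a toy illustration of how the two mask-selection modes mentioned earlier differ — names and shapes are assumptions, not the repository's API:

```python
import numpy as np

def threshold_mask(scores, threshold):
    # Keep every block whose gate score clears the threshold;
    # the resulting density varies with the input.
    return scores >= threshold

def topk_mask(scores, k):
    # Keep a fixed number of blocks per query-block row;
    # the resulting density is constant by construction.
    mask = np.zeros_like(scores, dtype=bool)
    idx = np.argsort(scores, axis=-1)[:, -k:]
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

rng = np.random.default_rng(1)
scores = rng.random((8, 8))          # soft gate scores for 8 query blocks
m_thr = threshold_mask(scores, 0.5)  # data-dependent density
m_topk = topk_mask(scores, k=2)      # exactly 2 blocks kept per row
```

With a threshold, easy inputs can be sparser than hard ones (which is why the tables above report an average density per length bucket), while TopK pins the compute budget in advance.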