Art of Focus: Page-Aware Sparse Attention and Ling 2.0’s Quest for Efficient Context Length Scaling

Community Article Published October 20, 2025

Test-Time Scaling (TTS) has emerged as a key trend for raising the capability ceiling of large models. However, the compute and memory cost of ultra-long-context inference grows rapidly with sequence length (full attention compute scales quadratically and the KV cache linearly), which makes attention computation and memory I/O the dominant bottlenecks.

To address this, we combine the high-sparsity Mixture of Experts (MoE) structure of the Ling 2.0 architecture with a sparse attention mechanism, yielding an architecture optimized for long-sequence decoding. Today, we are open-sourcing an efficient inference model built on this architecture, Ring-mini-sparse-2.0-exp, along with its high-performance implementation on the SGLang framework.

Thanks to the co-optimization of the architecture and the inference framework, this model achieves nearly 3x the throughput of the original Ring-mini-2.0 implementation in complex, long-sequence inference scenarios, while maintaining SOTA (state-of-the-art) performance across multiple challenging reasoning benchmarks. The result gives the open-source community a lightweight solution that balances efficient inference with strong context-processing capability.

[Figure: ling-sparse-1]

Ling 2.0 Sparse: A More Efficient Sparse Attention Architecture

[Figure: ling-sparse-2]

Ling 2.0 Sparse is an efficient sparse attention mechanism specifically engineered to address two major future trends in large language models: Context Length Scaling and Test-Time Scaling.

We draw inspiration from Mixture of Block Attention (MoBA) and employ block-wise sparse attention. The input Keys and Values are partitioned into blocks. Each query then selects its top-k blocks independently for each attention head and computes softmax attention only over the selected blocks, which significantly reduces computational overhead.
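The selection step described above can be sketched for a single query and a single head. This is a minimal illustration rather than the released kernel: it assumes mean pooling for the block representation and a plain dot product for block scoring, and it leaves out causal masking and any special handling of the query's own block. The function names and the toy inputs are made up for the example.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def block_sparse_attention(q, keys, values, block_size, top_k):
    """One query, one head: attend only to the top_k highest-scoring blocks."""
    # Partition token indices into contiguous blocks.
    blocks = [list(range(s, min(s + block_size, len(keys))))
              for s in range(0, len(keys), block_size)]
    # Score each block by the query's dot product with its mean-pooled key
    # (assumed scoring rule).
    scores = [dot(q, mean_pool([keys[i] for i in blk])) for blk in blocks]
    # Keep the top_k blocks and gather their token indices in original order.
    order = sorted(range(len(blocks)), key=lambda b: scores[b], reverse=True)
    chosen = sorted(order[:top_k])
    token_ids = [i for b in chosen for i in blocks[b]]
    # Scaled softmax attention restricted to the selected tokens.
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[i]) * scale for i in token_ids]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    out = [sum(w * values[i][d] for w, i in zip(weights, token_ids)) / total
           for d in range(len(values[0]))]
    return chosen, out

# Toy input: two blocks of two tokens each; the query aligns with block 0.
keys = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0]]
values = [[1.0], [1.0], [5.0], [5.0]]
chosen, out = block_sparse_attention([1.0, 0.0], keys, values,
                                     block_size=2, top_k=1)
```

With `top_k=1` only block 0 is selected, so the values stored in block 1 never enter the softmax.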

In addition, we combine the MoBA design with Grouped Query Attention (GQA), so that query heads within the same group share one top-k block selection. A single block read can then serve the attention calculation for multiple query heads, further reducing I/O overhead.
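A small sketch of how a group of query heads might share one selection. The article does not say how per-head scores are combined, so summing them across the group is an assumption made here for illustration; `shared_top_k_blocks` and `max_block_reads` are hypothetical helper names.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def shared_top_k_blocks(group_queries, block_keys, top_k):
    """Pick one set of blocks for all query heads that share a KV head.

    group_queries: one query vector per head in the group.
    block_keys:    one pooled key vector per block for the shared KV head.
    Scores are summed across the group (an assumed combination rule).
    """
    scores = [sum(dot(q, bk) for q in group_queries) for bk in block_keys]
    order = sorted(range(len(block_keys)), key=lambda b: scores[b], reverse=True)
    return sorted(order[:top_k])

def max_block_reads(num_q_heads, group_size, top_k, shared):
    """Upper bound on KV block reads per decode step.

    Without sharing, every query head may pick different blocks; with
    sharing, each group reads its top_k blocks once.
    """
    num_groups = num_q_heads // group_size
    return num_groups * top_k if shared else num_q_heads * top_k

# Two heads in one group, three candidate blocks.
selected = shared_top_k_blocks([[1.0, 0.0], [0.0, 1.0]],
                               [[1.0, 1.0], [-1.0, 0.0], [0.5, 0.0]],
                               top_k=2)
```

For a hypothetical layout of 16 query heads in groups of 4 with `top_k=8`, the bound drops from 128 block reads to 32 once the selection is shared.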

However, while the open-source MoBA implementation effectively accelerates the pre-fill stage, it does not accelerate the decode stage. This limitation stems from how block-wise sparse attention selects blocks: after the Keys and Values are partitioned into blocks, an aggregation operation (such as mean pooling) is required to generate the block representations used for selection.

For this method to be effective in the decode stage, the block representations generated during the pre-fill stage must be cached (similar to the KV Cache). Yet current mainstream inference frameworks (such as vLLM and SGLang) only support standard KV Cache storage and do not natively support an additional cache for block representations. If the block representations are recalculated in the decode stage instead of being reused, the corresponding original Keys must be read in full from the KV Cache. Since MoBA attention itself only needs a highly sparse subset of Keys, this full Key read introduces substantial redundant data access and significantly increases I/O overhead during the decode stage.
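To make the redundant access concrete, the following back-of-the-envelope sketch counts how many key-sized vectors one KV head would read per decode step, with and without cached block representations. All sizes are hypothetical and chosen only for illustration; values for the selected blocks are read in both cases and are not counted here.

```python
def keys_read_per_step(context_len, block_size, top_k, block_cache):
    """Key-sized vectors read by one KV head in one decode step.

    block_cache=True:  routing reads one cached representation per block.
    block_cache=False: routing re-pools every block, reading every key.
    """
    num_blocks = -(-context_len // block_size)  # ceiling division
    routing = num_blocks if block_cache else context_len
    # Attention only touches the keys of the selected blocks.
    attention = min(top_k * block_size, context_len)
    return routing + attention

# Hypothetical sizes: 64K-token context, 64-token blocks, top-16 selection.
with_cache = keys_read_per_step(65536, 64, 16, block_cache=True)
without_cache = keys_read_per_step(65536, 64, 16, block_cache=False)
```

With these made-up sizes, routing from cached block representations reads 2,048 vectors per step, versus 66,560 when the block representations are recomputed from the original Keys.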

To enable MoBA to achieve effective acceleration in the decode stage as well, we introduce a page-aware block cache, implemented in SGLang.

Page-aware Block Cache

Building on the paged attention design used by SGLang and vLLM, we construct a dedicated Block Cache for each KV cache page. The implementation is as follows:

  • Pre-fill Stage: The token sequence within each page is treated as a block. Its aggregate representation (e.g., a mean-pooled vector) is computed on the fly and stored in the Block Cache.
  • Decode Stage: The system queries the Block Cache (instead of the original KV cache) to retrieve the pre-computed block representations. Combined with the current query’s head-wise top-k routing, only the top-k associated pages are activated.
  • Unified Memory Management: The Block Cache and KV cache share the same page-table indexing mechanism. This ensures consistency in sparse routing, memory allocation, and eviction strategies at the system level, avoiding additional metadata overhead.
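The three points above can be sketched as a toy data structure. This is an illustrative model, not SGLang's actual memory pool: the class name, its fields, and the choice to keep a running key sum per page are assumptions made for the example, and it models a single sequence and a single KV head.

```python
class PagedBlockCache:
    """Toy paged KV cache with one pooled key per page (hypothetical layout)."""

    def __init__(self, page_size, head_dim):
        self.page_size = page_size
        self.head_dim = head_dim
        self.page_table = []   # logical page order for one sequence
        self.kv_pages = {}     # page id -> list of (key, value) pairs
        self.block_sum = {}    # page id -> running sum of the page's keys
        self.block_count = {}  # page id -> number of tokens in the page
        self._next_page = 0

    def append(self, key, value):
        """Store one token and update its page's block statistics."""
        last = self.page_table[-1] if self.page_table else None
        if last is None or self.block_count[last] == self.page_size:
            # Allocate a new page; the block entry reuses the same page id.
            last = self._next_page
            self._next_page += 1
            self.page_table.append(last)
            self.kv_pages[last] = []
            self.block_sum[last] = [0.0] * self.head_dim
            self.block_count[last] = 0
        self.kv_pages[last].append((key, value))
        self.block_sum[last] = [s + k for s, k in zip(self.block_sum[last], key)]
        self.block_count[last] += 1

    def block_repr(self, pid):
        """Mean-pooled key of a page, computed from the block statistics only."""
        return [s / self.block_count[pid] for s in self.block_sum[pid]]

    def select_pages(self, query, top_k):
        """Route with block representations; kv_pages is not read here."""
        scored = sorted(
            self.page_table,
            key=lambda pid: sum(q * b for q, b in zip(query, self.block_repr(pid))),
            reverse=True,
        )
        return sorted(scored[:top_k])

    def evict(self, pid):
        """Free a page: its KV data and block entry are released together."""
        self.page_table.remove(pid)
        del self.kv_pages[pid], self.block_sum[pid], self.block_count[pid]


cache = PagedBlockCache(page_size=2, head_dim=2)
for key in ([1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0], [0.0, 3.0]):
    cache.append(key, value=[0.0])
active = cache.select_pages([1.0, 0.0], top_k=1)
```

Because the block entry is keyed by the same page id as the KV data, allocation and eviction need no separate index, which is the property the unified memory management bullet describes.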

[Figure: ling-sparse-3]

[Figure: ling-sparse-4]

This design effectively resolves the I/O bottleneck of MoBA in the decode stage:

  • By reusing the pre-computed block representations, it eliminates the redundant memory access caused by dynamic re-computation.
  • By using the page table to dynamically mask out unactivated pages, it ensures that only the selected pages are loaded, which matches the page-level sparsity of block-wise sparse attention.

Thanks to the SGLang implementation, the advantage of Ring-mini-sparse-2.0-exp over the previous full softmax attention implementation grows with input and output length in both the pre-fill and decode stages. For ultra-long output sequences, inference can be up to 3x faster.

[Figure: ling-sparse-5]

Where to Find Us

We welcome you to visit our open-source repositories to download and use the model.

Ring-mini-sparse-2.0-exp
🤗 Hugging Face: https://huggingface.co/inclusionAI/Ring-mini-sparse-2.0-exp
🤖 ModelScope: https://modelscope.cn/models/inclusionAI/Ring-mini-sparse-2.0-exp
GitHub: https://github.com/inclusionAI/Ring-V2/tree/main/moba
SGLang PR: WIP
