πŸ€— Hugging Face   |   πŸ€– ModelScope    |   πŸ™ Experience Now

Introduction

Ling-1T is the first flagship non-thinking model in the Ling 2.0 series, featuring 1 trillion total parameters with β‰ˆ 50 billion active parameters per token. Built on the Ling 2.0 architecture, Ling-1T is designed to push the limits of efficient reasoning and scalable cognition.

Pre-trained on 20 trillion+ high-quality, reasoning-dense tokens, Ling-1T-base supports up to 128K context length and adopts an evolutionary chain-of-thought (Evo-CoT) process across mid-training and post-training. This curriculum greatly enhances the model’s efficiency and reasoning depth, allowing Ling-1T to achieve state-of-the-art performance on multiple complex reasoning benchmarksβ€”balancing accuracy and efficiency.

Flagship-Level Efficient Reasoning

We comprehensively evaluated Ling-1T against leading flagship models, including both open-source giants (e.g., DeepSeek-V3.1-Terminus, Kimi-K2-Instruct-0905) and closed-source APIs (GPT-5-main, Gemini-2.5-Pro). Across code generation, software development, competition-level mathematics, professional math, and logical reasoning, Ling-1T consistently demonstrates superior complex reasoning ability and overall advantage.

In the AIME 25 benchmark, Ling-1T extends the Pareto frontier of reasoning accuracy vs. reasoning length, showcasing its strength in β€œefficient thinking and precise reasoning.”

Aesthetic Understanding and Front-End Generation

Ling-1T excels in visual reasoning and front-end code generation tasks, combining deep semantic understanding with precise code synthesis. We introduce a hybrid Syntax–Function–Aesthetics reward mechanism, enabling the model to not only generate correct and functional code but also demonstrate a refined sense of visual aesthetics. On ArtifactsBench, Ling-1T ranks first among open-source models, and the benchmark visualizations in this card were, in fact, generated by Ling-1T itself.
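
To make the idea concrete, a hybrid reward of this kind can be viewed as a weighted blend of independent scorers. The sketch below is an illustration only: the scorer interfaces and weights are assumptions, not the reward model actually used to train Ling-1T.

# Illustrative sketch of a hybrid Syntax-Function-Aesthetics reward.
# The scorer callables and weights are assumptions for illustration only,
# not the actual reward model used to train Ling-1T.
def hybrid_reward(code: str,
                  syntax_score,       # e.g., does the code parse / lint cleanly? -> [0, 1]
                  function_score,     # e.g., fraction of functional test cases passed -> [0, 1]
                  aesthetics_score,   # e.g., a learned judge of visual quality -> [0, 1]
                  w_syntax: float = 0.2,
                  w_function: float = 0.5,
                  w_aesthetics: float = 0.3) -> float:
    """Combine three scorers into a single scalar reward."""
    return (w_syntax * syntax_score(code)
            + w_function * function_score(code)
            + w_aesthetics * aesthetics_score(code))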

Emergent Intelligence at Trillion-Scale

Scaling to the trillion-parameter level has revealed strong emergent reasoning and transfer capabilities. For example, in the BFCL V3 tool-use benchmark, Ling-1T achieves β‰ˆ 70% tool-call accuracy with only light instruction tuningβ€”despite having seen no large-scale trajectory data during training. Ling-1T can:

  • Interpret complex natural-language instructions
  • Transform abstract logic into functional visual components
  • Generate cross-platform compatible front-end code
  • Create stylistically controlled marketing copy and multi-lingual text

These capabilities form the foundation for general, collaborative human–AI intelligence, which we aim to advance together with the open-source community through Ling-1T’s release.

Pre-Training at Trillion Scale

The Ling 2.0 architecture was designed from the ground up for trillion-scale efficiency, guided by the Ling Scaling Law (arXiv:2507.17702). This ensures architectural and hyperparameter scalability even under 1e25–1e26 FLOPs of compute.
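
As a rough sanity check on that compute regime, the common C β‰ˆ 6 Β· N Β· D approximation (used here purely for illustration; it is not the Ling Scaling Law itself) applied to Ling-1T's active parameters and token count lands near the lower end of that range:

# Back-of-the-envelope compute estimate with the common C ~= 6 * N_active * D rule;
# illustration only, not the Ling Scaling Law formula.
active_params = 50e9   # ~50B active parameters per token
tokens = 20e12         # 20T+ pre-training tokens
flops = 6 * active_params * tokens
print(f"~{flops:.1e} FLOPs")   # ~6.0e+24, near the 1e25-1e26 regime cited above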

Key architectural innovations include:

  • 1T total / 50B active parameters with a 1/32 MoE activation ratio
  • MTP layers for enhanced compositional reasoning
  • Aux-loss-free, sigmoid-scoring expert routing with zero-mean updates
  • QK Normalization for fully stable convergence

Ling-1T is the largest FP8-trained foundation model known to date. FP8 mixed-precision training yields 15%+ end-to-end speedup, improved memory efficiency, and maintains ≀ 0.1% loss deviation from BF16 across 1T tokens. A fine-grained, heterogeneous 1F1B interleaved pipeline further boosts utilization by 40%+. System-level optimizationsβ€”fused kernels, communication scheduling, recomputation, checkpointing, simulation, and telemetryβ€”ensure stable trillion-scale training.
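
As a toy illustration of the FP8 (E4M3) format involved (not the training recipe itself), the snippet below casts a BF16 tensor to torch.float8_e4m3fn with a per-tensor scale and measures the round-trip error; it assumes PyTorch β‰₯ 2.1.

# Toy FP8 (E4M3) cast with a per-tensor scale; demonstrates the number format only,
# not the Ling-1T mixed-precision training recipe. Requires PyTorch >= 2.1.
import torch

x = torch.randn(4096, 4096, dtype=torch.bfloat16)

# Scale so the largest magnitude maps near the E4M3 maximum (448) before casting.
scale = 448.0 / x.abs().max().float()
x_fp8 = (x.float() * scale).to(torch.float8_e4m3fn)

# Dequantize and measure the relative round-trip error.
x_back = x_fp8.to(torch.float32) / scale
rel_err = (x_back - x.float()).norm() / x.float().norm()
print(f"relative round-trip error: {rel_err.item():.3%}")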

Pre-training used over 20T high-quality tokens, with > 40% reasoning-dense data in later stages. Mid-training introduced curated chain-of-thought corpora for β€œreasoning pre-activation”, improving downstream reasoning stability. A custom WSM (Warmup–Stable–Merge) LR scheduler (arXiv:2507.17634) with mid-train checkpoint merging simulates LR decay and boosts generalization.
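
The sketch below illustrates the Warmup–Stable part of such a schedule, with uniform checkpoint averaging standing in for the decay phase; the peak LR, warmup length, and uniform weighting are assumptions for illustration, not the settings from arXiv:2507.17634.

# Minimal sketch of a Warmup-Stable-Merge (WSM) style schedule: the LR warms up,
# then stays constant (no decay phase); generalization is instead recovered by
# merging (averaging) checkpoints from the stable phase. Constants are assumptions.

def wsm_lr(step: int, peak_lr: float = 3e-4, warmup_steps: int = 2000) -> float:
    if step < warmup_steps:
        return peak_lr * step / warmup_steps   # linear warmup
    return peak_lr                             # stable phase: constant LR

def merge_checkpoints(state_dicts):
    """Uniformly average parameter tensors from several stable-phase checkpoints."""
    return {name: sum(sd[name] for sd in state_dicts) / len(state_dicts)
            for name in state_dicts[0]}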

Post-Training and Evo-CoT Optimization

Built upon mid-training reasoning activation, post-training adopts Evo-CoT (Evolutionary Chain-of-Thought) for progressive reasoning enhancement under controllable cost. This approach continually expands the Pareto frontier of reasoning accuracy vs. efficiencyβ€”ideal for reflexive non-thinking models.

For reinforcement learning, we introduce LPO (Linguistics-Unit Policy Optimization), a novel sentence-level policy optimization method. Unlike GRPO (token-level) or GSPO (sequence-level) algorithms, LPO treats sentences as the natural semantic action units, enabling precise alignment between rewards and reasoning behavior. Empirically, LPO offers superior training stability and generalization across reasoning tasks.
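
The toy sketch below only illustrates the β€œsentence as action unit” idea: token log-probabilities are summed per sentence, and each sentence receives one importance ratio and one advantage, rather than one per token (GRPO) or one per full sequence (GSPO). The actual LPO objective, sentence segmentation, and clipping settings are not reproduced here; everything below is an assumption for illustration.

# Illustrative sentence-level clipped policy-gradient loss in the spirit of LPO.
# Not the actual LPO objective; the clip value and span convention are assumptions.
import torch

def sentence_level_loss(new_logp, old_logp, sentence_spans, sentence_advantages, clip=0.2):
    """
    new_logp, old_logp: per-token log-probs of the sampled response, shape [T]
    sentence_spans: list of (start, end) token indices, one span per sentence
    sentence_advantages: one scalar advantage per sentence
    """
    losses = []
    for (s, e), adv in zip(sentence_spans, sentence_advantages):
        # One importance ratio per sentence: sum the token log-probs inside the span.
        ratio = torch.exp(new_logp[s:e].sum() - old_logp[s:e].sum())
        # PPO-style clipping applied at the sentence level.
        losses.append(-torch.min(ratio * adv, ratio.clamp(1 - clip, 1 + clip) * adv))
    return torch.stack(losses).mean()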

Evaluation

Ling-1T has been extensively evaluated across knowledge, code, math, reasoning, agent, and alignment benchmarks. It currently stands as the best open-source flagship non-thinking model, rivaling closed-source APIs in complex reasoning while maintaining exceptional efficiency and interpretability.

Model Downloads

You can download Ling-1T from the following table. If you are located in mainland China, we also provide the model on ModelScope.cn to speed up the download process.

Model    | Context Length     | Download
Ling-1T  | 32K -> 128K (YaRN) | πŸ€— HuggingFace Β· πŸ€– ModelScope

Note: If you are interested in previous versions, please visit the past model collections on Hugging Face or ModelScope.

Quickstart

πŸš€ Try Online

You can experience Ling-1T online at: ZenMux

πŸ”Œ API Usage

You can also use Ling-1T through API calls:

from openai import OpenAI

# 1. Initialize the OpenAI client
client = OpenAI(
    # 2. Point the base URL to the ZenMux endpoint
    base_url="https://zenmux.ai/api/v1",
    # 3. Replace with the API Key from your ZenMux user console
    api_key="<your ZENMUX_API_KEY>",
)

# 4. Make a request
completion = client.chat.completions.create(
    # 5. Specify the model to use in the format "provider/model-name"
    model="inclusionai/ling-1t",
    messages=[
        {
            "role": "user",
            "content": "What is the meaning of life?"
        }
    ]
)

print(completion.choices[0].message.content)
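
If the endpoint supports streaming responses (the snippet follows the standard OpenAI client interface and reuses the client initialized above; whether ZenMux streams is an assumption), tokens can be consumed incrementally:

# Streaming variant of the request above (reuses `client` from the snippet above).
# Assumes the endpoint implements the standard OpenAI streaming interface.
stream = client.chat.completions.create(
    model="inclusionai/ling-1t",
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)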

πŸ€— Hugging Face Transformers

Here is a code snippet to show you how to use the chat model with transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inclusionAI/Ling-1T"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt", return_token_type_ids=False).to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

πŸ€– ModelScope

If you're in mainland China, we strongly recommend using our model from πŸ€– ModelScope.

Deployment

vLLM

vLLM supports both offline batched inference and launching an OpenAI-compatible API service for online inference.

Environment Preparation

pip install vllm==0.11.0

Offline Inference:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ling-1T")

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=16384)

llm = LLM(model="inclusionAI/Ling-1T", dtype='bfloat16', trust_remote_code=True)
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
outputs = llm.generate([text], sampling_params)
print(outputs[0].outputs[0].text)

Online Inference:

vllm serve inclusionAI/Ling-1T \
              --tensor-parallel-size 32 \
              --pipeline-parallel-size 1 \
              --trust-remote-code \
              --gpu-memory-utilization 0.90

# This is only an example, please adjust arguments according to your actual environment.
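
Once the server is up, it exposes an OpenAI-compatible endpoint; a minimal client call might look like the following (the base URL assumes the vLLM default of http://localhost:8000, so adjust it to your deployment):

# Minimal client for the OpenAI-compatible server started above.
# Base URL assumes the vLLM default of http://localhost:8000; adjust as needed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.chat.completions.create(
    model="inclusionAI/Ling-1T",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
)
print(completion.choices[0].message.content)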

To handle long context in vLLM using YaRN, follow these two steps:

  1. Add a rope_scaling field to the model's config.json file, for example:
{
  ...,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
  2. Pass the additional --max-model-len parameter to specify the desired maximum context length when starting the vLLM service (e.g., --max-model-len 131072 for the 4.0Γ— factor above, since 4.0 Γ— 32768 = 131072).

For detailed guidance, please refer to the vLLM instructions.

SGLang

Environment Preparation

We will submit our model to the official SGLang release later; for now, prepare the environment as follows:

pip3 install sglang==0.5.2rc0 sgl-kernel==0.3.7.post1

You can also use the Docker image:

docker pull lmsysorg/sglang:v0.5.2rc0-cu126

Then apply the provided patch to the sglang installation:

# patch command is needed, run `yum install -y patch` if needed
patch -d `python -c 'import sglang;import os; print(os.path.dirname(sglang.__file__))'` -p3 < inference/sglang/bailing_moe_v2.patch

Run Inference

SGLang now supports both BF16 and FP8 models; which one is used depends on the dtype of the model in ${MODEL_PATH}. Both use the same launch command:

  • Start server:
python -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 0.0.0.0 --port $PORT \
    --trust-remote-code \
    --attention-backend fa3

# This is only an example, please adjust arguments according to your actual environment.

MTP is supported for the base model but not yet for the chat model. To enable it, add the --speculative-algorithm NEXTN parameter to the server start command.

  • Client:
curl -s http://localhost:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'

More usage examples can be found here.

Limitations & Future Plans

While Ling-1T has made strong progress in efficient reasoning, cross-domain generalization, and training efficiency, several limitations remain:

  • GQA-based attention: stable for long-context reasoning but relatively costly. Future versions will adopt hybrid attention to improve efficiency.
  • Limited agentic ability: current model has room to grow in multi-turn interaction, long-term memory, and tool use.
  • Instruction and identity issues: occasional deviations or role confusion may occur; future updates will enhance alignment and consistency.

Future versions of Ling-1T will continue to evolve in architecture, reasoning, and alignment, advancing the series toward more general intelligence.

License

This code repository is licensed under the MIT License.
