T-pro-it-2.1-FP8

Main BF16 model: t-tech/T-pro-it-2.1

🚨 Users are advised to exercise caution and are responsible for any additional training and oversight required to ensure the model's responses meet acceptable ethical and safety standards. The responsibility for incorporating this model into industrial or commercial solutions lies entirely with those who choose to deploy it.

T-pro-it-2.1-FP8 is a fine-grained FP8-quantized version of T-pro-it-2.1 (built on the Qwen 3 family). It delivers near-identical quality with roughly half the memory footprint and faster inference.

Description

T-pro-it-2.1 is an efficient Russian-language model built on the Qwen 3 model family, with improved instruction-following and tool-calling capabilities compared to T-pro-it-2.0. It outperforms Qwen3-32B in tool-calling scenarios, which is essential for agentic applications, and is built for both general tasks and complex workflows.

NOTE: This model supports only non-thinking mode and does not generate <think></think> blocks in its output. Specifying enable_thinking=False is therefore no longer required.
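For example, here is a minimal generation sketch with transformers, assuming a GPU with FP8 support and a recent transformers release; the prompt and generation settings are illustrative, not prescribed by this card:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "t-tech/T-pro-it-2.1-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Напиши короткое стихотворение о программировании."}]
# No enable_thinking argument is needed; the chat template defaults to non-thinking mode.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)

# The decoded response contains no <think></think> block.
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```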

📊 Benchmarks

| Model            | Ru Arena Hard | ruIFeval* | enIFeval* | enBFCL | ruBFCL | Tau2 | ACEBench |
|------------------|---------------|-----------|-----------|--------|--------|------|----------|
| T-pro-it-2.1     | 93.8          | 80.7      | 78.4      | 72.3   | 66.0   | 37.6 | 73.6     |
| T-pro-it-2.1-FP8 | 93.4          | 80.7      | 78.0      | 72.3   | 65.7   | 35.2 | 72.7     |

* The IFeval metric is the mean of four values: prompt-level and instruction-level accuracy under both strict and loose evaluation.

Note on FP8

For convenience and performance, we provide an FP8-quantized checkpoint of T-pro-it-2.1, whose name ends with -FP8. The quantization method is fine-grained FP8 quantization with a block size of 128. You can find more details in the quantization_config field in config.json.
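For example, you can inspect these settings without downloading the weights; the field names in the comment below are illustrative, and config.json itself is the authoritative source:

```python
import json

from huggingface_hub import hf_hub_download

# Fetch only config.json from the Hub and print the quantization settings.
config_path = hf_hub_download("t-tech/T-pro-it-2.1-FP8", "config.json")
with open(config_path) as f:
    config = json.load(f)

print(json.dumps(config["quantization_config"], indent=2))
# Illustrative shape of the output for fine-grained FP8 with block size 128:
# {
#   "activation_scheme": "dynamic",
#   "fmt": "e4m3",
#   "quant_method": "fp8",
#   "weight_block_size": [128, 128]
# }
```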

You can use the T-pro-it-2.1-FP8 model with several inference frameworks, including transformers, sglang, and vllm, just as you would the original bfloat16 model. However, please pay attention to the following known issues:

  • transformers:
    • There are currently issues with the "fine-grained fp8" method in transformers for distributed inference. You may need to set the environment variable CUDA_LAUNCH_BLOCKING=1 when multiple devices are used for inference.
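As an illustration, here is a minimal offline-inference sketch with vLLM; the engine arguments and sampling parameters are illustrative, and tensor_parallel_size should match your GPU count:

```python
from vllm import LLM, SamplingParams

# vLLM picks up the FP8 settings from the checkpoint's quantization_config.
llm = LLM(model="t-tech/T-pro-it-2.1-FP8", tensor_parallel_size=2, max_model_len=8192)

messages = [{"role": "user", "content": "Briefly explain what FP8 quantization is."}]
sampling = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

outputs = llm.chat(messages, sampling)
print(outputs[0].outputs[0].text)
```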
Model size: 33B params (BF16 / F8_E4M3 tensors)
Base model: Qwen/Qwen3-32B