T-pro-it-2.1-FP8

Main BF16 model: t-tech/T-pro-it-2.1

🚨 Users are advised to exercise caution and are responsible for any additional training and oversight required to ensure the model's responses meet acceptable ethical and safety standards. The responsibility for incorporating this model into industrial or commercial solutions lies entirely with those who choose to deploy it.

T-pro-it-2.1-FP8 is a fine-grained FP8-quantized version of T-pro-it-2.1 (built on the Qwen 3 family). It delivers near-identical quality with roughly half the memory footprint and faster inference.

Description

T-pro-it-2.1 is an efficient Russian-language model built on the Qwen 3 model family, with improved instruction-following and tool-calling capabilities compared to T-pro-it-2.0. It outperforms Qwen3-32B in tool-calling scenarios, which is essential for agentic applications, and is built for both general tasks and complex workflows.

NOTE: This model supports only non-thinking mode and does not generate <think></think> blocks in its output. Specifying enable_thinking=False is therefore no longer required.
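For example, here is a minimal generation sketch with transformers, assuming a GPU with FP8 support and a recent transformers release; the prompt and generation settings are illustrative, not prescribed by this card:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "t-tech/T-pro-it-2.1-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Напиши короткое стихотворение о программировании."}]
# No enable_thinking argument is needed; the chat template defaults to non-thinking mode.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)

# The decoded response contains no <think></think> block.
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```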

📊 Benchmarks

| Model            | Ru Arena Hard | ruIFeval* | enIFeval* | enBFCL | ruBFCL | Tau2 | ACEBench |
|------------------|---------------|-----------|-----------|--------|--------|------|----------|
| T-pro-it-2.1     | 93.8          | 80.7      | 78.4      | 72.3   | 66.0   | 37.6 | 73.6     |
| T-pro-it-2.1-FP8 | 93.4          | 80.7      | 78.0      | 72.3   | 65.7   | 35.2 | 72.7     |

* The IFeval metric is the mean of four values: prompt-level and instruction-level accuracy under both strict and loose evaluation.

Note on FP8

For convenience and performance, we provide an FP8-quantized checkpoint of T-pro-it-2.1, whose name ends with -FP8. The quantization method is fine-grained FP8 quantization with a block size of 128. You can find more details in the quantization_config field in config.json.
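For example, you can inspect these settings without downloading the weights; the field names in the comment below are illustrative, and config.json itself is the authoritative source:

```python
import json

from huggingface_hub import hf_hub_download

# Fetch only config.json from the Hub and print the quantization settings.
config_path = hf_hub_download("t-tech/T-pro-it-2.1-FP8", "config.json")
with open(config_path) as f:
    config = json.load(f)

print(json.dumps(config["quantization_config"], indent=2))
# Illustrative shape of the output for fine-grained FP8 with block size 128:
# {
#   "activation_scheme": "dynamic",
#   "fmt": "e4m3",
#   "quant_method": "fp8",
#   "weight_block_size": [128, 128]
# }
```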

You can use the T-pro-it-2.1-FP8 model with several inference frameworks, including transformers, sglang, and vllm, just as you would the original bfloat16 model. However, please pay attention to the following known issues:

  • transformers:
    • There are currently issues with the "fine-grained fp8" method in transformers for distributed inference. You may need to set the environment variable CUDA_LAUNCH_BLOCKING=1 when multiple devices are used for inference.
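As an illustration, here is a minimal offline-inference sketch with vLLM; the engine arguments and sampling parameters are illustrative, and tensor_parallel_size should match your GPU count:

```python
from vllm import LLM, SamplingParams

# vLLM picks up the FP8 settings from the checkpoint's quantization_config.
llm = LLM(model="t-tech/T-pro-it-2.1-FP8", tensor_parallel_size=2, max_model_len=8192)

messages = [{"role": "user", "content": "Briefly explain what FP8 quantization is."}]
sampling = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

outputs = llm.chat(messages, sampling)
print(outputs[0].outputs[0].text)
```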
Model size: 33B params (BF16 / F8_E4M3 tensors)
Base model: Qwen/Qwen3-32B