# Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost

Paper · GitHub · Hugging Face Collection

## Metrics

| Metric/Model | Avg. | En-De SPA (%) | En-De $Acc^*_{eq}$ | En-Es SPA (%) | En-Es $Acc^*_{eq}$ | Ja-Zh SPA (%) | Ja-Zh $Acc^*_{eq}$ |
|---|---|---|---|---|---|---|---|
| QwQ 32B | 68.3 | 79.8 | 46.8 | 76.1 | 68.0 | 91.9 | 46.9 |
| + ThinMQM | 72.2 (+3.9) | 83.2 (+3.4) | 52.5 (+5.7) | 80.7 (+4.6) | 69.2 (+1.2) | 91.3 (−0.6) | 56.1 (+9.2) |
| R1-Distill-Llama-8B | 64.9 | 71.8 | 42.9 | 78.5 | 68.0 | 84.7 | 43.5 |
| + ThinMQM | 70.8 (+5.9) | 85.5 (+13.7) | 48.6 (+5.7) | 81.3 (+2.8) | 68.2 (+0.2) | 90.5 (+5.8) | 51.0 (+7.5) |
| R1-Distill-Qwen-7B | 61.1 | 67.3 | 42.9 | 61.0 | 68.0 | 83.8 | 43.5 |
| + ThinMQM | 69.8 (+8.7) | 84.5 (+17.2) | 48.5 (+5.6) | 77.8 (+16.8) | 68.0 (+0.0) | 89.0 (+5.2) | 51.3 (+7.8) |
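As a sanity check on the table, the Avg. column appears to be the unweighted mean of the six per-language-pair values (both SPA and $Acc^*_{eq}$); the sketch below verifies this for the QwQ 32B rows (the reported values are rounded to one decimal, so the means agree only to within rounding):

```python
def row_average(scores):
    # Unweighted mean over all six per-language-pair metric values
    # (SPA and Acc*_eq for En-De, En-Es, Ja-Zh).
    return sum(scores) / len(scores)

qwq = [79.8, 46.8, 76.1, 68.0, 91.9, 46.9]      # QwQ 32B row
thinmqm = [83.2, 52.5, 80.7, 69.2, 91.3, 56.1]  # + ThinMQM row

# Both means match the table's Avg. column (68.3 and 72.2) up to rounding.
print(row_average(qwq), row_average(thinmqm))
```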

## Model & Data Card
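ThinMQM-style evaluators produce MQM-style error annotations followed by a final quality score. The snippet below is a minimal, hypothetical sketch of how an evaluation prompt might be assembled and the score parsed from the model's output; the prompt template and the `Score:` extraction pattern are illustrative assumptions, not the paper's exact format.

```python
import re


def build_mqm_prompt(src_lang, tgt_lang, source, translation):
    # Hypothetical MQM-style evaluation prompt; the actual ThinMQM
    # template may differ from this sketch.
    return (
        f"Evaluate the following {src_lang}->{tgt_lang} translation "
        "using the MQM framework.\n"
        "List the errors, classify each as minor/major/critical, "
        "and end with a line 'Score: <number>'.\n\n"
        f"Source: {source}\n"
        f"Translation: {translation}\n"
    )


def extract_score(model_output):
    # Pull the final numeric score from the model's (possibly long)
    # reasoning output; returns None if no score line is found.
    match = re.search(r"Score:\s*(-?\d+(?:\.\d+)?)", model_output)
    return float(match.group(1)) if match else None
```

With such helpers, the prompt would be fed to the checkpoint (e.g. via `transformers`' standard causal-LM generation API) and the score recovered from the generated text.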

πŸ“ Citation

If you find our model, data, or evaluation code useful, please cite our paper:

```bibtex
@article{zhan2025thinmqm,
  title={Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost},
  author={Zhan, Runzhe and Huang, Zhihong and Yang, Xinyi and Chao, Lidia S. and Yang, Min and Wong, Derek F.},
  journal={ArXiv preprint},
  volume={2510.20780},
  year={2025},
  url={https://arxiv.org/abs/2510.20780},
}
```

## 📬 Contact

For questions, feedback, or collaboration opportunities, feel free to reach out:


## 📄 License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Model size: 33B params · Tensor type: BF16 (Safetensors)
