CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects
Abstract
A new benchmark, CodeFuse-CR-Bench, evaluates LLMs in repository-level code review with comprehensive, context-rich data and a novel evaluation framework.
Automated code review (CR) is a key application for Large Language Models (LLMs), but progress is hampered by a "reality gap": existing benchmarks evaluate models on isolated sub-tasks using simplified, context-poor data, which fails to reflect the holistic, context-rich nature of real-world CR. To bridge this gap, we introduce CodeFuse-CR-Bench, the first comprehensiveness-aware benchmark for repository-level CR evaluation. CodeFuse-CR-Bench comprises 601 high-quality instances from 70 Python projects covering nine Pull-Request (PR) problem domains, where each instance provides rich, multi-faceted context including the associated issue, PR details, and repository state, enabling end-to-end evaluation. Beyond superficial metrics, we also propose a novel evaluation framework that combines rule-based checks for location and syntax with model-based judgments of review quality. We present the first large-scale assessment of state-of-the-art LLMs on this comprehensive CR task. Our results establish crucial baselines and reveal that (1) no single LLM dominates all aspects of CR; (2) Gemini 2.5 Pro achieves the highest comprehensive performance; and (3) different LLMs exhibit varying robustness to redundant context. These findings highlight the necessity of holistic, multi-dimensional evaluation and provide actionable insights for advancing truly intelligent yet practical CR assistants.
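To make the two-part evaluation framework concrete, the sketch below shows one way rule-based location/syntax checks can be combined with a model-based judgment of review quality. This is a minimal illustration only: the names (`ReviewComment`, `rule_based_checks`, the 1-to-5 judge prompt) and the gating of the quality score on the rule checks are assumptions, not the benchmark's actual implementation.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class ReviewComment:
    file_path: str                       # file the comment targets
    line: int                            # line number the comment targets
    body: str                            # natural-language review text
    suggested_patch: str | None = None   # optional code suggestion


def rule_based_checks(comment: ReviewComment,
                      changed_lines: dict[str, set[int]]) -> dict[str, bool]:
    """Rule-based checks: does the comment point at a line the PR actually touches,
    and does any suggested patch at least parse as valid Python?"""
    location_ok = comment.line in changed_lines.get(comment.file_path, set())
    syntax_ok = True
    if comment.suggested_patch:
        try:
            compile(comment.suggested_patch, "<suggestion>", "exec")
        except SyntaxError:
            syntax_ok = False
    return {"location": location_ok, "syntax": syntax_ok}


def model_based_judgment(comment: ReviewComment, reference_review: str, judge) -> float:
    """Model-based judgment: an LLM judge scores review quality against the human
    reference on a 1-5 scale. `judge` is any callable mapping a prompt to text."""
    prompt = (
        "Rate the candidate code-review comment from 1 (useless) to 5 (excellent), "
        f"given the human reference.\nReference: {reference_review}\n"
        f"Candidate: {comment.body}\nScore:"
    )
    return float(judge(prompt).strip())


def evaluate(comment, changed_lines, reference_review, judge) -> dict:
    """Combine both signals; here quality is only scored if the rule checks pass."""
    checks = rule_based_checks(comment, changed_lines)
    quality = (model_based_judgment(comment, reference_review, judge)
               if all(checks.values()) else 0.0)
    return {**checks, "quality": quality}
```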
Community
We present CodeFuse-CR-Bench, the first comprehensiveness-aware benchmark for repository-level code review (CR) evaluation, comprising 601 high-quality instances from 70 Python projects. Each instance provides rich, multi-faceted context including the associated issue, pull request (PR) details, and repository state, enabling end-to-end evaluation. We also propose a novel evaluation framework that combines rule-based checks for location and syntax with model-based judgments of review quality. Extensive experiments reveal that (1) no single LLM dominates all aspects of CR; (2) Gemini 2.5 Pro achieves the highest comprehensive performance; and (3) different LLMs exhibit varying robustness to redundant context.
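As a rough illustration of the "rich, multi-faceted context" each instance carries, the sketch below models one instance and assembles it into a single end-to-end review prompt. The dataclass fields and the `build_review_prompt` helper are hypothetical and only indicative of the kinds of context described above; they are not the released dataset's schema.

```python
from dataclasses import dataclass, field


@dataclass
class CRBenchInstance:
    """Hypothetical shape of one benchmark instance; field names are illustrative."""
    repo: str                     # e.g. "owner/project" on GitHub
    base_commit: str              # repository state the PR is reviewed against
    problem_domain: str           # one of the nine PR problem domains
    issue_title: str              # associated issue, if linked
    issue_body: str
    pr_title: str
    pr_description: str
    diff: str                     # unified diff of the PR
    reference_reviews: list = field(default_factory=list)  # human review comments


def build_review_prompt(inst: CRBenchInstance) -> str:
    """Assemble the multi-faceted context into one end-to-end review prompt."""
    return (
        f"Repository: {inst.repo} @ {inst.base_commit}\n"
        f"Issue: {inst.issue_title}\n{inst.issue_body}\n\n"
        f"Pull request: {inst.pr_title}\n{inst.pr_description}\n\n"
        f"Diff:\n{inst.diff}\n\n"
        "Write repository-level code-review comments for this change."
    )
```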
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Benchmarking and Studying the LLM-based Code Review (2025)
- LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering (2025)
- RefactorCoderQA: Benchmarking LLMs for Multi-Domain Coding Question Solutions in Cloud and Edge Deployment (2025)
- SWE-QA: Can Language Models Answer Repository-level Code Questions? (2025)
- RepoDebug: Repository-Level Multi-Task and Multi-Language Debugging Evaluation of Large Language Models (2025)
- MRG-Bench: Evaluating and Exploring the Requirements of Context for Repository-Level Code Generation (2025)
- A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code (2025)