CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects
Abstract
A new benchmark, CodeFuse-CR-Bench, evaluates LLMs in repository-level code review with comprehensive, context-rich data and a novel evaluation framework.
Automated code review (CR) is a key application for Large Language Models (LLMs), but progress is hampered by a "reality gap": existing benchmarks evaluate models on isolated sub-tasks using simplified, context-poor data, which fails to reflect the holistic, context-rich nature of real-world CR. To bridge this gap, we introduce CodeFuse-CR-Bench, the first comprehensiveness-aware benchmark for repository-level CR evaluation. CodeFuse-CR-Bench comprises 601 high-quality instances from 70 Python projects covering nine Pull-Request (PR) problem domains, where each instance provides rich, multi-faceted context including the associated issue, PR details, and repository state, enabling end-to-end evaluation. Beyond superficial metrics, we also propose a novel evaluation framework that combines rule-based checks for location and syntax with model-based judgments of review quality. We present the first large-scale assessment of state-of-the-art LLMs on this comprehensive CR task. Our results establish crucial baselines and reveal that (1) no single LLM dominates all aspects of CR; (2) Gemini 2.5 Pro achieves the highest comprehensive performance; and (3) different LLMs exhibit varying robustness to redundant context. These findings highlight the necessity of holistic, multi-dimensional evaluation and provide actionable insights for advancing truly intelligent yet practical CR assistants.
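To make the two-part evaluation framework concrete, the sketch below shows one way rule-based location/syntax checks can be combined with a model-based judgment of review quality. This is a minimal illustration only: the names (`ReviewComment`, `rule_based_checks`, the 1-to-5 judge prompt) and the gating of the quality score on the rule checks are assumptions, not the benchmark's actual implementation.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class ReviewComment:
    file_path: str                       # file the comment targets
    line: int                            # line number the comment targets
    body: str                            # natural-language review text
    suggested_patch: str | None = None   # optional code suggestion


def rule_based_checks(comment: ReviewComment,
                      changed_lines: dict[str, set[int]]) -> dict[str, bool]:
    """Rule-based checks: does the comment point at a line the PR actually touches,
    and does any suggested patch at least parse as valid Python?"""
    location_ok = comment.line in changed_lines.get(comment.file_path, set())
    syntax_ok = True
    if comment.suggested_patch:
        try:
            compile(comment.suggested_patch, "<suggestion>", "exec")
        except SyntaxError:
            syntax_ok = False
    return {"location": location_ok, "syntax": syntax_ok}


def model_based_judgment(comment: ReviewComment, reference_review: str, judge) -> float:
    """Model-based judgment: an LLM judge scores review quality against the human
    reference on a 1-5 scale. `judge` is any callable mapping a prompt to text."""
    prompt = (
        "Rate the candidate code-review comment from 1 (useless) to 5 (excellent), "
        f"given the human reference.\nReference: {reference_review}\n"
        f"Candidate: {comment.body}\nScore:"
    )
    return float(judge(prompt).strip())


def evaluate(comment, changed_lines, reference_review, judge) -> dict:
    """Combine both signals; here quality is only scored if the rule checks pass."""
    checks = rule_based_checks(comment, changed_lines)
    quality = (model_based_judgment(comment, reference_review, judge)
               if all(checks.values()) else 0.0)
    return {**checks, "quality": quality}
```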
Community
We present CodeFuse-CR-Bench, the first comprehensiveness-aware benchmark for repository-level code review (CR) evaluation, comprising 601 high-quality instances from 70 Python projects. Each instance provides rich, multi-faceted context including the associated issue, pull request (PR) details, and repository state, enabling end-to-end evaluation. We also propose a novel evaluation framework that combines rule-based checks for location and syntax with model-based judgments of review quality. Extensive experiments reveal that (1) no single LLM dominates all aspects of CR; (2) Gemini 2.5 Pro achieves the highest comprehensive performance; and (3) different LLMs exhibit varying robustness to redundant context.
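As a rough illustration of the "rich, multi-faceted context" each instance carries, the sketch below models one instance and assembles it into a single end-to-end review prompt. The dataclass fields and the `build_review_prompt` helper are hypothetical and only indicative of the kinds of context described above; they are not the released dataset's schema.

```python
from dataclasses import dataclass, field


@dataclass
class CRBenchInstance:
    """Hypothetical shape of one benchmark instance; field names are illustrative."""
    repo: str                     # e.g. "owner/project" on GitHub
    base_commit: str              # repository state the PR is reviewed against
    problem_domain: str           # one of the nine PR problem domains
    issue_title: str              # associated issue, if linked
    issue_body: str
    pr_title: str
    pr_description: str
    diff: str                     # unified diff of the PR
    reference_reviews: list = field(default_factory=list)  # human review comments


def build_review_prompt(inst: CRBenchInstance) -> str:
    """Assemble the multi-faceted context into one end-to-end review prompt."""
    return (
        f"Repository: {inst.repo} @ {inst.base_commit}\n"
        f"Issue: {inst.issue_title}\n{inst.issue_body}\n\n"
        f"Pull request: {inst.pr_title}\n{inst.pr_description}\n\n"
        f"Diff:\n{inst.diff}\n\n"
        "Write repository-level code-review comments for this change."
    )
```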
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Benchmarking and Studying the LLM-based Code Review (2025)
- LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering (2025)
- RefactorCoderQA: Benchmarking LLMs for Multi-Domain Coding Question Solutions in Cloud and Edge Deployment (2025)
- SWE-QA: Can Language Models Answer Repository-level Code Questions? (2025)
- RepoDebug: Repository-Level Multi-Task and Multi-Language Debugging Evaluation of Large Language Models (2025)
- MRG-Bench: Evaluating and Exploring the Requirements of Context for Repository-Level Code Generation (2025)
- A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code (2025)