NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents Paper • 2512.12730 • Published 14 days ago • 43
NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents Paper • 2512.12730 • Published 14 days ago • 43
How Far Are We from Genuinely Useful Deep Research Agents? Paper • 2512.01948 • Published 27 days ago • 53
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence Paper • 2511.18538 • Published Nov 23 • 276
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence Paper • 2511.18538 • Published Nov 23 • 276
MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity Paper • 2511.03146 • Published Nov 5 • 7
RLoop: An Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization Paper • 2511.04285 • Published Nov 6 • 7
MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs Paper • 2511.07250 • Published Nov 10 • 17
DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains Paper • 2511.10984 • Published Nov 14 • 4
Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds Paper • 2511.08892 • Published Nov 12 • 201
MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs Paper • 2511.07250 • Published Nov 10 • 17
VideoScore2: Think before You Score in Generative Video Evaluation Paper • 2509.22799 • Published Sep 26 • 25
Towards Personalized Deep Research: Benchmarks and Evaluations Paper • 2509.25106 • Published Sep 29 • 29
Flash-Searcher: Fast and Effective Web Agents via DAG-Based Parallel Execution Paper • 2509.25301 • Published Sep 29 • 19