Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning Paper • 2509.22824 • Published 28 days ago • 20
VideoScore2: Think before You Score in Generative Video Evaluation Paper • 2509.22799 • Published 28 days ago • 24
VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use Paper • 2509.01055 • Published Sep 1 • 71
StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs Paper • 2505.20139 • Published May 26 • 19
MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research Paper • 2505.19955 • Published May 26 • 12
QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design Paper • 2505.16175 • Published May 22 • 41
General-Reasoner: Advancing LLM Reasoning Across All Domains Paper • 2505.14652 • Published May 20 • 23
CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation Paper • 2504.00043 • Published Mar 30 • 9
Small Models Struggle to Learn from Strong Reasoners Paper • 2502.12143 • Published Feb 17 • 39
ACECODER: Acing Coder RL via Automated Test-Case Synthesis Paper • 2502.01718 • Published Feb 3 • 29
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning Paper • 2502.01100 • Published Feb 3 • 18
On Memorization of Large Language Models in Logical Reasoning Paper • 2410.23123 • Published Oct 30, 2024 • 18
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks Paper • 2410.10563 • Published Oct 14, 2024 • 38