AblationBench
This is a collection of datasets used to evaluate language models on the task of ablation planning in empirical AI research.
ResearcherAblationBench is a benchmark that aims to help authors propose ablation plans based on a paper's written method section. It contains 83 AI conference papers, alongside the human-annotated ablations found in the original papers.
ai-coscientist/reviewer-ablation-bench
ReviewerAblationBench is a benchmark that aims to help reviewers find ablation experiments missing from a paper submission. It contains 350 ICLR submissions from 2023-2025, alongside official reviews that suggest missing ablation experiments.
ai-coscientist/researcher-ablation-judge-eval
ResearcherAblationJudgeEval is a benchmark that supports the automatic, LM-judge-based evaluation framework for ResearcherAblationBench. It contains 63 ablation plans, alongside human annotations of whether each plan is found among the ablations in the original papers.
ai-coscientist/reviewer-ablation-judge-eval
ReviewerAblationJudgeEval is a benchmark that supports the automatic, LM-judge-based evaluation framework for ReviewerAblationBench. It contains 60 missing ablation plans, alongside human annotations of whether each plan is found in one of the reviewers' official comments.
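
The two JudgeEval datasets pair each ablation plan with a human yes/no annotation, so one plausible way to use them is to score an LM judge's verdicts against those annotations. The following is a minimal sketch of that comparison, not the benchmark's official metric: the function name `judge_agreement`, the `JudgeAgreement` container, and the choice of precision, recall, and F1 are assumptions made for illustration, and the inputs are plain Python lists rather than the datasets' actual columns.

```python
from dataclasses import dataclass


@dataclass
class JudgeAgreement:
    """Hypothetical container for judge-vs-human agreement scores."""
    precision: float
    recall: float
    f1: float


def judge_agreement(human: list[bool], judge: list[bool]) -> JudgeAgreement:
    """Score an LM judge's yes/no verdicts against human annotations.

    human[i] is True when annotators marked plan i as found in the reference
    (the original paper's ablations or the reviewers' comments);
    judge[i] is the LM judge's verdict for the same plan.
    """
    if len(human) != len(judge):
        raise ValueError("human and judge must have the same length")

    # Count agreements and disagreements on the positive class.
    tp = sum(h and j for h, j in zip(human, judge))
    fp = sum((not h) and j for h, j in zip(human, judge))
    fn = sum(h and (not j) for h, j in zip(human, judge))

    # Fall back to 0.0 when a denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return JudgeAgreement(precision, recall, f1)
```

For example, calling `judge_agreement([True, True, False, False], [True, False, True, False])` compares four toy verdicts in which the judge agrees with the annotators on two plans and disagrees on the other two.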