Benchmarking LLMs for Political Science: A United Nations Perspective Paper • 2502.14122 • Published Feb 19 • 2
IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval Paper • 2503.04644 • Published Mar 6 • 21
Toward Stable and Consistent Evaluation Results: A New Methodology for Base Model Evaluation Paper • 2503.00812 • Published Mar 2
Deceptive Humor: A Synthetic Multilingual Benchmark Dataset for Bridging Fabricated Claims with Humorous Content Paper • 2503.16031 • Published Mar 20 • 3
FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents Paper • 2504.13128 • Published Apr 17 • 7
Cost-of-Pass: An Economic Framework for Evaluating Language Models Paper • 2504.13359 • Published Apr 17 • 4
A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models Paper • 2505.07591 • Published May 12 • 11
On Robustness and Reliability of Benchmark-Based Evaluation of LLMs Paper • 2509.04013 • Published Sep 4 • 4