Abstract
Large Language Models fine-tuned as scalable judges (JudgeLM) achieve state-of-the-art judging performance on open-ended benchmarks, supported by a comprehensive training dataset and a new judge benchmark, and improve judgment efficiency and accuracy.
Evaluating Large Language Models (LLMs) in open-ended scenarios is challenging because existing benchmarks and metrics cannot measure them comprehensively. To address this problem, we propose to fine-tune LLMs as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively in open-ended benchmarks. We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLM-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges, as well as a new benchmark for evaluating the judges. We train JudgeLM at three scales (7B, 13B, and 33B parameters) and conduct a systematic analysis of its capabilities and behaviors. We then analyze the key biases that arise when fine-tuning LLMs as judges and categorize them as position bias, knowledge bias, and format bias. To address these issues, JudgeLM introduces a bag of techniques including swap augmentation, reference support, and reference drop, which improve the judge's performance. JudgeLM obtains state-of-the-art judge performance on both the existing PandaLM benchmark and our proposed new benchmark. JudgeLM is efficient: JudgeLM-7B needs only 3 minutes to judge 5K samples with 8 A100 GPUs. JudgeLM achieves over 90% agreement with the teacher judge, which even surpasses human-to-human agreement. JudgeLM also demonstrates extended capabilities in judging single answers, multimodal models, multiple answers, and multi-turn chats.
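As a rough illustration of two ideas named in the abstract, the sketch below shows what swap augmentation and an agreement rate could look like in Python. It is not taken from the JudgeLM code base: the `JudgeSample` fields and the `swap_augment`, `verdict`, and `agreement` helpers are hypothetical names chosen for this example, and the sketch assumes each training sample holds two answers with one numeric score each. It also leaves out the free-text judgment, which would need its answer references swapped as well.

```python
from dataclasses import dataclass, replace
from typing import Sequence


@dataclass(frozen=True)
class JudgeSample:
    """Hypothetical pairwise sample: one question, two answers, two scores."""
    question: str
    answer_1: str
    answer_2: str
    score_1: float
    score_2: float


def swap_augment(sample: JudgeSample) -> JudgeSample:
    """Return a copy with the answer order (and matching scores) swapped.

    Training on both orders is one way to counter position bias: the
    judge cannot learn that a given slot tends to win.
    """
    return replace(
        sample,
        answer_1=sample.answer_2,
        answer_2=sample.answer_1,
        score_1=sample.score_2,
        score_2=sample.score_1,
    )


def verdict(score_1: float, score_2: float) -> str:
    """Reduce a score pair to a three-way verdict."""
    if score_1 > score_2:
        return "answer_1"
    if score_2 > score_1:
        return "answer_2"
    return "tie"


def agreement(judge: Sequence[str], teacher: Sequence[str]) -> float:
    """Fraction of samples where the judge's verdict matches the teacher's."""
    if len(judge) != len(teacher):
        raise ValueError("verdict lists must have the same length")
    if not teacher:
        raise ValueError("verdict lists must not be empty")
    matches = sum(j == t for j, t in zip(judge, teacher))
    return matches / len(teacher)
```

In this sketch a dataset would be augmented by adding `swap_augment(s)` for each sample `s`, and `agreement` would be computed over the verdicts of the fine-tuned judge and the teacher judge on the same held-out samples.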
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Split and Merge: Aligning Position Biases in Large Language Model based Evaluators (2023)
- Improving Automatic VQA Evaluation Using Large Language Models (2023)
- Generative Judge for Evaluating Alignment (2023)
- Efficient Finetuning Large Language Models For Vietnamese Chatbot (2023)
- OpenChat: Advancing Open-source Language Models with Mixed-Quality Data (2023)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
@LianghuiZhu @xinggangw @xinlongwang What is the proper way to prompt JudgeLM for single-answer, reference-free evaluation?
The paper talks about this, but I don't see any explicit examples in the appendix or in the hosted demo.
Greetings!
As mentioned in Section 6.4, "Extensions of JudgeLM: Grading a single answer", judging a single answer requires the corresponding reference answer, which is treated as the full-score answer. Therefore, judging a single answer always needs a reference answer.
Best regards,
lianghui
Thanks!
JudgeLM: Revolutionizing AI Evaluation with Fine-Tuned Large Language Models
Links:
Subscribe: https://www.youtube.com/@Arxflix
Twitter: https://x.com/arxflix
LMNT (Partner): https://lmnt.com/