|
|
--- |
|
|
library_name: transformers |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- Qwen/Qwen2.5-72B-Instruct |
|
|
tags: |
|
|
- evaluation |
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
# LMUnit: Fine-grained Evaluation with Natural Language Unit Tests |
|
|
<img src="Contextual_AI_Brand_Mark_Dark.png" width="10%" alt="Contextual_AI"/> |
|
|
|
|
|
</div> |
|
|
|
|
|
<hr> |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
[Paper](https://arxiv.org/abs/2412.13091)


[Blog](https://contextual.ai/research/lmunit)


[GitHub](https://github.com/ContextualAI/LMUnit)


[Hugging Face Collection](https://huggingface.co/collections/ContextualAI/lmunit)
|
|
|
|
|
</div> |
|
|
|
|
|
**LMUnit** is a state-of-the-art language model optimized for evaluating natural language unit tests. It takes three inputs (a prompt, a response, and a unit test) and produces a continuous score between 1 and 5, where a higher score indicates that the response better satisfies the unit test criterion.
|
|
|
|
|
LMUnit achieves leading average performance across preference, direct scoring, and fine-grained unit test evaluation tasks, as measured by FLASK and BiGGen Bench, and performs on par with frontier models on coarse evaluation of long-form responses (LFQA). The model also demonstrates exceptional alignment with human preferences, ranking in the top 5 on RewardBench with 93.5% accuracy and in the top 2 on RewardBench2 with 82.1% accuracy.
|
|
|
|
|
For more details, please check out the [blog post](https://contextual.ai/research/lmunit) or the [paper](https://arxiv.org/abs/2412.13091).
|
|
|
|
|
## Model Details |
|
|
|
|
|
LMUnit owes its performance and versatility to three key elements of its training approach:
|
|
|
|
|
- **Multi-Objective Training:** The model simultaneously learns from multiple evaluation signals, including pairwise comparisons between responses, direct quality ratings, and specialized criteria-based judgments. |
|
|
- **Synthetic Data Generation:** We developed a sophisticated pipeline to generate training data that captures nuanced, fine-grained evaluation criteria and subtle quality distinctions between responses across a wide range of use cases and scenarios. |
|
|
- **Importance Weighting:** We demonstrate that adjusting unit test weights to reflect the relative importance of different criteria yields results that align more closely with human preferences (see the sketch below).
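
A minimal sketch of the importance-weighting idea: per-unit-test LMUnit scores (on the 1-5 scale) are aggregated with user-supplied weights. The unit tests, scores, and weights below are hypothetical and not part of the LMUnit API.

```python
# Hypothetical example: combine per-unit-test LMUnit scores (1-5 scale)
# into a single response-level score using importance weights.
scores = {
    "Is the answer factually correct?": 4.8,
    "Is the answer concise?": 3.9,
    "Does the answer cite a source?": 2.5,
}
weights = {
    "Is the answer factually correct?": 0.6,
    "Is the answer concise?": 0.1,
    "Does the answer cite a source?": 0.3,
}

# Weighted average; dividing by the weight sum means the weights
# do not need to sum to 1.
weighted_score = sum(weights[t] * s for t, s in scores.items()) / sum(weights.values())
print(f"Weighted response score: {weighted_score:.2f}")  # 4.02
```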
|
|
|
|
|
### Model Description |
|
|
|
|
|
- **Developed by:** Contextual AI |
|
|
- **Language(s) (NLP):** English |
|
|
- **Finetuned from model:** Qwen/Qwen2.5-72B-Instruct
|
|
|
|
|
### Model Sources |
|
|
|
|
|
- **Repository:** https://github.com/ContextualAI/LMUnit |
|
|
- **Paper:** https://arxiv.org/abs/2412.13091 |
|
|
|
|
|
## Model Quick Start
|
|
|
|
|
### Installation |
|
|
```bash |
|
|
pip install lmunit |
|
|
``` |
|
|
|
|
|
### Basic Usage |
|
|
```python |
|
|
from lmunit import LMUnit |
|
|
from vllm import SamplingParams |
|
|
|
|
|
# Initialize LMUnit |
|
|
model = LMUnit( |
|
|
model_path="ContextualAI/LMUnit-qwen2.5-72b", |
|
|
tp_size=4 |
|
|
) |
|
|
|
|
|
# Define evaluation |
|
|
query = "What is the capital of France?" |
|
|
response = "Paris" |
|
|
unit_test = "Does the response correctly identify the capital city?" |
|
|
|
|
|
# Generate score |
|
|
sampling_params = SamplingParams(temperature=0.0, max_tokens=10, logprobs=20) |
|
|
prompt = f"Query: {query}\n\nResponse: {response}\n\nUnit Test: {unit_test}" |
|
|
output = model.generate(prompt, sampling_params) |
|
|
print(output) |
|
|
``` |
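
A response is often checked against several unit tests at once. The loop below reuses the `model.generate` call from the example above for each criterion; the unit tests are illustrative, and how you post-process each returned score depends on the `lmunit` output format.

```python
# Illustrative: score one response against multiple unit tests,
# reusing `model`, `query`, `response`, and `sampling_params` from above.
unit_tests = [
    "Does the response correctly identify the capital city?",
    "Is the response free of unnecessary detail?",
]

for test in unit_tests:
    prompt = f"Query: {query}\n\nResponse: {response}\n\nUnit Test: {test}"
    output = model.generate(prompt, sampling_params)
    print(f"{test} -> {output}")
```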
|
|
|
|
|
### Alternative: Using Transformers |
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
|
|
|
# Load model |
|
|
tokenizer = AutoTokenizer.from_pretrained("ContextualAI/LMUnit-qwen2.5-72b") |
|
|
model = AutoModelForCausalLM.from_pretrained("ContextualAI/LMUnit-qwen2.5-72b") |
|
|
|
|
|
# Prepare prompt |
|
|
query = "What is the capital of France?" |
|
|
response = "Paris" |
|
|
unit_test = "Does the response correctly identify the capital city?" |
|
|
content = f"Query: {query}\n\nResponse: {response}\n\nUnit Test: {unit_test}" |
|
|
|
|
|
messages = [{"role": "user", "content": content}] |
|
|
inputs = tokenizer.apply_chat_template( |
|
|
messages, |
|
|
add_generation_prompt=True, |
|
|
tokenize=True, |
|
|
return_dict=True, |
|
|
return_tensors="pt", |
|
|
).to(model.device) |
|
|
|
|
|
# Generate |
|
|
outputs = model.generate(**inputs, max_new_tokens=40) |
|
|
result = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]) |
|
|
print(result) |
|
|
``` |
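
If you need the score as a number rather than raw text, a simple post-processing step like the sketch below can help. It assumes the decoded output contains a numeric score on the 1-5 scale; adjust the pattern to the model's actual output format.

```python
import re

# Assumes `result` (decoded above) contains a numeric score such as "5" or "Score: 4.5".
match = re.search(r"\d+(?:\.\d+)?", result)
score = float(match.group()) if match else None
print(f"Parsed score: {score}")
```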
|
|
|
|
|
For more examples, see our [GitHub repository](https://github.com/ContextualAI/LMUnit). |
|
|
|
|
|
### Evaluation Results
|
|
|
|
|
| Model | FLASK | BiGGen-Bench | Human-Internal | InfoBench | RewardBench | LFQA | RewardBench2 |
|
|
|:------|------:|-------------:|---------------:|----------:|----:|------:|----:| |
|
|
| **LMUnit-LLaMA-3.1-70B** | 72.03 | 67.69 | 93.63 | 89.00 | 91.56 | 76.15 | 80.5 | |
|
|
| **LMUnit-Qwen2.5-72B** | 73.85 | 69.56 | 94.44 | 88.67 | 91.13 | 73.85 | 82.1 | |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you find our work helpful, feel free to cite our paper: |
|
|
```bibtex |
|
|
@inproceedings{saadfalcon2025lmunit, |
|
|
title={{LMUnit}: Fine-grained Evaluation with Natural Language Unit Tests}, |
|
|
author={Jon Saad-Falcon and Rajan Vivek and William Berrios and Nandita Shankar Naik and Matija Franklin and Bertie Vidgen and Amanpreet Singh and Douwe Kiela and Shikib Mehri}, |
|
|
booktitle={Findings of the Association for Computational Linguistics: EMNLP 2025}, |
|
|
year={2025}, |
|
|
url={https://arxiv.org/abs/2412.13091} |
|
|
} |
|
|
``` |