---
library_name: transformers
language:
- en
base_model:
- Qwen/Qwen2.5-72B-Instruct
tags:
- evaluation
---
<div align="center">

# LMUnit: Fine-grained Evaluation with Natural Language Unit Tests

<img src="Contextual_AI_Brand_Mark_Dark.png" width="10%" alt="Contextual_AI"/>

</div>
<hr>
<div align="center">

[Paper](https://arxiv.org/abs/2412.13091) |
[Blog](https://contextual.ai/research/lmunit) |
[GitHub](https://github.com/ContextualAI/LMUnit) |
[Model Collection](https://huggingface.co/collections/ContextualAI/lmunit)

</div>
**LMUnit** is a state-of-the-art language model optimized for evaluating natural language unit tests. It takes three inputs (a prompt, a response, and a unit test) and produces a continuous score between 1 and 5, where higher scores indicate that the response better satisfies the unit test's criterion.

LMUnit achieves leading average performance across preference, direct-scoring, and fine-grained unit-test evaluation tasks, as measured by FLASK and BiGGen Bench, and performs on par with frontier models at coarse evaluation of long-form responses (LFQA). It also aligns closely with human preferences, ranking in the top 5 on RewardBench with 93.5% accuracy and second on RewardBench 2 with 82.1% accuracy.

For more details, please see the [blog post](https://contextual.ai/research/lmunit) or the [paper](https://arxiv.org/abs/2412.13091).
## Model Details
LMUnit owes its strong performance and versatility to three key methodologies in its training approach:
- **Multi-Objective Training:** The model simultaneously learns from multiple evaluation signals, including pairwise comparisons between responses, direct quality ratings, and specialized criteria-based judgments.
- **Synthetic Data Generation:** We developed a sophisticated pipeline to generate training data that captures nuanced, fine-grained evaluation criteria and subtle quality distinctions between responses across a wide range of use cases and scenarios.
- **Importance Weighting:** We demonstrate that adjusting unit test weights to reflect the relative importance of different criteria achieves results that better align with human preferences (see the sketch after this list).
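
To make importance weighting concrete, here is a minimal sketch of aggregating per-unit-test scores (each on LMUnit's 1-5 scale) with user-chosen weights. The helper function and the example weights are illustrative assumptions, not an API of the `lmunit` package:

```python
def weighted_unit_test_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Importance-weighted average of per-unit-test scores (1-5 scale)."""
    total_weight = sum(weights[test] for test in scores)
    return sum(scores[test] * weights[test] for test in scores) / total_weight

# Hypothetical scores for one response, with correctness weighted twice as
# heavily as concision:
scores = {
    "Is the answer factually correct?": 4.8,
    "Is the response concise?": 3.5,
}
weights = {
    "Is the answer factually correct?": 2.0,
    "Is the response concise?": 1.0,
}
print(round(weighted_unit_test_score(scores, weights), 2))  # 4.37
```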
### Model Description
- **Developed by:** Contextual AI
- **Language(s) (NLP):** English
- **Finetuned from model:** Qwen/Qwen2.5-72B-Instruct
### Model Sources
- **Repository:** https://github.com/ContextualAI/LMUnit
- **Paper:** https://arxiv.org/abs/2412.13091
## 🚀 Model Quick Start
### Installation
```bash
pip install lmunit
```
### Basic Usage
```python
from lmunit import LMUnit
from vllm import SamplingParams
# Initialize LMUnit
model = LMUnit(
    model_path="ContextualAI/LMUnit-qwen2.5-72b",
    tp_size=4,  # shard the 72B model across 4 GPUs via tensor parallelism
)
# Define evaluation
query = "What is the capital of France?"
response = "Paris"
unit_test = "Does the response correctly identify the capital city?"
# Generate score
sampling_params = SamplingParams(temperature=0.0, max_tokens=10, logprobs=20)
prompt = f"Query: {query}\n\nResponse: {response}\n\nUnit Test: {unit_test}"
output = model.generate(prompt, sampling_params)
print(output)
```
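
The request for `logprobs=20` hints that the continuous 1-5 score is derived from the model's probability distribution over rating tokens rather than from the sampled text alone. As a minimal sketch (an assumption about the mechanism, not necessarily what `lmunit` does internally), one common recipe takes the probability-weighted expectation over the digit tokens "1" through "5" at the first generated position:

```python
import math

def expected_rating(first_token_logprobs: dict[str, float]) -> float:
    """Expectation over the rating tokens '1'..'5', renormalized.

    `first_token_logprobs` maps candidate token strings to log-probabilities
    at the first generated position (e.g. extracted from vLLM's logprobs output).
    """
    probs = {tok: math.exp(lp) for tok, lp in first_token_logprobs.items()
             if tok in {"1", "2", "3", "4", "5"}}
    total = sum(probs.values())  # renormalize over the rating tokens only
    return sum(int(tok) * p for tok, p in probs.items()) / total

# Example: most mass on "5", some on "4" and "3" -> continuous score of 4.65
print(expected_rating({"5": math.log(0.7), "4": math.log(0.25), "3": math.log(0.05)}))
```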
### Alternative: Using Transformers
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load model (device_map="auto" shards the 72B model across available GPUs,
# torch_dtype="auto" uses the checkpoint's native precision)
tokenizer = AutoTokenizer.from_pretrained("ContextualAI/LMUnit-qwen2.5-72b")
model = AutoModelForCausalLM.from_pretrained(
    "ContextualAI/LMUnit-qwen2.5-72b",
    torch_dtype="auto",
    device_map="auto",
)
# Prepare prompt
query = "What is the capital of France?"
response = "Paris"
unit_test = "Does the response correctly identify the capital city?"
content = f"Query: {query}\n\nResponse: {response}\n\nUnit Test: {unit_test}"
messages = [{"role": "user", "content": content}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
# Generate
outputs = model.generate(**inputs, max_new_tokens=40)
result = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:])
print(result)
```
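
In practice a single response is usually checked against several unit tests, one criterion per test. A straightforward pattern is to loop over the tests and collect one score each. The sketch below reuses the `model` and `sampling_params` objects from the Basic Usage example above; the unit tests themselves are illustrative:

```python
# Score one query/response pair against a small, hypothetical suite of unit tests.
query = "What is the capital of France?"
response = "Paris is the capital of France."
unit_tests = [
    "Does the response correctly identify the capital city?",
    "Is the response a complete sentence?",
    "Does the response avoid irrelevant information?",
]

for unit_test in unit_tests:
    prompt = f"Query: {query}\n\nResponse: {response}\n\nUnit Test: {unit_test}"
    output = model.generate(prompt, sampling_params)
    print(f"{unit_test}\n  -> {output}\n")
```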
For more examples, see our [GitHub repository](https://github.com/ContextualAI/LMUnit).
## Evaluation Results

| Model | FLASK | BiGGen-Bench | Human-Internal | InfoBench | RewardBench | LFQA | RewardBench 2 |
|:------|------:|-------------:|---------------:|----------:|------------:|-----:|--------------:|
| **LMUnit-LLaMA-3.1-70B** | 72.03 | 67.69 | 93.63 | 89.00 | 91.56 | 76.15 | 80.5 |
| **LMUnit-Qwen2.5-72B** | 73.85 | 69.56 | 94.44 | 88.67 | 91.13 | 73.85 | 82.1 |
## Citation
If you find our work helpful, feel free to cite our paper:
```bibtex
@inproceedings{saadfalcon2025lmunit,
title={{LMUnit}: Fine-grained Evaluation with Natural Language Unit Tests},
author={Jon Saad-Falcon and Rajan Vivek and William Berrios and Nandita Shankar Naik and Matija Franklin and Bertie Vidgen and Amanpreet Singh and Douwe Kiela and Shikib Mehri},
booktitle={Findings of the Association for Computational Linguistics: EMNLP 2025},
year={2025},
url={https://arxiv.org/abs/2412.13091}
}
``` |