---
library_name: transformers
language:
- en
base_model:
- Qwen/Qwen2.5-72B-Instruct
tags:
- evaluation
---

<div align="center">

# LMUnit: Fine-grained Evaluation with Natural Language Unit Tests
<img src="Contextual_AI_Brand_Mark_Dark.png" width="10%" alt="Contextual_AI"/>

</div>

<hr>

<div align="center">

[![Paper](https://img.shields.io/badge/Paper-LMUnit-blue)](https://arxiv.org/abs/2412.13091)
[![Blog Post](https://img.shields.io/badge/📝%20Blog-LMUnit-green)](https://contextual.ai/research/lmunit)
[![GitHub](https://img.shields.io/badge/GitHub-LMUnit-black?logo=github)](https://github.com/ContextualAI/LMUnit)
[![Hugging Face Collection](https://img.shields.io/badge/🤗%20Hugging%20Face-Model%20Collection-yellow)](https://huggingface.co/collections/ContextualAI/lmunit)

</div>

**LMUnit** is a state-of-the-art language model optimized for evaluating natural language unit tests. It takes three inputs: a prompt, a response, and a unit test, and produces a continuous score between 1 and 5, where higher scores indicate that the response better satisfies the unit test criterion.
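
For example, a single response can be checked against several criteria, each expressed as its own unit test (the query, response, and tests below are illustrative, not taken from the paper):

```python
query = "What are the side effects of aspirin?"
response = "Common side effects include stomach upset, heartburn, and an increased risk of bleeding."

# Each unit test targets one criterion; LMUnit scores each independently on a 1-5 scale.
unit_tests = [
    "Does the response list common side effects accurately?",
    "Is the response free of unsupported medical claims?",
    "Is the response concise and easy to understand?",
]
```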

The LMUnit model achieves leading average performance across preference, direct-scoring, and fine-grained unit test evaluation tasks, as measured by FLASK and BiGGen Bench, and performs on par with frontier models on coarse evaluation of long-form responses (LFQA). The model also demonstrates exceptional alignment with human preferences, ranking in the top 5 on the RewardBench benchmark with 93.5% accuracy and in the top 2 on RewardBench2 with 82.1% accuracy.

For more details, please check out the [blogpost](https://contextual.ai/research/lmunit) or the [paper](https://arxiv.org/abs/2412.13091).

## Model Details

LMUnit is highly performant and versatile because of key methodologies in its training approach:

- **Multi-Objective Training:** The model simultaneously learns from multiple evaluation signals, including pairwise comparisons between responses, direct quality ratings, and specialized criteria-based judgments.
- **Synthetic Data Generation:** We developed a sophisticated pipeline to generate training data that captures nuanced, fine-grained evaluation criteria and subtle quality distinctions between responses across a wide range of use cases and scenarios.
- **Importance Weighting:** We demonstrate that adjusting unit test weights to reflect the relative importance of different criteria achieves results that better align with human preferences (see the sketch below).
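
At inference time, importance weighting can be applied on top of LMUnit's per-test scores. The snippet below is a minimal, hypothetical sketch of such an aggregation; the helper name, scores, and weights are illustrative and not part of the released library or the training objective itself:

```python
# Hypothetical post-hoc aggregation: combine per-unit-test scores (1-5)
# into one response-level score, weighting criteria by relative importance.
def weighted_unit_test_score(scores, weights):
    assert len(scores) == len(weights) and sum(weights) > 0
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

scores = [4.8, 3.2, 4.1]    # LMUnit scores for three unit tests
weights = [0.6, 0.1, 0.3]   # e.g. correctness weighted above style
print(weighted_unit_test_score(scores, weights))  # 4.43
```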

### Model Description

- **Developed by:** Contextual AI
- **Language(s) (NLP):** English
- **Finetuned from model:** Qwen2.5-72B-Instruct

### Model Sources

- **Repository:** https://github.com/ContextualAI/LMUnit
- **Paper:** https://arxiv.org/abs/2412.13091

## 🚀 Model Quick Start

### Installation
```bash
pip install lmunit
```

### Basic Usage
```python
from lmunit import LMUnit
from vllm import SamplingParams

# Initialize LMUnit
model = LMUnit(
    model_path="ContextualAI/LMUnit-qwen2.5-72b",
    tp_size=4  # tensor-parallel size (number of GPUs the model is sharded across)
)

# Define evaluation
query = "What is the capital of France?"
response = "Paris"
unit_test = "Does the response correctly identify the capital city?"

# Generate score
sampling_params = SamplingParams(temperature=0.0, max_tokens=10, logprobs=20)
prompt = f"Query: {query}\n\nResponse: {response}\n\nUnit Test: {unit_test}"
output = model.generate(prompt, sampling_params)
print(output)
```
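
The same call can be looped to score one response against several unit tests. The sketch below reuses the `model`, `query`, `response`, and `sampling_params` defined above; the second unit test is illustrative:

```python
unit_tests = [
    "Does the response correctly identify the capital city?",
    "Is the response free of unnecessary information?",
]

for ut in unit_tests:
    prompt = f"Query: {query}\n\nResponse: {response}\n\nUnit Test: {ut}"
    print(ut, "->", model.generate(prompt, sampling_params))
```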

### Alternative: Using Transformers
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model
tokenizer = AutoTokenizer.from_pretrained("ContextualAI/LMUnit-qwen2.5-72b")
model = AutoModelForCausalLM.from_pretrained(
    "ContextualAI/LMUnit-qwen2.5-72b",
    torch_dtype="auto",   # load in the checkpoint's native precision
    device_map="auto",    # shard the 72B model across available GPUs
)

# Prepare prompt
query = "What is the capital of France?"
response = "Paris"
unit_test = "Does the response correctly identify the capital city?"
content = f"Query: {query}\n\nResponse: {response}\n\nUnit Test: {unit_test}"

messages = [{"role": "user", "content": content}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate
outputs = model.generate(**inputs, max_new_tokens=40)
result = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(result)
```
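
The decoded text should contain the model's score. Assuming the reply includes a plain number between 1 and 5 (the exact output format may differ), a simple parse could look like this:

```python
import re

# Assumption: the reply contains a numeric score such as "5" or "4.5";
# adjust the parsing if the model's output format differs.
match = re.search(r"\d+(?:\.\d+)?", result)
score = float(match.group()) if match else None
print(score)
```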

For more examples, see our [GitHub repository](https://github.com/ContextualAI/LMUnit).

## Evaluation Results

| Model | FLASK | BiGGen-Bench | Human-Internal | InfoBench | RewardBench | LFQA | RewardBench2 |
|:------|------:|-------------:|---------------:|----------:|----:|------:|----:|
| **LMUnit-LLaMA-3.1-70B** | 72.03 | 67.69 | 93.63 | 89.00 | 91.56 | 76.15 | 80.5 |
| **LMUnit-Qwen2.5-72B** | 73.85 | 69.56 | 94.44 | 88.67 | 91.13 | 73.85 | 82.1 |

## Citation

If you find our work helpful, feel free to cite our paper:
```bibtex
@inproceedings{saadfalcon2025lmunit,
      title={{LMUnit}: Fine-grained Evaluation with Natural Language Unit Tests}, 
      author={Jon Saad-Falcon and Rajan Vivek and William Berrios and Nandita Shankar Naik and Matija Franklin and Bertie Vidgen and Amanpreet Singh and Douwe Kiela and Shikib Mehri},
      booktitle={Findings of the Association for Computational Linguistics: EMNLP 2025},
      year={2025},
      url={https://arxiv.org/abs/2412.13091}
}
```