RefAlign: RL with Similarity-based Rewards

GitHub repository: https://github.com/mzhaoshuai/RefAlign

This is Mistral-7B-Instruct-v0.2 aligned with SimPO, as described in the paper Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data.

Abstract: Large language models (LLMs) are expected to be helpful, harmless, and honest. In different alignment scenarios, such as safety, confidence, and general preference alignment, binary preference data collection and reward modeling are resource-intensive but play a central role in transferring human preferences. In this work, we explore using the similarity between sampled generations and reference answers as a supplementary reward function for alignment. When unary reference answers are available, such similarity-based rewards can circumvent the need for binary preference data and explicit reward modeling. We introduce RefAlign, a versatile REINFORCE-style alignment algorithm that does not rely on reward or reference models. RefAlign utilizes language generation evaluation metrics, such as BERTScore, between sampled generations and reference answers as surrogate rewards. Beyond general preference optimization, RefAlign can be naturally extended to diverse scenarios, including safety and confidence alignment, by combining similarity-based rewards with task-specific objectives. Across multiple scenarios, RefAlign achieves performance comparable to prior alignment methods while operating without binary preference data or reward models.
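To make the "REINFORCE-style, no reward or reference model" idea concrete, here is a minimal sketch of such an update in PyTorch. It is not RefAlign's exact implementation; the per-sequence log-probability shapes and the mean-reward baseline are assumptions for illustration.

```python
# Hedged sketch of a REINFORCE-style update with similarity rewards.
# Assumptions (not from the paper): per-sequence summed token log-probs
# and a mean-reward baseline over the K samples for variance reduction.
import torch

def reinforce_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """logprobs: (K,) summed token log-probs of K sampled generations.
    rewards:  (K,) similarity scores (e.g., BERTScore F1) vs. the reference."""
    advantages = rewards - rewards.mean()            # subtract a simple baseline
    return -(advantages.detach() * logprobs).mean()  # no reward/reference model needed
```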

The training data is mzhaoshuai/Llama-3.3-70B-Inst-awq_ultrafeedback_1in3.

When conducting alignment with SimPO, the reward function is BERTScore between sampled generations and reference answers. Alignment is conducted in an online manner, i.e., generations are sampled from the current policy during training.
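As a minimal sketch, the reward computation could look like the following, assuming the `bert-score` package (`pip install bert-score`). The full model id `facebook/bart-large-mnli` and `num_layers=12` are assumptions here, since the card only lists `bart-large-mnli`.

```python
# Minimal similarity-reward sketch using the bert-score package.
from bert_score import BERTScorer

# Assumed HF model id and layer choice; the card only says "bart-large-mnli".
scorer = BERTScorer(model_type="facebook/bart-large-mnli", num_layers=12)

def similarity_rewards(generations, references):
    """BERTScore F1 between each sampled generation and its reference answer."""
    _, _, f1 = scorer.score(generations, references)
    return f1  # tensor of shape (len(generations),)

# Example: K = 3 generations for one prompt, scored against the same reference.
gens = ["draft answer 1", "draft answer 2", "draft answer 3"]
print(similarity_rewards(gens, ["the reference answer"] * len(gens)))
```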

| Hyper-Parameter | Value |
| --- | --- |
| LR | 3e-7 |
| Batch Size | 120 |
| Epoch | 1 |
| Prompt Length | 512 |
| Generation Length | 1024 |
| Sampled Generations (K) | 3 |
| BERTScore Model | bart-large-mnli |
| SimPO beta | 2.5 |
| SimPO gamma/beta | 0.3 |
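
For reference, a minimal inference sketch with Hugging Face Transformers is shown below; the prompt and generation settings are illustrative, not the paper's evaluation setup.

```python
# Minimal inference sketch; sampling settings here are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mzhaoshuai/Mistral-7B-Instruct-v0.2-ref-simpo"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Give three tips for staying healthy."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```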