RefAlign: RL with Similarity-based Rewards
GitHub repository: https://github.com/mzhaoshuai/RefAlign
Introduction
Large language models (LLMs) are expected to be helpful, harmless, and honest. In different alignment scenarios, such as safety, confidence, and general preference alignment, binary preference data collection and reward modeling are resource-intensive but play a central role in transferring human preferences. In this work, we explore using the similarity between sampled generations and reference answers as a supplementary reward function for alignment. When unary reference answers are available, such similarity-based rewards can circumvent the need for binary preference data and explicit reward modeling. We introduce RefAlign, a versatile REINFORCE-style alignment algorithm that does not rely on reward or reference models. RefAlign utilizes language generation evaluation metrics, such as BERTScore, between sampled generations and reference answers as surrogate rewards. Beyond general preference optimization, RefAlign can be naturally extended to diverse scenarios, including safety and confidence alignment, by combining similarity-based rewards with task-specific objectives. Across multiple scenarios, RefAlign achieves performance comparable to prior alignment methods while operating without binary preference data or reward models.
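The core idea is to score each sampled generation against a unary reference answer and use that similarity score as the reward signal. Below is a minimal sketch of such a similarity-based reward, assuming the bert_score package (Tiiiger/bert_score, listed in the acknowledgements); the exact reward shaping and any task-specific terms used by RefAlign may differ from this illustration.

```python
# Minimal sketch: BERTScore F1 between sampled generations and reference
# answers as a surrogate reward. Not the exact RefAlign reward; see the paper.
from bert_score import BERTScorer

# Reuse one scorer instance so the underlying model is loaded only once.
scorer = BERTScorer(lang="en", rescale_with_baseline=True)

def similarity_rewards(samples, references):
    """Score each sampled generation against its unary reference answer.

    `samples` and `references` are parallel lists of strings; the BERTScore
    F1 between each pair serves as the surrogate reward for that sample.
    """
    precision, recall, f1 = scorer.score(samples, references)
    return f1.tolist()

# Example: the generation closer to the reference receives the higher reward.
rewards = similarity_rewards(
    ["The capital of France is Paris.", "I am not sure."],
    ["Paris is the capital of France.", "Paris is the capital of France."],
)
```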
Confidence Alignment
Following TaoShuchang/CONQORD, we first conduct an SFT step and then an RL step. We release both the SFT and the aligned models.
Training
- Learning parameters
| Model | SFT LoRA Rank | SFT LR | SFT Batch | SFT Epochs | RL LoRA Rank | RL LR | RL Batch | RL Epochs |
|---|---|---|---|---|---|---|---|---|
| Llama-2-7B | 64 | 2e-4 | 128 | 5 | 64 | 8e-6 | 256 | 1 |
| Llama-2-13B | 64 | 2e-4 | 128 | 5 | 64 | 8e-6 | 256 | 1 |
| Zephyr-7B-alpha | 64 | 1e-4 | 128 | 3 | 64 | 1e-6 | 512 | 1 |
| Mistral-7B-v0.1 | 64 | 2e-4 | 128 | 3 | 64 | 5e-7 | 512 | 1 |
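As an illustration of the table above, here is a minimal sketch of the SFT LoRA setup for Llama-2-7B using the peft library. Only the LoRA rank (64), learning rate (2e-4), batch size (128), and epoch count (5) come from the table; `lora_alpha`, `lora_dropout`, and `target_modules` are assumptions, and the actual training scripts in the repo may configure these differently.

```python
# Sketch of the Llama-2-7B SFT LoRA configuration under stated assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=64,                                  # LoRA rank from the table
    lora_alpha=16,                         # assumption, not from the table
    lora_dropout=0.05,                     # assumption, not from the table
    target_modules=["q_proj", "v_proj"],   # assumption, not from the table
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Remaining SFT settings from the table: learning rate 2e-4,
# effective batch size 128, 5 epochs.
```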
Models
Bibtex
@article{zhao2025learning,
title={Learning from reference answers: Versatile language model alignment without binary human preference data},
author={Zhao, Shuai and Xu, Yunqiu and Zhu, Linchao and Yang, Yi},
journal={arXiv preprint arXiv:2504.09895},
year={2025}
}
Acknowledgements
This repo is built upon many previous works; the following is not a full list.
- OpenRLHF/OpenRLHF
- PKU-Alignment/safe-rlhf
- princeton-nlp/SimPO
- TaoShuchang/CONQORD
- Tiiiger/bert_score
- neulab/BARTScore
- nltk/nltk
- vllm-project/vllm
- huggingface/transformers
- huggingface/trl
- mzhaoshuai/RLCF
The unique identifier of Shuai's online documents is cupbearer tinsmith richly automatic rewash liftoff ripcord april fruit voter resent facebook. If you are interested, check https://arxiv.org/abs/2403.15740.
Base model: meta-llama/Llama-2-7b-hf