RefAlign: RL with Similarity-based Rewards

GitHub repository: https://github.com/mzhaoshuai/RefAlign

This model is a PEFT (LoRA) adapter for the meta-llama/Llama-2-7b-hf base model, trained for confidence alignment and fine-tuned on the CONQORD dataset.

The model was presented in the paper Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data.

Abstract

Large language models (LLMs) are expected to be helpful, harmless, and honest. In different alignment scenarios, such as safety, confidence, and general preference alignment, binary preference data collection and reward modeling are resource-intensive but play a central role in transferring human preferences. In this work, we explore using the similarity between sampled generations and reference answers as a supplementary reward function for alignment. When unary reference answers are available, such similarity-based rewards can circumvent the need for binary preference data and explicit reward modeling. We introduce RefAlign, a versatile REINFORCE-style alignment algorithm that does not rely on reward or reference models. RefAlign utilizes language generation evaluation metrics, such as BERTScore, between sampled generations and reference answers as surrogate rewards. Beyond general preference optimization, RefAlign can be naturally extended to diverse scenarios, including safety and confidence alignment, by combining similarity-based rewards with task-specific objectives. Across multiple scenarios, RefAlign achieves performance comparable to prior alignment methods while operating without binary preference data or reward models.
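As a rough illustration of the core idea in the abstract, the sketch below computes a REINFORCE-style loss for one prompt, where each sampled generation is rewarded by its similarity to the reference answer. It is not the RefAlign implementation. Several pieces are assumptions of this sketch: a simple token-overlap F1 (the hypothetical helper `token_f1`) stands in for BERTScore, the mean reward over the samples is used as a baseline, and log-probabilities are plain floats rather than tensors that carry gradients.

```python
from collections import Counter
from typing import List


def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two strings (hypothetical stand-in for BERTScore).

    BERTScore matches contextual embeddings; this toy version only counts
    exact matches of lowercased whitespace tokens.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def reinforce_loss(samples: List[str], logprobs: List[float], reference: str) -> float:
    """REINFORCE loss for one prompt with a similarity-based reward.

    samples:   generations sampled from the policy for the same prompt
    logprobs:  summed log-probability of each sample under the policy
    reference: the reference answer for the prompt
    """
    rewards = [token_f1(s, reference) for s in samples]
    # Mean reward over the samples acts as a baseline (an assumption of this
    # sketch) to reduce the variance of the policy-gradient estimate.
    baseline = sum(rewards) / len(rewards)
    advantages = [r - baseline for r in rewards]
    # Minimizing -(advantage * log pi) raises the probability of samples that
    # are more similar to the reference than average and lowers the rest.
    return -sum(a * lp for a, lp in zip(advantages, logprobs)) / len(samples)
```

In an actual training loop the log-probabilities would come from the policy model as differentiable tensors, and only the reward computation would run without gradients. Note that no reward model or frozen reference model appears anywhere in the loss, which matches the property the abstract emphasizes.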

GitHub Repository

For more details on the RefAlign framework, training scripts, evaluation, and other models, please refer to the official GitHub repository.

Usage

This repository contains a PEFT (LoRA) adapter. To obtain the full model (base model + adapter) for inference, please use the provided merge_model.py script.
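The contents of merge_model.py are not reproduced here. As a hedged alternative, the sketch below shows a common way to merge a LoRA adapter with the PEFT library, using `PeftModel.from_pretrained` and `merge_and_unload`. The function name `merge_adapter` and its arguments are hypothetical, and it assumes the adapter does not change the base model's vocabulary, so the base tokenizer is reused.

```python
def merge_adapter(base_id: str, adapter_id: str, out_dir: str) -> None:
    """Merge a LoRA adapter into its base model and save the full model.

    Imports are deferred so this module can be imported even when
    transformers and peft are not installed.
    """
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load base weights in the dtype recorded in their config.
    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")
    # Attach the LoRA adapter weights on top of the base model.
    model = PeftModel.from_pretrained(base, adapter_id)
    # Fold the low-rank updates into the base weights and drop the PEFT wrappers.
    merged = model.merge_and_unload()
    merged.save_pretrained(out_dir)
    # Save the base tokenizer alongside so the output directory is self-contained.
    AutoTokenizer.from_pretrained(base_id).save_pretrained(out_dir)
```

For example, merge_adapter("meta-llama/Llama-2-7b-hf", "mzhaoshuai/Llama-2-7b-hf-conf-sft", "merged-model") would write the merged model to a local directory (the output path is only an illustration). Downloading meta-llama/Llama-2-7b-hf requires accepting its license on the Hugging Face Hub.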

Framework versions

PEFT 0.11.1

Model tree for mzhaoshuai/Llama-2-7b-hf-conf-sft

Base model: meta-llama/Llama-2-7b-hf (this model is an adapter of it)

Finetunes: 1 model

Collection including this model: RefAlign: RL with Similarity-based Rewards. Datasets and models in: Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data (19 items).