mzhaoshuai's Collections

RefAlign: RL with Similarity-based Rewards

Datasets and models from the paper "Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data".