Entropy Minimization: Llama-3.2-3B-Instruct trained on DAPO-14k
This is the Llama-3.2-3B-Instruct model trained with Entropy Minimization on the DAPO-14k training set. It was produced as part of the research presented in the paper Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models.
Co-rewarding is a self-supervised reinforcement learning (RL) framework that improves the reasoning ability of large language models (LLMs) by introducing complementary supervision signals to stabilize training, addressing the training-collapse issue often encountered in self-rewarding methods.
If you are interested in Co-rewarding, you can find more details in our GitHub repository: https://github.com/tmlr-group/Co-rewarding
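The entropy-minimization objective named in the title can be illustrated with a small sketch: the loss is the average Shannon entropy of the model's next-token distributions, so minimizing it sharpens the model's predictions. This is a minimal, generic illustration (plain numpy on a dummy logits array), not the training code from the paper or repository.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)  # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def entropy_minimization_loss(logits):
    """Mean per-token entropy; a low value means confident (peaked) predictions."""
    return token_entropy(logits).mean()

# Uniform logits give maximal entropy log(vocab_size); peaked logits give near zero.
uniform = np.zeros((2, 3, 4))            # batch of 2, 3 tokens, vocab of 4
peaked = np.tile([100.0, 0.0, 0.0, 0.0], (2, 3, 1))
print(entropy_minimization_loss(uniform))  # close to log(4) ~ 1.386
print(entropy_minimization_loss(peaked))   # close to 0
```

In an actual RL setup this quantity would be computed from the policy's logits and minimized (or used as a reward signal) alongside the rest of the training objective; the sketch only shows the entropy term itself.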