Entropy Minimization: Llama-3.2-3B-Instruct trained on DAPO-14k

This is the Llama-3.2-3B-Instruct model trained with Entropy Minimization on the DAPO-14k training set. It accompanies the research presented in the paper "Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models".
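
Entropy Minimization trains the model on its own outputs by reducing the uncertainty of its next-token predictions, with no gold labels required. Below is a minimal PyTorch sketch of this generic objective; the exact recipe used for this model (e.g., which tokens enter the loss, or whether updates are RL-style) may differ from this illustration.

```python
import torch
import torch.nn.functional as F

def entropy_minimization_loss(logits: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab); attention_mask: (batch, seq_len), 1 for real tokens.
    log_probs = F.log_softmax(logits, dim=-1)
    # Shannon entropy of the predictive distribution at each position.
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (batch, seq_len)
    mask = attention_mask.float()
    # Average entropy over non-padding tokens; minimizing it sharpens the
    # model's predictive distribution on its own generations.
    return (token_entropy * mask).sum() / mask.sum().clamp(min=1.0)
```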

Co-rewarding is a self-supervised reinforcement learning (RL) framework designed to improve the reasoning ability of large language models (LLMs). By deriving complementary supervision signals, it stabilizes training and addresses the training collapse often encountered in self-rewarding methods. (The model in this repository was trained with Entropy Minimization rather than with Co-rewarding itself.)

If you are interested in Co-rewarding, you can find more details in our GitHub repo: https://github.com/tmlr-group/Co-rewarding
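
A minimal inference sketch using the transformers library (the prompt is illustrative, and device_map="auto" assumes accelerate is installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TMLR-Group-HF/Entropy-Llama-3.2-3B-Instruct-DAPO14k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="bfloat16", device_map="auto"
)

messages = [{"role": "user", "content": "Solve: if 3x + 5 = 20, what is x?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=512)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```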

Format: Safetensors · Model size: 4B params · Tensor type: BF16