Entropy Minimization: Llama-3.2-3B-Instruct trained on DAPO-14k

This is the Llama-3.2-3B-Instruct model trained with Entropy Minimization on the DAPO-14k training set. It accompanies the research presented in the paper "Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models".
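
Entropy Minimization trains the model on its own outputs by reducing the uncertainty of its next-token predictions, with no gold labels required. Below is a minimal PyTorch sketch of this generic objective; the exact recipe used for this model (e.g., which tokens enter the loss, or whether updates are RL-style) may differ from this illustration.

```python
import torch
import torch.nn.functional as F

def entropy_minimization_loss(logits: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab); attention_mask: (batch, seq_len), 1 for real tokens.
    log_probs = F.log_softmax(logits, dim=-1)
    # Shannon entropy of the predictive distribution at each position.
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (batch, seq_len)
    mask = attention_mask.float()
    # Average entropy over non-padding tokens; minimizing it sharpens the
    # model's predictive distribution on its own generations.
    return (token_entropy * mask).sum() / mask.sum().clamp(min=1.0)
```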

Co-rewarding is a self-supervised reinforcement learning (RL) framework designed to improve the reasoning ability of large language models (LLMs). By deriving complementary supervision signals, it stabilizes training and addresses the training collapse often encountered in self-rewarding methods. (The model in this repository was trained with Entropy Minimization rather than with Co-rewarding itself.)

If you are interested in Co-rewarding, you can find more details in our GitHub repo: https://github.com/tmlr-group/Co-rewarding
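
A minimal inference sketch using the transformers library (the prompt is illustrative, and device_map="auto" assumes accelerate is installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TMLR-Group-HF/Entropy-Llama-3.2-3B-Instruct-DAPO14k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="bfloat16", device_map="auto"
)

messages = [{"role": "user", "content": "Solve: if 3x + 5 = 20, what is x?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=512)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```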

Format: Safetensors · Model size: 4B params · Tensor type: BF16