codelion posted an update 18 days ago
On this day in 2019, OpenAI released the final GPT-2 model as part of their staged release. I still remember that November well - so much was happening, but GPT-2's release felt like a watershed moment for the field. It showed us what was possible with carefully trained language models.

To recreate some of that GPT-2 magic, I recently tackled an interesting challenge: can you pretrain a language model with just 1 billion tokens - roughly 1/10th of what GPT-2 used - and still get comparable performance? After 50+ systematic experiments testing different dataset mixtures, the answer is yes.

The result is codelion/gpt-2-70m, which achieves over 90% of GPT-2's benchmark performance despite being trained on 10x less data. The key was finding the optimal dataset composition: 50% high-quality textbook PDFs, 30% filtered web content, and 20% educational resources. It even beats GPT-2 on TruthfulQA (47.31% vs 40.69%).
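For anyone who wants to try a similar mixture, here is a minimal sketch of sampling three sources at fixed ratios with the Hugging Face `datasets` library's `interleave_datasets`. The dataset names are placeholders, not the actual corpora behind gpt-2-70m, and the exact pipeline used for the model may differ.

```python
# Sketch: mix three pretraining sources at 50/30/20 ratios.
# Dataset repo names below are hypothetical placeholders.
from datasets import load_dataset, interleave_datasets

textbooks = load_dataset("your-org/textbook-pdfs-cleaned", split="train", streaming=True)    # placeholder
web = load_dataset("your-org/filtered-web-text", split="train", streaming=True)              # placeholder
edu = load_dataset("your-org/educational-resources", split="train", streaming=True)          # placeholder

# Sample documents with probabilities 0.5 / 0.3 / 0.2 until every
# source has been fully consumed at least once.
mixed = interleave_datasets(
    [textbooks, web, edu],
    probabilities=[0.5, 0.3, 0.2],
    seed=42,
    stopping_strategy="all_exhausted",
)

for example in mixed.take(3):
    print(example)
```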

If you're interested in the full story of how we discovered this optimal mixture and why curriculum learning catastrophically failed, check out the complete article: https://huggingface.co/blog/codelion/optimal-dataset-mixing

Sometimes less really is more - when you mix it right.

This is so cool! What a fun project. Thanks for sharing!