If you've ever trained a VLM, you know this problem: nobody shares their data mixtures. It's a black box, making replicating SOTA work impossible. We wanted to change that.
FineVision unifies 200 sources into 24 million samples. With 17.3 million images and 9.5 billion answer tokens, it's the largest open resource of its kind.
In the paper, we share how we built it: 🔍 finding and cleaning data at scale 🧹 removing excessive duplicates across sources 🤗 decontaminating against 66 public benchmarks
My favorite part is Figure 6 (in the video!). It's our visual diversity analysis. It shows that FineVision isn't just bigger; it's more balanced and conceptually richer than other open datasets. NVIDIA's Eagle 2 paper highlighted just how critical this visual diversity is, and our results confirm it: models trained on FineVision consistently outperform those trained on any other open dataset on 11 benchmarks!
🎉 To celebrate the paper, I’m also releasing a concatenated and shuffled version of the full dataset! 👉HuggingFaceM4/FineVision_full_shuffled
It’s ready to stream, so you can start training your own models right away:
from datasets import load_dataset d = load_dataset("HuggingFaceM4/FineVision_full_shuffled", split="train", streaming=True) print(next(iter(d)))
A big shoutout to the first authors: Luis Wiedmann and Orr Zohar. They are rockstars!
deepseek-ai/DeepSeek-OCR is out! 🔥 my take ⤵️ > pretty insane it can parse and re-render charts in HTML > it uses CLIP and SAM features concatenated, so better grounding > very efficient per vision tokens/performance ratio > covers 100 languages
🌎 AI ethics and sustainability are two sides of the same coin.
In our new blog post with Dr. Sasha Luccioni, we argue that separating them (as is too often the case) means missing the bigger picture of how AI systems impact both people and the planet.
Ethical and sustainable AI development can’t be pursued in isolation. The same choices that affect who benefits or is harmed by AI systems also determine how much energy and resources they consume.
We explore how two key concepts, evaluation and transparency, can serve as bridges between these domains:
📊 Evaluation, by moving beyond accuracy or performance metrics to include environmental and social costs, as we’ve done with tools like the AI Energy Score.
🔍 Transparency, by enabling reproducibility, accountability, and environmental reporting through open tools like the Environmental Transparency Space.
AI systems mirror our priorities. If we separate ethics from sustainability, we risk building technologies that are efficient but unjust, or fair but unsustainable.
One of the hardest challenges in AI safety is finding the right balance: how do we protect people from harm without undermining their agency? This tension is especially visible in conversational systems, where safeguards can sometimes feel more paternalistic than supportive.
In my latest piece for Hugging Face, I argue that open source and community-driven approaches offer a promising (though not exclusive) way forward.
✨ Transparency can make safety mechanisms into learning opportunities. ✨ Collaboration with diverse communities makes safeguards more relevant across contexts. ✨ Iteration in the open lets protections evolve rather than freeze into rigid, one-size-fits-all rules.
Of course, this isn’t a silver bullet. Top-down safety measures will still be necessary in some cases. But if we only rely on corporate control, we risk building systems that are safe at the expense of trust and autonomy.