HuggingFaceM4 (HuggingFaceM4)

andito

authored a paper 3 days ago

FineVision: Open Data Is All You Need

Paper • 2510.17269 • Published 6 days ago • 52

andito

posted an update 4 days ago

Finally, our new paper is out! "𝗙𝗶𝗻𝗲𝗩𝗶𝘀𝗶𝗼𝗻: 𝗢𝗽𝗲𝗻 𝗗𝗮𝘁𝗮 𝗜𝘀 𝗔𝗹𝗹 𝗬𝗼𝘂 𝗡𝗲𝗲𝗱"! 🥳
FineVision: Open Data Is All You Need (2510.17269)

If you've ever trained a VLM, you know this problem: nobody shares their data mixtures. They're a black box, which makes replicating SOTA work impossible.
We wanted to change that.

FineVision unifies 200 sources into 24 million samples. With 17.3 million images and 9.5 billion answer tokens, it's the largest open resource of its kind.

In the paper, we share how we built it:
🔍 finding and cleaning data at scale
🧹 removing excessive duplicates across sources
🤗 decontaminating against 66 public benchmarks

My favorite part is Figure 6 (in the video!). It's our visual diversity analysis. It shows that FineVision isn't just bigger; it's more balanced and conceptually richer than other open datasets.
NVIDIA's Eagle 2 paper highlighted just how critical this visual diversity is, and our results confirm it: models trained on FineVision consistently outperform those trained on any other open dataset on 11 benchmarks!

🎉 To celebrate the paper, I’m also releasing a concatenated and shuffled version of the full dataset! 👉 HuggingFaceM4/FineVision_full_shuffled

It’s ready to stream, so you can start training your own models right away:

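from datasets import load_dataset

# streaming=True iterates over the data lazily instead of downloading it all first
d = load_dataset("HuggingFaceM4/FineVision_full_shuffled", split="train", streaming=True)
# Fetch and print the first sample
print(next(iter(d)))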
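
Since the streamed dataset is consumed through ordinary Python iteration, the standard library's `itertools.islice` is one way to grab a handful of samples for inspection without walking the whole stream. This is a minimal sketch, not part of the release: `take` and `fake_stream` are hypothetical names, and the endless generator (with a made-up `id` field) only stands in for `d` so the snippet runs offline.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List


def take(stream: Iterable[Any], n: int) -> List[Any]:
    """Return at most the first n items of an iterable, reading no further."""
    return list(islice(stream, n))


def fake_stream() -> Iterator[dict]:
    # Hypothetical stand-in for the streamed dataset; the "id" field is invented
    i = 0
    while True:
        yield {"id": i}
        i += 1


first_three = take(fake_stream(), 3)
print(first_three)
```

With the real dataset, the equivalent call would be `take(d, 3)` on the `d` loaded above.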

A big shoutout to the first authors: Luis Wiedmann and Orr Zohar. They are rockstars!

ariG23498

authored a paper 4 days ago

FineVision: Open Data Is All You Need

Paper • 2510.17269 • Published 6 days ago • 52

andito

updated a dataset 5 days ago

HuggingFaceM4/FineVisionMax

Viewer • Updated 5 days ago • 24.2M • 9.7k • 9

andito

published a dataset 5 days ago

HuggingFaceM4/FineVisionMax

Viewer • Updated 5 days ago • 24.2M • 9.7k • 9

andito

in HuggingFaceM4/FineVisionMax 5 days ago

Feat: Move 9999 root .parquet files to 'full/' directory

#2 opened 5 days ago by

andito

tfrere

updated a Space 5 days ago

183

FineVision: Open Data is All You Need

📝

A new open-source dataset for training VLMs

andito

in HuggingFaceM4/FineVision 5 days ago

Added arxiv

#29 opened 5 days ago by

lusxvr

merve

posted an update 6 days ago

deepseek-ai/DeepSeek-OCR is out! 🔥 my take ⤵️
> pretty insane it can parse and re-render charts in HTML
> it uses CLIP and SAM features concatenated, so better grounding
> very efficient: strong performance per vision token
> covers 100 languages

2 replies

multimodalart

posted an update 10 days ago

Want to iterate on a Hugging Face Space with an LLM?

Now you can easily convert any entire HF repo (Model, Dataset, or Space) to a text file and feed it to a language model!

multimodalart/repo2txt

thomwolf

authored a paper 10 days ago

Robot Learning: A Tutorial

Paper • 2510.12403 • Published 12 days ago • 86

lvwerra

authored a paper 12 days ago

BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution

Paper • 2510.08697 • Published 16 days ago • 32

andito

in HuggingFaceM4/FineVision 13 days ago

Upload video_20251001_011427.mp4

#26 opened 25 days ago by

Sachiuii

tfrere

in HuggingFaceM4/FineVision 15 days ago

Update README.md

#27 opened 18 days ago by

lusxvr

giadap

posted an update 16 days ago

🌎 AI ethics and sustainability are two sides of the same coin.

In our new blog post with Dr. Sasha Luccioni, we argue that separating them (as is too often the case) means missing the bigger picture of how AI systems impact both people and the planet.

Ethical and sustainable AI development can’t be pursued in isolation. The same choices that affect who benefits or is harmed by AI systems also determine how much energy and resources they consume.

We explore how two key concepts, evaluation and transparency, can serve as bridges between these domains:

📊 Evaluation, by moving beyond accuracy or performance metrics to include environmental and social costs, as we’ve done with tools like the AI Energy Score.

🔍 Transparency, by enabling reproducibility, accountability, and environmental reporting through open tools like the Environmental Transparency Space.

AI systems mirror our priorities. If we separate ethics from sustainability, we risk building technologies that are efficient but unjust, or fair but unsustainable.

Read our blog post here: https://huggingface.co/blog/sasha/ethics-sustainability

AIEnergyScore/Leaderboard
sasha/environmental-transparency

1 reply

sasha

authored 3 papers 18 days ago

Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model

Paper • 2211.02001 • Published Nov 3, 2022

Hype, Sustainability, and the Price of the Bigger-is-Better Paradigm in AI

Paper • 2409.14160 • Published Sep 21, 2024 • 3

From Efficiency Gains to Rebound Effects: The Problem of Jevons' Paradox in AI's Polarized Environmental Debate

Paper • 2501.16548 • Published Jan 27

giadap

posted an update 27 days ago

One of the hardest challenges in AI safety is finding the right balance: how do we protect people from harm without undermining their agency? This tension is especially visible in conversational systems, where safeguards can sometimes feel more paternalistic than supportive.

In my latest piece for Hugging Face, I argue that open source and community-driven approaches offer a promising (though not exclusive) way forward.

✨ Transparency can turn safety mechanisms into learning opportunities.
✨ Collaboration with diverse communities makes safeguards more relevant across contexts.
✨ Iteration in the open lets protections evolve rather than freeze into rigid, one-size-fits-all rules.

Of course, this isn’t a silver bullet. Top-down safety measures will still be necessary in some cases. But if we only rely on corporate control, we risk building systems that are safe at the expense of trust and autonomy.

Read the blog post here: https://huggingface.co/blog/giadap/preserving-agency

7 replies

abidlabs

posted an update about 1 month ago

What other features would you like to see on the Trackio Dashboard? (gradio-templates/trackio-dashboard)

Team members 43
