AI & ML interests

None defined yet.

Recent Activity

evijit posted an update 28 days ago
AI for Scientific Discovery Won't Work Without Fixing How We Collaborate.

My co-author @cgeorgiaw and I just published a paper challenging a core assumption: that the main barriers to AI in science are technical. They're not. They're social.

Key findings:

🚨 The "AI Scientist" myth delays progress: Waiting for AGI devalues human expertise and obscures science's real purpose of cultivating understanding, not just producing outputs.
📊 Wrong incentives: Datasets have 100x longer impact than models, yet data curation is undervalued.
⚠️ Broken collaboration: Domain scientists want understanding. ML researchers optimize performance. Without a shared language, projects fail.
🔍 Fragmentation costs years: Harmonizing just 9 cancer files took 329 hours.

Why this matters: Upstream bottlenecks like efficient PDE solvers could accelerate discovery across multiple sciences. CASP mobilized a community around protein structure, enabling AlphaFold. We need this for dozens of challenges.

Thus, we're launching Hugging Science! A global community addressing these barriers through collaborative challenges, open toolkits, education, and community-owned infrastructure. Please find all the links below!

Paper: AI for Scientific Discovery is a Social Problem (2509.06580)
Join: hugging-science
Discord: https://discord.com/invite/VYkdEVjJ5J
evijit posted an update 4 months ago
New blog post alert! "What is the Hugging Face Community Building?", with @yjernite and @irenesolaiman

What 1.8 Million Models Reveal About Open Source Innovation: Our latest deep dive into the Hugging Face Hub reveals patterns that challenge conventional AI narratives:

🔗 Models become platforms for innovation: Qwen, Llama, and Gemma models have spawned entire ecosystems of specialized variants. Looking at derivative works shows community adoption better than any single metric.

📊 Datasets reveal the foundation layer:
→ The most downloaded datasets are evaluation benchmarks (MMLU, SQuAD, GLUE)
→ Universities and research institutions dominate foundational data
→ Domain-specific datasets thrive across finance, healthcare, robotics, and science
→ Open actors provide the datasets that power most AI development

๐Ÿ›๏ธ Research institutions lead the charge: AI2 (Allen Institute) emerges as one of the most active contributors, alongside significant activity from IBM, NVIDIA, and international organizations. The open source ecosystem spans far beyond Big Tech.

🔍 Interactive exploration tools: We've built several tools to help you discover patterns!

ModelVerse Explorer - organizational contributions
DataVerse Explorer - dataset patterns
Organization HeatMap - activity over time
Base Model Explorer - model family trees
Semantic Search - find models by capability

📚 Academic research is thriving: Researchers are already producing valuable insights, including recent work at FAccT 2025: "The Brief and Wondrous Life of Open Models." We've also made hub datasets, weekly snapshots, and other data available for your own analysis.
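If you want to run your own analysis on top of that data, here's a minimal sketch using the `huggingface_hub` Python client (the example organization and limits are my own illustrative choices, not taken from the blog post):

```python
from collections import Counter
from huggingface_hub import HfApi

api = HfApi()

# Most-downloaded datasets: in the blog's analysis these are dominated
# by evaluation benchmarks (MMLU, SQuAD, GLUE, ...).
for ds in api.list_datasets(sort="downloads", direction=-1, limit=10):
    print(f"{ds.id:50s} {ds.downloads or 0:>12,} downloads")

# Activity of one research institution on the Hub (AI2 used as an example).
allenai_models = list(api.list_models(author="allenai"))
print(f"\nallenai hosts {len(allenai_models)} model repos")

# Rough view of which tasks that organization contributes to.
tasks = Counter(m.pipeline_tag for m in allenai_models if m.pipeline_tag)
print(tasks.most_common(5))
```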

The bottom line: AI development is far more distributed, diverse, and collaborative than popular narratives suggest. Real innovation happens through community collaboration across specialized domains.

Read: https://huggingface.co/blog/evijit/hf-hub-ecosystem-overview
clefourrier posted an update 6 months ago
Always surprised that so few people actually read the FineTasks blog, on
✨ how to select training evals with the highest signal ✨

If you're serious about training models without wasting compute on shitty runs, you absolutely should read it!!

A high-signal eval actually tells you precisely, during training, how well & what your model is learning, allowing you to discard the bad runs/bad samplings/...!

The blog covers in depth prompt choice, metrics, and datasets, across languages/capabilities, and my fave section is "which properties should evals have" 👌
(so you know how to select the best evals for your use case)
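As a rough illustration of what "high signal" can mean in practice (my own sketch, not code from the blog), you could check that a task's score climbs steadily over a run and isn't drowned in checkpoint-to-checkpoint noise:

```python
import numpy as np
from scipy.stats import spearmanr

def eval_signal(steps, scores):
    """Crude signal diagnostics for one eval across training checkpoints."""
    steps, scores = np.asarray(steps), np.asarray(scores, dtype=float)
    # Monotonicity: a high-signal eval should improve steadily with training.
    monotonicity, _ = spearmanr(steps, scores)
    # Noise: average step-to-step jitter relative to the total improvement.
    jitter = np.abs(np.diff(scores)).mean()
    noise_ratio = jitter / max(abs(scores[-1] - scores[0]), 1e-8)
    return {"monotonicity": monotonicity, "noise_ratio": noise_ratio}

# Example: one task's scores at a few checkpoints of the same run.
print(eval_signal(steps=[1000, 2000, 3000, 4000],
                  scores=[0.31, 0.35, 0.38, 0.42]))
```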

Blog: HuggingFaceFW/blogpost-fine-tasks
clefourrier posted an update 8 months ago
The Gemma3 family is out! I've been reading the tech report, and this section was really interesting to me from a methods/scientific fairness point of view.

Instead of doing over-hyped comparisons, they clearly state that **results are reported in a setup which is advantageous to their models**.
(Which everybody does, but people usually don't say so)

For a tech report, it makes a lot of sense to report model performance when used optimally!
On leaderboards, on the other hand, comparisons are apples to apples, but potentially in a suboptimal way for a given model family (much like some users interact sub-optimally with models).

It also contains a cool section (6) on training data memorization rates! It's important to see whether your model will output training data it has seen verbatim: always an issue for privacy/copyright/..., but also very much for evaluation!

Because if your model knows its evals by heart, you're not testing for generalization.
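As a toy illustration of that idea (not the methodology from the Gemma3 report), a verbatim-memorization probe can be as simple as feeding the model a prefix of a known training document and checking whether greedy decoding reproduces the true suffix. The model name and token counts below are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder: swap in the model you actually want to probe
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def is_memorized(document: str, prefix_len: int = 50, suffix_len: int = 50) -> bool:
    """Greedy-decode a continuation of a document prefix and compare it
    token-for-token against the document's true suffix."""
    ids = tok(document, return_tensors="pt").input_ids[0]
    prefix = ids[:prefix_len].unsqueeze(0)
    true_suffix = ids[prefix_len:prefix_len + suffix_len]
    out = model.generate(prefix, max_new_tokens=suffix_len, do_sample=False)
    generated_suffix = out[0, prefix.shape[1]:]
    return generated_suffix.tolist() == true_suffix.tolist()
```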
clefourrier posted an update over 1 year ago
In basic chatbots, errors are annoyances. In medical LLMs, errors can have life-threatening consequences 🩸

It's therefore vital to benchmark/follow advances in medical LLMs before even thinking about deployment.

This is why a small research team introduced a medical LLM leaderboard, to get reproducible and comparable results between LLMs, and allow everyone to follow advances in the field.

openlifescienceai/open_medical_llm_leaderboard

Congrats to @aaditya and @pminervini !
Learn more in the blog: https://huggingface.co/blog/leaderboard-medicalllm
clefourrier posted an update over 1 year ago
Contamination-free code evaluations with LiveCodeBench! 🖥️

LiveCodeBench is a new leaderboard, which contains:
- complete code evaluations (on code generation, self repair, code execution, tests)
- my favorite feature: problem selection by publication date 📅

This feature means that you can get model scores averaged only over new problems that fall outside the training data. That means... contamination-free code evals! 🚀
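In spirit, the cutoff works something like the sketch below (illustrative field names and toy data, not the leaderboard's actual code): keep only problems published after the model's training-data cutoff and average scores over those.

```python
from datetime import date

# Each record: a problem, its publication date, and whether the model solved it.
# The data here is made up purely for illustration.
results = [
    {"problem": "p1", "published": date(2023, 9, 1),  "solved": True},
    {"problem": "p2", "published": date(2024, 2, 15), "solved": False},
    {"problem": "p3", "published": date(2024, 3, 10), "solved": True},
]

def pass_rate_after(results, cutoff: date) -> float:
    """Average solve rate on problems published after the training cutoff."""
    fresh = [r for r in results if r["published"] > cutoff]
    return sum(r["solved"] for r in fresh) / len(fresh) if fresh else float("nan")

# e.g. for a model whose training data stops at the end of January 2024:
print(pass_rate_after(results, cutoff=date(2024, 1, 31)))  # -> 0.5
```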

Check it out!

Blog: https://huggingface.co/blog/leaderboard-livecodebench
Leaderboard: livecodebench/leaderboard

Congrats to @StringChaos @minimario @xu3kev @kingh0730 and @FanjiaYan for the super cool leaderboard!
clefourrier posted an update over 1 year ago
🆕 Evaluate your RL agents - who's best at Atari? 🏆

The new RL leaderboard evaluates agents in 87 possible environments (from Atari 🎮 to motion control simulations 🚶 and more)!

When you submit your model, it's run and evaluated in real time - and the leaderboard displays small videos of the best model's run, which is super fun to watch! ✨

Kudos to @qgallouedec for creating and maintaining the leaderboard!
Let's find out which agent is the best at games! 🚀
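A minimal evaluation loop for one of these environments could look like the sketch below (using `gymnasium`, with a random policy standing in for a submitted agent; this is not the leaderboard's actual pipeline, and Atari environments need the extra `gymnasium[atari]` install):

```python
import gymnasium as gym

def evaluate(env_id: str = "CartPole-v1", episodes: int = 10) -> float:
    """Run a policy for a few episodes and return its mean episodic return."""
    env = gym.make(env_id)
    returns = []
    for _ in range(episodes):
        obs, info = env.reset()
        done, ep_return = False, 0.0
        while not done:
            action = env.action_space.sample()  # replace with your agent's policy
            obs, reward, terminated, truncated, info = env.step(action)
            ep_return += reward
            done = terminated or truncated
        returns.append(ep_return)
    env.close()
    return sum(returns) / len(returns)

print(evaluate())
```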

open-rl-leaderboard/leaderboard