The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models Paper • 2510.13996 • Published 10 days ago • 6
MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources Paper • 2509.25531 • Published 25 days ago • 7
BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution Paper • 2510.08697 • Published 16 days ago • 31
What Language Model to Train if You Have One Million GPU Hours? Paper • 2210.15424 • Published Oct 27, 2022 • 2
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model Paper • 2211.05100 • Published Nov 9, 2022 • 34
Enhancing Few-shot Text-to-SQL Capabilities of Large Language Models: A Study on Prompt Design Strategies Paper • 2305.12586 • Published May 21, 2023
Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG Paper • 2507.20136 • Published Jul 27
Multilingual State Space Models for Structured Question Answering in Indic Languages Paper • 2502.01673 • Published Feb 1 • 2
A Multi-Task Benchmark for Abusive Language Detection in Low-Resource Settings Paper • 2505.12116 • Published May 17
Deep learning for affective computing: text-based emotion recognition in decision support Paper • 1803.06397 • Published Mar 16, 2018
Deep contextualized word representations for detecting sarcasm and irony Paper • 1809.09795 • Published Sep 26, 2018
When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research Paper • 2505.11855 • Published May 17 • 10
Massive-STEPS: Massive Semantic Trajectories for Understanding POI Check-ins -- Dataset and Benchmarks Paper • 2505.11239 • Published May 16