> Our analysis reveals two major patterns for open-weight thinking models -- Explorer and Late Landing. This finding provides evidence that over-verification and over-exploration are the primary drivers of overthinking in LLMs. Grounded in thought structures, we propose a utility-based definition of overthinking, which moves beyond length-based metrics. This revised definition offers a more insightful understanding of LLMs' thought progression, as well as practical guidelines for principled overthinking management.
Really cool paper! Unfortunately there's no code, so I may need to attempt to recreate TRACE on my own.
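Until then, here's my guess at a minimal utility-based overthinking check, sketched from the abstract alone - the scoring function and thresholds are my assumptions, not the paper's actual TRACE formulation:

```python
# Hypothetical utility-based overthinking check, inspired by the paper's idea.
# Assumption: we can score the model's running best answer after each thought
# step (e.g. with a verifier or reward model). "Overthinking" is then flagged
# once further steps stop improving that utility.

def overthinking_onset(utilities, eps=1e-3, patience=3):
    """Return the step index where utility stops improving, else None.

    utilities: per-thought-step scores of the model's running answer.
    eps: minimum improvement that still counts as progress.
    patience: how many flat/declining steps to tolerate before flagging.
    """
    best, flat = utilities[0], 0
    for i, u in enumerate(utilities[1:], start=1):
        if u > best + eps:
            best, flat = u, 0
        else:
            flat += 1
            if flat >= patience:
                return i - patience + 1  # first step of the flat run
    return None

# Example: utility plateaus after step 3, so steps 4+ count as overthinking.
print(overthinking_onset([0.2, 0.5, 0.8, 0.81, 0.80, 0.80, 0.79]))  # -> 4
```

The nice thing about a definition like this is that it is indifferent to raw length: a long trace that keeps improving its answer is exploration, not overthinking.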
A suite of 10 high-quality English legal IR datasets, designed by legal experts to set a new standard for comparing embedding models.
Whether you're exploring legal RAG on your home computer or running enterprise-scale retrieval, apples-to-apples evaluation is crucial. That's why we've open-sourced everything - including our 7 brand-new, hand-crafted retrieval datasets. All of these datasets are now live on Hugging Face.
Any guesses which embedding model leads on legal retrieval?
**Hint:** it's not OpenAI or Google - they place 7th and 9th on our leaderboard.
To do well on MLEB, embedding models must demonstrate both extensive legal domain knowledge and strong legal reasoning skills.
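For a quick sense of what apples-to-apples retrieval scoring involves, here's a minimal sketch with sentence-transformers - the model and documents are placeholders, and MLEB's own harness on Hugging Face is the authoritative setup:

```python
# Toy legal retrieval scoring: embed queries and corpus, rank by cosine
# similarity. Model and texts are stand-ins, not part of MLEB itself.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # swap in any contender

queries = ["What constitutes consideration in contract formation?"]
corpus = [
    "Consideration is a bargained-for benefit exchanged between contract parties.",
    "Adverse possession requires open and notorious occupation of land.",
]

q_emb = model.encode(queries, normalize_embeddings=True)
c_emb = model.encode(corpus, normalize_embeddings=True)
print(util.cos_sim(q_emb, c_emb))  # similarity matrix, shape (1, 2)
```

A real benchmark run would aggregate rankings like these into metrics such as nDCG@10 across every dataset in the suite.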
A rather interesting attribute of these models is that I have absolutely no idea what they are fine-tuned from, other than some kind of pre-small Mistrals! The non-NeMo 15B looks like Mistral's Pixtral 12B but with 8 more layers, while the NeMo 15B analogously looks like Mistral NeMo 12B but with 10 more layers and a smaller max context length.
The performance trade-offs between these two models are quite clear: the Nemotron provides ~30% shorter answers, but at the expense of totally collapsing under difficulty on 4 of the 12 tasks ... which all just happen to have "Math" in common, so it's pretty easy to point the finger at exactly where the price for the lower reasoning-token usage gets paid.
In principle, ServiceNow-AI/Apriel-1.5-15b-Thinker is multimodal and should be able to reason about image queries, but this is not something I have tried, as ReasonScape is not currently able to evaluate VLMs - perhaps a future improvement.
AutoRound keeps evolving its LLM quantization algorithm! After enhancing W2A16 quantization, we now offer a fast algorithm to generate mixed bits/data-type schemes (~2 minutes for 8B models), great for MXFP4 and W2A16. Learn more: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#autoscheme
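Based on the linked step_by_step guide, the Python entry point looks roughly like the sketch below - treat the exact `AutoScheme` arguments as assumptions and verify against the docs for your installed version:

```python
# Sketch of AutoRound's mixed-scheme search, per the linked AutoScheme docs.
# Argument names may differ by version; the model name is a placeholder.
from auto_round import AutoRound, AutoScheme

model_name = "Qwen/Qwen3-8B"  # any ~8B model

# Search for a per-layer mix of the listed schemes averaging ~3 bits,
# e.g. MXFP4 on sensitive layers and W2A16 where accuracy allows.
scheme = AutoScheme(avg_bits=3, options=("MXFP4", "W2A16"))
ar = AutoRound(model=model_name, scheme=scheme)
ar.quantize_and_save("./qwen3-8b-mixed")
```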
Hey, I have just uploaded 2 new datasets for code and scientific reasoning models:
1. ArXiv Papers (4.6 TB): a massive scientific corpus with papers and metadata across all domains. Perfect for training models on academic reasoning, literature review, and scientific knowledge mining. Link: nick007x/arxiv-papers
2. GitHub Code 2025 (1 TB): a comprehensive code dataset for code generation and analysis tasks. It mostly contains GitHub's top 1 million high-quality repos with 2+ stars. Link: nick007x/github-code-2025 (loading sketch below)
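Both should stream fine with the `datasets` library, so you can sample rows without downloading terabytes - the split and field names below are assumptions, so peek at the dataset cards first:

```python
# Stream a few rows from the code dataset without downloading the full 1 TB.
# The "train" split and record fields are assumptions - check the dataset card.
from datasets import load_dataset

ds = load_dataset("nick007x/github-code-2025", split="train", streaming=True)
for row in ds.take(3):
    print(row.keys())  # inspect available fields before committing to a schema
```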
Let's talk about one of the hidden gems in the ReasonScape evaluation results, lucky #13: aquif-ai/aquif-3.5-8B-Think
Built on top of the solid Qwen3-8B foundation, aquif-3.5-8B-Think successfully preserves the high performance of the original model while consuming 30-50% fewer reasoning tokens.
The most notable regression vs. the base model is in arithmetic - if your workload is math-heavy, this model's performance unfortunately collapses as complexity grows.
The interesting combination of awesome overall performance on identifying simple SVG shapes with a total inability to recognize more complex shapes like 'House' or 'Arrow' is behavior inherited directly from the base model (but with a ~20% improvement in token utilization).
If you like your reasoning models token-efficient, Aquif-3.5-8B-Think is well worth a spin.