> Our analysis reveals two major patterns for open-weight thinking models -- Explorer and Late Landing. This finding provides evidence that over-verification and over-exploration are the primary drivers of overthinking in LLMs. Grounded in thought structures, we propose a utility-based definition of overthinking, which moves beyond length-based metrics. This revised definition offers a more insightful understanding of LLMs' thought progression, as well as practical guidelines for principled overthinking management.
Really cool paper! Unfortunately there's no code, so I may need to attempt to recreate TRACE on my own.
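Until then, here's my guess at a minimal utility-based overthinking check, sketched from the abstract alone - the scoring function and thresholds are my assumptions, not the paper's actual TRACE formulation:

```python
# Hypothetical utility-based overthinking check, inspired by the paper's idea.
# Assumption: we can score the model's running best answer after each thought
# step (e.g. with a verifier or reward model). "Overthinking" is then flagged
# once further steps stop improving that utility.

def overthinking_onset(utilities, eps=1e-3, patience=3):
    """Return the step index where utility stops improving, else None.

    utilities: per-thought-step scores of the model's running answer.
    eps: minimum improvement that still counts as progress.
    patience: how many flat/declining steps to tolerate before flagging.
    """
    best, flat = utilities[0], 0
    for i, u in enumerate(utilities[1:], start=1):
        if u > best + eps:
            best, flat = u, 0
        else:
            flat += 1
            if flat >= patience:
                return i - patience + 1  # first step of the flat run
    return None

# Example: utility plateaus after step 3, so steps 4+ count as overthinking.
print(overthinking_onset([0.2, 0.5, 0.8, 0.81, 0.80, 0.80, 0.79]))  # -> 4
```

The nice thing about a definition like this is that it is indifferent to raw length: a long trace that keeps improving its answer is exploration, not overthinking.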
A suite of 10 high-quality English legal IR datasets, designed by legal experts to set a new standard for comparing embedding models.
Whether you're exploring legal RAG on your home computer or running enterprise-scale retrieval, apples-to-apples evaluation is crucial. That's why we've open-sourced everything - including our 7 brand-new, hand-crafted retrieval datasets. All of these datasets are now live on Hugging Face.
Any guesses which embedding model leads on legal retrieval?
**Hint:** it's not OpenAI or Google - they place 7th and 9th on our leaderboard.
To do well on MLEB, embedding models must demonstrate both extensive legal domain knowledge and strong legal reasoning skills.
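For a quick sense of what apples-to-apples retrieval scoring involves, here's a minimal sketch with sentence-transformers - the model and documents are placeholders, and MLEB's own harness on Hugging Face is the authoritative setup:

```python
# Toy legal retrieval scoring: embed queries and corpus, rank by cosine
# similarity. Model and texts are stand-ins, not part of MLEB itself.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # swap in any contender

queries = ["What constitutes consideration in contract formation?"]
corpus = [
    "Consideration is a bargained-for benefit exchanged between contract parties.",
    "Adverse possession requires open and notorious occupation of land.",
]

q_emb = model.encode(queries, normalize_embeddings=True)
c_emb = model.encode(corpus, normalize_embeddings=True)
print(util.cos_sim(q_emb, c_emb))  # similarity matrix, shape (1, 2)
```

A real benchmark run would aggregate rankings like these into metrics such as nDCG@10 across every dataset in the suite.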
A rather interesting attribute of these models is that I have absolutely no idea what they are fine-tuned from, other than some kind of pre-small Mistrals! The non-NeMo 15B looks like Mistral's Pixtral 12B but with 8 more layers, while the NeMo 15B analogously looks like Mistral NeMo 12B but with 10 more layers and a smaller max context length.
The performance trade-offs between these two models are quite clear: the Nemotron provides ~30% shorter answers, but at the expense of totally collapsing under difficulty on 4 of the 12 tasks ... which all just happen to have "Math" in common, so it's pretty easy to point the finger at exactly where the price for the lower reasoning-token usage gets paid.
In principle, ServiceNow-AI/Apriel-1.5-15b-Thinker is multimodal and should be able to reason about image queries, but this is not something I have tried, as ReasonScape is not currently able to evaluate VLMs - perhaps a future improvement.
AutoRound keeps evolving its LLM quantization algorithm! After enhancing W2A16 quantization, we now offer a fast algorithm to generate mixed bits/data-type schemes (~2 minutes for 8B models), great for MXFP4 and W2A16. Learn more: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#autoscheme
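Based on the linked step_by_step guide, the Python entry point looks roughly like the sketch below - treat the exact `AutoScheme` arguments as assumptions and verify against the docs for your installed version:

```python
# Sketch of AutoRound's mixed-scheme search, per the linked AutoScheme docs.
# Argument names may differ by version; the model name is a placeholder.
from auto_round import AutoRound, AutoScheme

model_name = "Qwen/Qwen3-8B"  # any ~8B model

# Search for a per-layer mix of the listed schemes averaging ~3 bits,
# e.g. MXFP4 on sensitive layers and W2A16 where accuracy allows.
scheme = AutoScheme(avg_bits=3, options=("MXFP4", "W2A16"))
ar = AutoRound(model=model_name, scheme=scheme)
ar.quantize_and_save("./qwen3-8b-mixed")
```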
Hey, I have just uploaded 2 new datasets for code and scientific reasoning models:
1. ArXiv Papers (4.6 TB): a massive scientific corpus with papers and metadata across all domains. Perfect for training models on academic reasoning, literature review, and scientific knowledge mining. Link: nick007x/arxiv-papers
2. GitHub Code 2025 (1 TB): a comprehensive code dataset for code generation and analysis tasks. It mostly contains GitHub's top 1 million high-quality repos with 2+ stars. Link: nick007x/github-code-2025 (loading sketch below)
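Both should stream fine with the `datasets` library, so you can sample rows without downloading terabytes - the split and field names below are assumptions, so peek at the dataset cards first:

```python
# Stream a few rows from the code dataset without downloading the full 1 TB.
# The "train" split and record fields are assumptions - check the dataset card.
from datasets import load_dataset

ds = load_dataset("nick007x/github-code-2025", split="train", streaming=True)
for row in ds.take(3):
    print(row.keys())  # inspect available fields before committing to a schema
```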
Let's talk about one of the hidden gems in the ReasonScape evaluation results, lucky #13: aquif-ai/aquif-3.5-8B-Think
Built on top of the solid Qwen3-8B foundation, aquif-3.5-8B-Think successfully preserves the high performance of the original model while consuming 30-50% fewer reasoning tokens.
The most notable regression vs. the base model is in arithmetic - if your workload is math-heavy, this model's performance unfortunately collapses as complexity grows.
The interesting combination of awesome overall performance on identifying simple SVG shapes with a total inability to recognize more complex shapes like 'House' or 'Arrow' is behavior inherited directly from the base model (but with a ~20% improvement in token utilization).
If you like your reasoning models token-efficient, Aquif-3.5-8B-Think is well worth a spin.