Mike Ravkine PRO

mike-ravkine

AI & ML interests

LLM Research / Development / Evaluation

Recent Activity

posted an update about 16 hours ago
Spooky season is coming 👻 and there's nothing scarier than poor LLM evaluation results, right?

The ReasonScape m12x dataset, explorer, and leaderboard have been updated with 12 additional models, bringing the total to *54* models and over *7.1B thinking tokens*. The groups filter has also been split into two, family and size, which lets us take a peek at our first small-models-only reasoning results!

The top performer, https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507, has an enormous overthink problem, so I would actually give the crown to https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507 - simply ask it to think step-by-step to bring out its latent hybrid-CoT behavior and enjoy stellar performance at a fraction of the tokens!

Right below the Qwen3-4B set, and sitting just above https://huggingface.co/Qwen/Qwen3-1.7B, is a model from a somewhat underrated family: https://huggingface.co/tencent/Hunyuan-4B-Instruct. Note that vLLM 0.11.0 has trouble with Hunyuan dense models (it seems to only support the MoE variants), so I used 0.10.2 for my evaluations.

In 7th place and worth a mention is https://huggingface.co/HuggingFaceTB/SmolLM3-3B, a very efficient smaller thinker, but its Achilles' heel was the strawberry test: it fails both letter counting and word sorting.

The last model worth discussing in this context is https://huggingface.co/google/gemma-3-4b-it, which is obviously not a reasoning model, but when asked to think step-by-step it demonstrated tolerable performance across several tasks with incredibly low token utilization compared to most of these little guys.

Would love to hear from the community! Do you use any of these models in your day-to-day? Did I miss any? Let me know!

Full leaderboard @ https://reasonscape.com/m12x/leaderboard/
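The "ask it to think step-by-step" trick above can be sketched as a plain system-prompt wrapper over an OpenAI-compatible endpoint, such as a local vLLM server. This is a minimal illustration, not the exact prompt or harness used in the ReasonScape evaluations; the instruction text, base URL, and model name are assumptions.

```python
# Hypothetical instruction text - the exact wording used in the
# ReasonScape runs is not specified in the post.
STEP_BY_STEP = "Think step-by-step before giving your final answer."


def build_messages(question: str) -> list[dict]:
    """Prepend the step-by-step instruction as a system message."""
    return [
        {"role": "system", "content": STEP_BY_STEP},
        {"role": "user", "content": question},
    ]


# Example request against a local vLLM server (assumed setup;
# requires `pip install openai` and a running server):
#
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
# reply = client.chat.completions.create(
#     model="Qwen/Qwen3-4B-Instruct-2507",
#     messages=build_messages("Sort alphabetically: pear, apple, fig"),
# )
# print(reply.choices[0].message.content)
```

The same wrapper applies to gemma-3-4b-it, which the post reports also benefits from being asked to think step-by-step despite not being a reasoning model.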
Organizations

None yet