Mike Ravkine PRO

mike-ravkine

AI & ML interests

LLM Research / Development / Evaluation

Recent Activity

posted an update about 16 hours ago
Spooky season is coming 👻 and there's nothing scarier than poor LLM evaluation results, right?

The ReasonScape m12x dataset, explorer, and leaderboard have been updated with 12 additional models, bringing the total to *54* models and over *7.1B thinking tokens*. The groups filter has also been split into two, family and size, which lets us take a peek at our first small-models-only reasoning results!

The top performer, https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507, has an enormous overthink problem, so I would actually give the crown to https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507 - simply ask it to think step-by-step to bring out its latent hybrid-CoT behavior and enjoy stellar performance at a fraction of the tokens!

Right below the Qwen3-4B set, and sitting just above https://huggingface.co/Qwen/Qwen3-1.7B, is a model from a somewhat underrated family: https://huggingface.co/tencent/Hunyuan-4B-Instruct. Note that vLLM 0.11.0 has trouble with Hunyuan dense models (it seems to only support the MoE variants), so I used 0.10.2 for my evaluations.

In 7th place and worth a mention is https://huggingface.co/HuggingFaceTB/SmolLM3-3B, a very efficient smaller thinker, but its Achilles' heel was the strawberry test: it fails both letter counting and word sorting.

The last model worth discussing in this context is https://huggingface.co/google/gemma-3-4b-it, which is obviously not a reasoning model, but when asked to think step-by-step it demonstrated tolerable performance across several tasks with incredibly low token utilization compared to most of these little guys.

Would love to hear from the community! Do you use any of these models in your day-to-day? Did I miss any? Let me know!

Full leaderboard @ https://reasonscape.com/m12x/leaderboard/
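The "ask it to think step-by-step" trick above can be sketched as a plain system-prompt wrapper over an OpenAI-compatible endpoint, such as a local vLLM server. This is a minimal illustration, not the exact prompt or harness used in the ReasonScape evaluations; the instruction text, base URL, and model name are assumptions.

```python
# Hypothetical instruction text - the exact wording used in the
# ReasonScape runs is not specified in the post.
STEP_BY_STEP = "Think step-by-step before giving your final answer."


def build_messages(question: str) -> list[dict]:
    """Prepend the step-by-step instruction as a system message."""
    return [
        {"role": "system", "content": STEP_BY_STEP},
        {"role": "user", "content": question},
    ]


# Example request against a local vLLM server (assumed setup;
# requires `pip install openai` and a running server):
#
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
# reply = client.chat.completions.create(
#     model="Qwen/Qwen3-4B-Instruct-2507",
#     messages=build_messages("Sort alphabetically: pear, apple, fig"),
# )
# print(reply.choices[0].message.content)
```

The same wrapper applies to gemma-3-4b-it, which the post reports also benefits from being asked to think step-by-step despite not being a reasoning model.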
Organizations

None yet