deepseek-ai/DeepSeek-OCR is out! 🔥 my take ⤵️ > pretty insane it can parse and re-render charts in HTML > it uses CLIP and SAM features concatenated, so better grounding > very efficient per vision tokens/performance ratio > covers 100 languages
IBM just released small swiss army knife for the document models: granite-docling-258M on Hugging Face 🔥
> not only a document converter but also can do document question answering, understand multiple languages 🤯 > best part: released with Apache 2.0 license 👏 use it with your commercial projects! > it supports transformers, vLLM and MLX from the get-go! 🤗 > built on SigLIP2 & granite-165M
Fine-tune Gemma3n on videos with audios inside with Colab A100 🔥 Just dropped the notebook where you can learn how to fine-tune Gemma3n on images+audio+text at the same time!
keep in mind, it's made for educational purposes 🫡 we do LoRA, audio resampling & video downsampling to be able to train <40GB VRAM stretch modalities and unfreeze layers as you wish! 🙏🏻 merve/smol-vision