Pinned datasets & models
- mvp-lab/LLaVA-OneVision-1.5-Instruct-Data (dataset, ~21.8M rows)
- mvp-lab/LLaVA-OneVision-1.5-Mid-Training-85M (dataset, ~86.8M rows)
- lmms-lab/LLaVA-OneVision-1.5-8B-Instruct (image-text-to-text, 9B params)
- lmms-lab/LLaVA-OneVision-1.5-4B-Instruct (image-text-to-text, 5B params)
AI & ML interests
Feeling and building multimodal intelligence.
Recent Activity
We are looking for collaborations and compute donations! Contact [email protected] if you want to vibe-research with us.
- [2025-9] 🔥🔥 Introducing LLaVA-OneVision-1.5: a novel family of fully open-source Large Multimodal Models (LMMs) that achieves state-of-the-art performance at substantially lower cost by training on native-resolution images (a minimal loading sketch follows after this list).
- [2025-9] 🔥🔥 Introducing LLaVA-Critic-R1: a family of generative critic VLMs trained with GRPO on pairwise critic data. LLaVA-Critic-R1 not only demonstrates strong critic capability but also achieves SoTA policy performance at the 7B scale.
- [2025-4] 🔈🔈 Introducing Aero-1-Audio: a compact audio model adept at various audio tasks, including speech recognition, audio understanding, and following audio instructions. 📚 Blog | 🤗 Model Checkpoints | 📖 Evaluation Results | 📚 Cookbook
- [2025-3] 👓👓 Introducing EgoLife: Towards Egocentric Life Assistant. For one week, six individuals lived together, capturing every moment through AI glasses and creating the EgoLife dataset. Based on this, we build models and benchmarks to drive the future of AI life assistants capable of recalling past events, tracking habits, and providing personalized, long-context assistance to enhance daily life.
- [2025-1] 🎬🎬 Introducing Video-MMMU: Evaluating Knowledge Acquisition from Professional Videos. Spanning six professional disciplines (Art, Business, Science, Medicine, Humanities, Engineering) and 30 diverse subjects, Video-MMMU challenges models to learn and apply college-level knowledge from videos.
- [2024-11] 🔔🔔 We are excited to introduce LMMs-Eval/v0.3.0, focusing on audio understanding. Building upon LMMs-Eval/v0.2.0, we have added audio models and tasks. Now, LMMs-Eval provides a consistent evaluation toolkit across image, video, and audio modalities. 
- [2024-11] 🤯🤯 We introduce Multimodal SAE, the first framework designed to interpret learned features in large-scale multimodal models using Sparse Autoencoders. Through our approach, we leverage LLaVA-OneVision-72B to analyze and explain the SAE-derived features of LLaVA-NeXT-LLaMA3-8B. Furthermore, we demonstrate the ability to steer model behavior by clamping specific features to alleviate hallucinations and avoid safety-related issues. 
- [2024-10] 🔥🔥 We present LLaVA-Critic, the first open-source large multimodal model designed as a generalist evaluator for assessing LMM-generated responses across diverse multimodal tasks and scenarios.
- [2024-10] 🎬🎬 Introducing LLaVA-Video, a family of open large multimodal models designed specifically for advanced video understanding. We're open-sourcing LLaVA-Video-178K, a high-quality, synthetic dataset for video instruction tuning.
- [2024-08] 🤞🤞 We present LLaVA-OneVision, a family of LMMs developed by consolidating insights into data, models, and visual representations.
- [2024-06] 🧑‍🎨🧑‍🎨 We release LLaVA-NeXT-Interleave, an LMM extending capabilities to real-world settings: Multi-image, Multi-frame (videos), Multi-view (3D), and Multi-patch (single-image).
- [2024-06] 🚀🚀 We release LongVA, a long-context language model with state-of-the-art video understanding performance.
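For quick experimentation with the LLaVA-OneVision-1.5 Instruct checkpoints, here is a minimal loading sketch using the generic Transformers image-text-to-text pipeline. It assumes the checkpoints are compatible with that pipeline interface (they are tagged image-text-to-text above); the image URL, generation settings, and the `trust_remote_code` flag are placeholders and assumptions, so check the model cards for the exact loading code.

```python
from transformers import pipeline

# Minimal sketch: run an Instruct checkpoint through the generic
# image-text-to-text pipeline. The exact loading path may differ by
# transformers version -- see the model card for the reference snippet.
pipe = pipeline(
    "image-text-to-text",
    model="lmms-lab/LLaVA-OneVision-1.5-8B-Instruct",
    trust_remote_code=True,  # assumption: the checkpoint may ship custom code
)

# Chat-style input: one user turn with an image (placeholder URL) and a question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/sample.jpg"},  # placeholder
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

outputs = pipe(text=messages, max_new_tokens=64)
print(outputs[0]["generated_text"])
```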
Older Updates (2024-06 and earlier)
- [2024-06] 🎬🎬 The lmms-eval/v0.2 toolkit now supports video evaluations for models like LLaVA-NeXT Video and Gemini 1.5 Pro.
- [2024-05] 🚀🚀 We release LLaVA-NeXT Video, a model performing at Google's Gemini level on video understanding tasks.
- [2024-05] 🚀🚀 The LLaVA-NeXT model family reaches near GPT-4V performance on multimodal benchmarks, with models up to 110B parameters.
- [2024-03] We release lmms-eval, a toolkit for holistic evaluations with 50+ multimodal datasets and 10+ models (a minimal invocation sketch follows below).
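To illustrate how an lmms-eval run is typically launched, the sketch below drives its command-line entry point from Python. The model adapter name, task name, and flags are assumptions that vary across lmms-eval versions; check `python -m lmms_eval --help` and the repo's task/model registries for the exact interface.

```python
import subprocess

# Sketch of a typical lmms-eval invocation (python -m lmms_eval).
# Adapter/task names and flags below are assumptions; they differ by version.
cmd = [
    "python", "-m", "lmms_eval",
    "--model", "llava_onevision",                                   # assumed adapter name
    "--model_args", "pretrained=lmms-lab/LLaVA-OneVision-1.5-8B-Instruct",
    "--tasks", "mme",                                               # any registered benchmark
    "--batch_size", "1",
    "--log_samples",
    "--output_path", "./logs/",
]
subprocess.run(cmd, check=True)  # raises if the evaluation run fails
```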
Models (61 on the Hub; selected checkpoints below)
- lmms-lab/LLaVA-OneVision-1.5-8B-Instruct
- lmms-lab/LLaVA-OneVision-1.5-4B-Instruct
- lmms-lab/LLaVA-OneVision-1.5-4B-Base
- lmms-lab/LLaVA-OneVision-1.5-8B-Base
- lmms-lab/LLaVA-OneVision-1.5-4B-stage0
- lmms-lab/LLaVA-OneVision-1.5-8B-stage0
- lmms-lab/BAGEL-7B-MoT-ver.LE
- lmms-lab/LLaVA-Critic-R1-7B-LLaMA32v
- lmms-lab/LLaVA-Critic-R1-7B-Plus-Mimo