Collections
Discover the best community collections!
Collections including paper arxiv:2504.09641

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 28
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 14
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23

- Zhang199/TinyLLaVA-Video-R1
  Video-Text-to-Text • 4B • Updated • 43 • 4
- Zhang199/TinyLLaVA-Video-Coldstart_NextQA_16
  Video-Text-to-Text • 4B • Updated • 662 • 1
- Zhang199/TinyLLaVA-Video-R1-training-data
  Updated • 110 • 1
- TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
  Paper • 2504.09641 • Published • 16

- RL + Transformer = A General-Purpose Problem Solver
  Paper • 2501.14176 • Published • 28
- Towards General-Purpose Model-Free Reinforcement Learning
  Paper • 2501.16142 • Published • 30
- SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
  Paper • 2501.17161 • Published • 123
- MaxInfoRL: Boosting Exploration in Reinforcement Learning through Information Gain Maximization
  Paper • 2412.12098 • Published • 4

- Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
  Paper • 2412.18319 • Published • 39
- Token-Budget-Aware LLM Reasoning
  Paper • 2412.18547 • Published • 46
- Efficiently Serving LLM Reasoning Programs with Certaindex
  Paper • 2412.20993 • Published • 37
- B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
  Paper • 2412.17256 • Published • 47

- SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers
  Paper • 2407.09413 • Published • 11
- MAVIS: Mathematical Visual Instruction Tuning
  Paper • 2407.08739 • Published • 33
- Kvasir-VQA: A Text-Image Pair GI Tract Dataset
  Paper • 2409.01437 • Published • 71
- MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct
  Paper • 2409.05840 • Published • 49

- InfiR: Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning
  Paper • 2502.11573 • Published • 9
- Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking
  Paper • 2502.02339 • Published • 22
- video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
  Paper • 2502.11775 • Published • 9
- Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
  Paper • 2412.18319 • Published • 39

- R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
  Paper • 2503.10615 • Published • 17
- UniGoal: Towards Universal Zero-shot Goal-oriented Navigation
  Paper • 2503.10630 • Published • 6
- Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
  Paper • 2503.09516 • Published • 36
- LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
  Paper • 2503.07536 • Published • 88

- MLLM-as-a-Judge for Image Safety without Human Labeling
  Paper • 2501.00192 • Published • 31
- 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
  Paper • 2501.00958 • Published • 107
- Xmodel-2 Technical Report
  Paper • 2412.19638 • Published • 26
- HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
  Paper • 2412.18925 • Published • 104

- Apollo: An Exploration of Video Understanding in Large Multimodal Models
  Paper • 2412.10360 • Published • 147
- SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization
  Paper • 2501.01245 • Published • 5
- VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
  Paper • 2501.00599 • Published • 47
- Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
  Paper • 2501.08326 • Published • 33

- BLINK: Multimodal Large Language Models Can See but Not Perceive
  Paper • 2404.12390 • Published • 26
- TextSquare: Scaling up Text-Centric Visual Instruction Tuning
  Paper • 2404.12803 • Published • 30
- Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
  Paper • 2404.13013 • Published • 31
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
  Paper • 2404.06512 • Published • 30