- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 28
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 14
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23

Collections
Discover the best community collections!
Collections including paper arxiv:2506.16141 
						
					
- Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
  Paper • 2505.15966 • Published • 53
- GRIT: Teaching MLLMs to Think with Images
  Paper • 2505.15879 • Published • 12
- Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models
  Paper • 2505.16854 • Published • 11
- VLM-R^3: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought
  Paper • 2505.16192 • Published • 12

- RL + Transformer = A General-Purpose Problem Solver
  Paper • 2501.14176 • Published • 28
- Towards General-Purpose Model-Free Reinforcement Learning
  Paper • 2501.16142 • Published • 30
- SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
  Paper • 2501.17161 • Published • 123
- MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization
  Paper • 2412.12098 • Published • 4

- InfiR : Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning
  Paper • 2502.11573 • Published • 9
- Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking
  Paper • 2502.02339 • Published • 22
- video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
  Paper • 2502.11775 • Published • 9
- Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
  Paper • 2412.18319 • Published • 39

- Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
  Paper • 2506.05176 • Published • 74
- Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning
  Paper • 2506.04207 • Published • 48
- MiMo-VL Technical Report
  Paper • 2506.03569 • Published • 80
- UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
  Paper • 2506.03147 • Published • 58

- Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning
  Paper • 2504.08672 • Published • 55
- A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis
  Paper • 2504.12322 • Published • 28
- Learning to Reason under Off-Policy Guidance
  Paper • 2504.14945 • Published • 88
- TTRL: Test-Time Reinforcement Learning
  Paper • 2504.16084 • Published • 120