- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters (Paper • 2402.04252 • Published • 28)
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models (Paper • 2402.03749 • Published • 14)
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding (Paper • 2402.04615 • Published • 44)
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss (Paper • 2402.05008 • Published • 23)
Collections
Collections including paper arxiv:2506.02387
- MLLM-as-a-Judge for Image Safety without Human Labeling (Paper • 2501.00192 • Published • 31)
- 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining (Paper • 2501.00958 • Published • 107)
- Xmodel-2 Technical Report (Paper • 2412.19638 • Published • 26)
- HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs (Paper • 2412.18925 • Published • 104)
- Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model (Paper • 2407.07053 • Published • 47)
- LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models (Paper • 2407.12772 • Published • 35)
- VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models (Paper • 2407.11691 • Published • 15)
- MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models (Paper • 2408.02718 • Published • 62)
- Gemini Robotics: Bringing AI into the Physical World (Paper • 2503.20020 • Published • 29)
- Magma: A Foundation Model for Multimodal AI Agents (Paper • 2502.13130 • Published • 58)
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents (Paper • 2311.05437 • Published • 51)
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents (Paper • 2410.23218 • Published • 49)
- Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers (Paper • 2409.04109 • Published • 48)
- Training Language Models to Self-Correct via Reinforcement Learning (Paper • 2409.12917 • Published • 140)
- Reward-Robust RLHF in LLMs (Paper • 2409.15360 • Published • 6)
- EuroLLM: Multilingual Language Models for Europe (Paper • 2409.16235 • Published • 28)