Play to Generalize: Learning to Reason Through Game Play
Abstract
Post-training multimodal large language models with reinforcement learning on arcade-like games enhances their multimodal reasoning abilities without domain-specific data.
Developing generalizable reasoning capabilities in multimodal large language models (MLLMs) remains challenging. Motivated by cognitive science literature suggesting that gameplay promotes transferable cognitive skills, we propose a novel post-training paradigm, Visual Game Learning, or ViGaL, where MLLMs develop out-of-domain generalization of multimodal reasoning through playing arcade-like games. Specifically, we show that post-training a 7B-parameter MLLM via reinforcement learning (RL) on simple arcade-like games, e.g. Snake, significantly enhances its downstream performance on multimodal math benchmarks like MathVista, and on multi-discipline questions like MMMU, without seeing any worked solutions, equations, or diagrams during RL, suggesting the capture of transferable reasoning skills. Remarkably, our model outperforms specialist models tuned on multimodal reasoning data in multimodal reasoning benchmarks, while preserving the base model's performance on general visual benchmarks, a challenge where specialist models often fall short. Our findings suggest a new post-training paradigm: synthetic, rule-based games can serve as controllable and scalable pre-text tasks that unlock generalizable multimodal reasoning abilities in MLLMs.
Community
Developing generalizable reasoning capabilities in multimodal large language models (MLLMs) remains challenging. Motivated by cognitive science literature suggesting that gameplay promotes transferable cognitive skills, we propose a novel post-training paradigm, Visual Game Learning, or ViGaL, where MLLMs develop out-of-domain generalization of multimodal reasoning through playing arcade-like games. Specifically, we show that post-training a 7B-parameter MLLM via reinforcement learning (RL) on simple arcade-like games, e.g. Snake, significantly enhances its downstream performance on multimodal math benchmarks like MathVista, and on multi-discipline questions like MMMU, without seeing any worked solutions, equations, or diagrams during RL, suggesting the capture of transferable reasoning skills. Remarkably, our model outperforms specialist models tuned on multimodal reasoning data in multimodal reasoning benchmarks, while preserving the base model's performance on general visual benchmarks, a challenge where specialist models often fall short. Our findings suggest a new post-training paradigm: synthetic, rule-based games can serve as controllable and scalable pre-text tasks that unlock generalizable multimodal reasoning abilities in MLLMs.
A video showing how our model beats Gemini-2.5-pro at playing the snake game. For more results, please see https://yunfeixie233.github.io/ViGaL/.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning (2025)
- Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning (2025)
- G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning (2025)
- KDRL: Post-Training Reasoning LLMs via Unified Knowledge Distillation and Reinforcement Learning (2025)
- Behavior Injection: Preparing Language Models for Reinforcement Learning (2025)
- Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning (2025)
- Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
 You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: 
@librarian-bot
	 recommend
Models citing this paper 2
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
 
					 
					 
						 
					 
					