Inference-time Physics Alignment of Video Generative Models with Latent World Models
Abstract
Latent world models improve the physical plausibility of generated videos through inference-time alignment and trajectory steering, achieving strong performance on challenging benchmarks.
State-of-the-art video generative models produce promising visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding from pre-training, we find that the shortfall in physics plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving the physics plausibility of video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, V-JEPA 2) as a reward to search over and steer multiple candidate denoising trajectories, enabling test-time compute to be scaled for better generation quality. Empirically, our approach substantially improves physics plausibility across image-conditioned, multi-frame-conditioned, and text-conditioned generation settings, validated by a human preference study. Notably, in the ICCV 2025 Perception Test Physics IQ Challenge, we achieve a final score of 62.64%, winning first place and outperforming the previous state of the art by 7.42%. Our work demonstrates the viability of using latent world models to improve the physics plausibility of video generation, beyond this specific instantiation or parameterization.
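To make the search step concrete, below is a minimal, hypothetical PyTorch sketch of the best-of-N variant of reward-guided sampling described in the abstract. The interfaces `world_model.encode`, `world_model.predict`, `model.denoise_step`, `model.decode`, and `model.latent_shape` are assumptions for illustration, not the paper's actual API, and the reward here is a simple negative latent-prediction-error ("surprise") proxy rather than the paper's exact WMReward; the full method also steers partial trajectories, which this sketch omits.

```python
# Hypothetical sketch: best-of-N denoising-trajectory search scored by a
# latent world model. All module interfaces are assumed, not the paper's API.

import torch


def world_model_reward(frames: torch.Tensor, world_model) -> torch.Tensor:
    """Score physics plausibility as negative 'surprise': embed each frame
    with the latent world model (e.g., a V-JEPA-2-style predictor), predict
    the next latent, and reward low prediction error."""
    z = world_model.encode(frames)             # (B, T, D) latents, assumed API
    z_pred = world_model.predict(z[:, :-1])    # predicted latents at t+1, assumed API
    surprise = (z_pred - z[:, 1:]).pow(2).mean(dim=(1, 2))
    return -surprise                           # higher reward = more predictable


@torch.no_grad()
def best_of_n_generation(model, world_model, cond, num_candidates=8, num_steps=50):
    """Sample several denoising trajectories and keep the one whose decoded
    video the latent world model scores as most physically plausible."""
    best_video, best_reward = None, -float("inf")
    for _ in range(num_candidates):
        x = torch.randn(1, *model.latent_shape)   # fresh noise per candidate
        for t in reversed(range(num_steps)):
            x = model.denoise_step(x, t, cond)    # one reverse-diffusion step
        video = model.decode(x)                   # latents -> pixel frames
        reward = world_model_reward(video, world_model).item()
        if reward > best_reward:
            best_video, best_reward = video, reward
    return best_video
```

Scoring whole trajectories, as above, is the simplest way to spend extra test-time compute; pruning or re-ranking candidates mid-trajectory trades some reward fidelity for a larger effective search budget.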
Community
I assume I can also use any arbitrary off-the-shelf metric as a substitute for 'surprise' then?🤣
Coming soon: Inference-time Overall Alignment of Video Generative Models with Random Seed.😎
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- PhysVideoGenerator: Towards Physically Aware Video Generation via Latent Physics Guidance (2026)
- What Happens Next? Next Scene Prediction with a Unified Video Model (2025)
- STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flows (2025)
- Video Generation Models Are Good Latent Reward Models (2025)
- TAGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment (2026)
- GrndCtrl: Grounding World Models via Self-Supervised Reward Alignment (2025)
- Generative Action Tell-Tales: Assessing Human Motion in Synthesized Videos (2025)