SyncMV4D: Synchronized Multi-view Joint Diffusion of Appearance and Motion for Hand-Object Interaction Synthesis
Abstract
SyncMV4D generates realistic and consistent multi-view 3D Hand-Object Interaction videos and 4D motions by integrating visual priors, motion dynamics, and multi-view geometry.
Hand-Object Interaction (HOI) generation plays a critical role in advancing applications across animation and robotics. Current video-based methods are predominantly single-view, which impedes comprehensive 3D geometry perception and often results in geometric distortions or unrealistic motion patterns. While 3D HOI approaches can generate dynamically plausible motions, their dependence on high-quality 3D data captured in controlled laboratory settings severely limits their generalization to real-world scenarios. To overcome these limitations, we introduce SyncMV4D, the first model that jointly generates synchronized multi-view HOI videos and 4D motions by unifying visual priors, motion dynamics, and multi-view geometry. Our framework features two core innovations: (1) a Multi-view Joint Diffusion (MJD) model that co-generates HOI videos and intermediate motions, and (2) a Diffusion Points Aligner (DPA) that refines the coarse intermediate motion into globally aligned 4D metric point tracks. To tightly couple 2D appearance with 4D dynamics, we establish a closed-loop, mutually enhancing cycle: during the diffusion denoising process, the generated video conditions the refinement of the 4D motion, while the aligned 4D point tracks are reprojected to guide the next step of joint generation. Experimentally, our method outperforms state-of-the-art alternatives in visual realism, motion plausibility, and multi-view consistency.
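The abstract says the aligned 4D point tracks are reprojected to guide the next generation step, but it does not give the camera model or any implementation details. As a rough illustration only, the sketch below assumes a standard pinhole camera per view and projects a time-indexed set of 3D points into each view. All names here (`project_point`, `reproject_tracks`, the camera dictionary keys) are hypothetical and not taken from the SyncMV4D code.

```python
from typing import Dict, List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[float, float]


def project_point(
    point: Point3D,
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
    focal: float,
    principal: Tuple[float, float],
) -> Pixel:
    """Project one world-space 3D point to pixels with a pinhole camera (assumed model)."""
    # Camera-space coordinates: X_cam = R * X_world + t
    cam = [
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]
    if cam[2] <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division followed by the intrinsic scale and offset.
    u = focal * cam[0] / cam[2] + principal[0]
    v = focal * cam[1] / cam[2] + principal[1]
    return (u, v)


def reproject_tracks(
    tracks: List[List[Point3D]], cameras: List[Dict]
) -> List[List[List[Pixel]]]:
    """tracks[t][n] is point n at frame t; the result is indexed as pixels[view][t][n]."""
    return [
        [
            [project_point(p, cam["R"], cam["t"], cam["f"], cam["c"]) for p in frame]
            for frame in tracks
        ]
        for cam in cameras
    ]


# Toy example: one point moving along x over two frames, seen from two views.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tracks = [[(0.0, 0.0, 2.0)], [(0.1, 0.0, 2.0)]]
cameras = [
    {"R": IDENTITY, "t": [0.0, 0.0, 0.0], "f": 500.0, "c": (320.0, 240.0)},
    {"R": IDENTITY, "t": [-0.5, 0.0, 0.0], "f": 500.0, "c": (320.0, 240.0)},
]
pixels = reproject_tracks(tracks, cameras)
```

Because every view projects the same 3D tracks, the resulting 2D tracks agree across views by construction, which is one plausible reason a reprojection step could help multi-view consistency. How the method actually encodes or injects these projections into the diffusion model is not described in the abstract.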
Community
- TL;DR: A novel method for synchronously generating multi-view hand-object interaction videos and 4D motion.
- Project page at https://droliven.github.io/SyncMV4D/.
- Video demonstration: https://youtu.be/G7pda3nmV70.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion (2025)
- ShapeGen4D: Towards High Quality 4D Shape Generation from Videos (2025)
- PostCam: Camera-Controllable Novel-View Video Generation with Query-Shared Cross-Attention (2025)
- Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models (2025)
- MotionDuet: Dual-Conditioned 3D Human Motion Generation with Video-Regularized Text Learning (2025)
- Towards High-Consistency Embodied World Model with Multi-View Trajectory Videos (2025)
- WristWorld: Generating Wrist-Views via 4D World Models for Robotic Manipulation (2025)