# Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation

**[Paper](https://arxiv.org/abs/2509.19296), [Project Page](https://research.nvidia.com/labs/toronto-ai/lyra/)**

[Sherwin Bahmani](https://sherwinbahmani.github.io/), [Tianchang Shen](https://www.cs.toronto.edu/~shenti11/), [Jiawei Ren](https://jiawei-ren.github.io/), [Jiahui Huang](https://huangjh-pub.github.io/), [Yifeng Jiang](https://cs.stanford.edu/~yifengj/), [Haithem Turki](https://haithemturki.com/), [Andrea Tagliasacchi](https://theialab.ca/), [David B. Lindell](https://davidlindell.com/), [Zan Gojcic](https://zgojcic.github.io/), [Sanja Fidler](https://www.cs.utoronto.ca/~fidler/), [Huan Ling](https://www.cs.toronto.edu/~linghuan/), [Jun Gao](https://www.cs.toronto.edu/~jungao/), [Xuanchi Ren](https://xuanchiren.com/)
### Description:
Lyra is a feed-forward 3D / 4D Gaussian Splatting (3DGS) reconstruction model. We achieve this by distilling a pre-trained video diffusion model into a feed-forward 3DGS generator. Lyra circumvents the need for 3D datasets or model fine-tuning by leveraging the inherent 3D consistency of video outputs to supervise the 2D renderings of a 3DGS decoder. Using the generated synthetic multi-view data, we train a decoder that operates directly in the video model's latent space and produces 3D Gaussians. This framework enables real-time rendering and establishes a new state of the art for 3D / 4D scene generation, supporting both single-view and video inputs.

This model is ready for commercial use.

### License/Terms of Use
This model is released under the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). For a custom license, please contact cosmos-license@nvidia.com.

Important Note: If you bypass, disable, reduce the efficacy of, or circumvent any technical limitation, safety guardrail or associated safety guardrail hyperparameter, encryption, security, digital rights management, or authentication mechanism contained in the Model, your rights under the NVIDIA Open Model License Agreement will automatically terminate.

### Deployment Geography:
Global

### Use Case:
This model is intended for researchers developing 3D / 4D reconstruction techniques, and it allows them to generate a 3D / 4D scene from a single image.

### Release Date:
GitHub 09/23/2025 via https://github.com/nv-tlabs/lyra
## Reference(s):
Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation
**[Paper](https://arxiv.org/abs/2509.19296), [Project Page](https://research.nvidia.com/labs/toronto-ai/lyra/)**

## Model Architecture:
**Architecture Type:** Convolutional Neural Network (CNN), Transformer
**Network Architecture:** Transformer
This model was developed based on [Cosmos Predict 1](https://github.com/nvidia-cosmos/cosmos-predict1/tree/main).
Number of model parameters: 32.75M

## Input:
**Input Type(s):** Camera Parameters, Image
**Input Format(s):** One-Dimensional (1D) Array of Camera Poses, Two-Dimensional (2D) Array of Images.
**Input Parameters:** Camera Poses (1D), Images (2D)
**Other Properties Related to Input:** The input image should be at 720 × 1080 resolution, and we recommend providing 121 frames of camera parameters.
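Below is a minimal sketch of how these inputs might be assembled with NumPy. The way extrinsics and intrinsics are packed into the per-frame 1D pose array is an assumption for illustration only; consult the repository for the exact format the released code expects.

```python
import numpy as np

# Hypothetical input-preparation sketch; array layouts are illustrative,
# not the repository's actual API.
NUM_FRAMES = 121           # recommended number of camera poses
HEIGHT, WIDTH = 720, 1080  # expected input image resolution

# Single conditioning image: (H, W, 3), values in [0, 1].
image = np.zeros((HEIGHT, WIDTH, 3), dtype=np.float32)

# Per-frame camera parameters flattened to a 1D vector, e.g. a 4x4
# world-to-camera matrix (16 values) plus pinhole intrinsics (fx, fy, cx, cy).
extrinsics = np.tile(np.eye(4, dtype=np.float32).reshape(1, 16), (NUM_FRAMES, 1))
intrinsics = np.tile(
    np.array([[500.0, 500.0, WIDTH / 2, HEIGHT / 2]], dtype=np.float32),
    (NUM_FRAMES, 1),
)
camera_poses = np.concatenate([extrinsics, intrinsics], axis=1)  # (121, 20)

print(image.shape, camera_poses.shape)  # (720, 1080, 3) (121, 20)
```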
## Output:
**Output Type(s):** Three-Dimensional (3D) Gaussian Scene
**Output Format:** Point cloud file (e.g., .ply)
**Output Parameters:** A set of M 3D Gaussians, where each Gaussian is defined by a collection of attributes.
**Other Properties Related to Output:** The output is not a sequence of 2D images but a set of 3D primitives used to render a scene. For each of the M Gaussians, the key properties are:
- Position (Mean): A 3D vector (x,y,z) defining the center of the Gaussian in 3D space.
- Covariance (Shape & Orientation): This defines the ellipsoid's shape and rotation. It's typically stored as a 3D scale vector (s_x, s_y, s_z) and a 4D rotation quaternion (r_w, r_x, r_y, r_z).
- Color: A 3-vector (R,G,B) representing the color of the Gaussian. This can also be represented by more complex Spherical Harmonics (SH) coefficients for view-dependent color effects.
- Opacity: A scalar value (α) that controls the transparency of the Gaussian.
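The per-Gaussian attributes listed above map naturally onto a simple container. The following is a schematic sketch only: the field names, and the simplification of storing a single RGB color instead of full SH coefficients, are illustrative assumptions rather than the exact on-disk .ply schema.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class GaussianScene:
    """Illustrative container for a set of M 3D Gaussians."""
    positions: np.ndarray  # (M, 3) Gaussian centers (x, y, z)
    scales: np.ndarray     # (M, 3) per-axis scales (s_x, s_y, s_z)
    rotations: np.ndarray  # (M, 4) unit quaternions (r_w, r_x, r_y, r_z)
    colors: np.ndarray     # (M, 3) RGB colors (or SH coefficients for view dependence)
    opacities: np.ndarray  # (M,)  alpha values in [0, 1]


# Example: a scene containing a single Gaussian at the origin.
scene = GaussianScene(
    positions=np.zeros((1, 3), dtype=np.float32),
    scales=np.full((1, 3), 0.01, dtype=np.float32),
    rotations=np.array([[1.0, 0.0, 0.0, 0.0]], dtype=np.float32),
    colors=np.array([[1.0, 0.5, 0.0]], dtype=np.float32),
    opacities=np.array([0.9], dtype=np.float32),
)
print(scene.positions.shape, scene.opacities.shape)
```

When the scene is exported to a point cloud file such as .ply, these attributes are typically stored as per-vertex properties; the exact property names depend on the exporter.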
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems such as the A100 and H100. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
## Software Integration:
**Runtime Engine(s):**
* [Cosmos-Predict1](https://github.com/nvidia-cosmos/cosmos-predict1)
**Supported Hardware Microarchitecture Compatibility:**
* NVIDIA Ampere
* NVIDIA Blackwell
* NVIDIA Hopper
**Preferred/Supported Operating System(s):**
* [Linux]
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

## Model Version(s):
- V1.0

# Inference:
**Acceleration Engine:** [Cosmos-Predict1](https://github.com/nvidia-cosmos/cosmos-predict1)
**Test Hardware:**
* NVIDIA Ampere
* NVIDIA Blackwell
* NVIDIA Hopper
## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards. Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

### Plus Plus (++) Promise
We value you, the datasets, the diversity they represent, and what we have been entrusted with. This model and its associated data have been:
- Verified to comply with current applicable disclosure laws, regulations, and industry standards.
- Verified to comply with applicable privacy labeling requirements.
- Annotated to describe the collector/source (NVIDIA or a third-party).
- Characterized for technical limitations.
- Reviewed to ensure proper disclosure is accessible to, maintained for, and in compliance with NVIDIA data subjects and their requests.
- Reviewed before release.
- Tagged for known restrictions and potential safety implications.

### Bias

Field | Response
:---|:---
Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing: | None
Measures taken to mitigate against unwanted bias: | None

### Explainability

Field | Response
:---|:---
Intended Task/Domain: | Novel view synthesis, video generation
Model Type: | Transformer
Intended Users: | Physical AI developers.
Output: | Three-Dimensional (3D) Gaussian Scene.
Describe how the model works: | We take a video as input and encode it with the Cosmos tokenizer into the Cosmos latent space. Our model, a transformer-like architecture, then lifts the latents to 3D Gaussians.
Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Not Applicable.
Technical Limitations & Mitigation: | The proposed method relies only on synthetic data generated by Cosmos for training, which might limit its generalization ability if the target scenario is not covered by the pre-generated SDG dataset.
Verified to have met prescribed NVIDIA quality standards: | Yes
Performance Metrics: | Qualitative and quantitative evaluation, including PSNR, SSIM, and LPIPS metrics.
Potential Known Risks: | This model is trained only on synthetic data generated by Cosmos and may inaccurately reconstruct an out-of-distribution video that is not in the synthetic data domain.
Licensing: | [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license)

### Privacy

Field | Response
:---|:---
Generatable or reverse engineerable personal data? | No
Personal data used to create this model? | [None Known]
Is there provenance for all datasets used in training? | Yes
How often is dataset reviewed? | Before Release
Does data labeling (annotation, metadata) comply with privacy laws? | Not Applicable
Is data compliant with data subject requests for data correction or removal, if such a request was made? | No, not possible with externally-sourced data.
Applicable Privacy Policy | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/

### Safety

Field | Response
:---|:---
Model Application Field(s): | World Generation
Describe the life critical impact (if present). | Not Applicable
Use Case Restrictions: | Abide by [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license)
Model and dataset restrictions: | The Principle of Least Privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to.

## Citation

```
@inproceedings{bahmani2025lyra,
  title={Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation},
  author={Bahmani, Sherwin and Shen, Tianchang and Ren, Jiawei and Huang, Jiahui and Jiang, Yifeng and Turki, Haithem and Tagliasacchi, Andrea and Lindell, David B. and Gojcic, Zan and Fidler, Sanja and Ling, Huan and Gao, Jun and Ren, Xuanchi},
  booktitle={arXiv preprint arXiv:2509.19296},
  year={2025}
}
```