# ZeroStereo: Zero-shot Stereo Matching from Single Images
This repository hosts the StereoGen model, a key component of the ZeroStereo framework. ZeroStereo introduces a novel pipeline for zero-shot stereo matching, capable of synthesizing high-quality right images from arbitrary single images. It achieves this by leveraging pseudo disparities generated by a monocular depth estimation model and fine-tuning a diffusion inpainting model to recover missing details while preserving semantic structure.
## Paper
The model was presented in the paper [ZeroStereo: Zero-shot Stereo Matching from Single Images](https://arxiv.org/abs/2501.08654).
### Abstract
State-of-the-art supervised stereo matching methods have achieved remarkable performance on various benchmarks. However, their generalization to real-world scenarios remains challenging due to the scarcity of annotated real-world stereo data. In this paper, we propose ZeroStereo, a novel stereo image generation pipeline for zero-shot stereo matching. Our approach synthesizes high-quality right images from arbitrary single images by leveraging pseudo disparities generated by a monocular depth estimation model. Unlike previous methods that address occluded regions by filling missing areas with neighboring pixels or random backgrounds, we fine-tune a diffusion inpainting model to recover missing details while preserving semantic structure. Additionally, we propose Training-Free Confidence Generation, which mitigates the impact of unreliable pseudo labels without additional training, and Adaptive Disparity Selection, which ensures a diverse and realistic disparity distribution while preventing excessive occlusion and foreground distortion. Experiments demonstrate that models trained with our pipeline achieve state-of-the-art zero-shot generalization across multiple datasets with only a dataset volume comparable to Scene Flow.
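To make the pipeline concrete, the sketch below illustrates the generic warp-then-inpaint idea behind stereo image generation: forward-warp the left image with a per-pixel disparity map and record which pixels become occluded. This is an illustrative NumPy sketch, not the paper's implementation; in ZeroStereo the resulting holes are filled by the fine-tuned diffusion inpainting model rather than by neighboring pixels or random backgrounds.

```python
import numpy as np

def forward_warp(left, disparity):
    """Naively forward-warp a left image to the right view using a per-pixel
    disparity map (a conceptual sketch, not the official ZeroStereo code).

    left:      (H, W, 3) uint8 array, left image
    disparity: (H, W) float array, positive disparities in pixels
    Returns the warped right image and a mask of unfilled (occluded) pixels.
    """
    h, w = disparity.shape
    right = np.zeros_like(left)
    filled = np.full((h, w), -np.inf)  # track largest disparity written so far
    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.round(xs - disparity).astype(int)  # rectified stereo: x_right = x_left - d
    valid = (xt >= 0) & (xt < w)
    for y, x, x2, d in zip(ys[valid], xs[valid], xt[valid], disparity[valid]):
        if d > filled[y, x2]:  # on collisions, the closer (larger-disparity) pixel wins
            right[y, x2] = left[y, x]
            filled[y, x2] = d
    occlusion_mask = ~np.isfinite(filled)  # holes left for the inpainting model
    return right, occlusion_mask
```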
## Code
The official code, along with detailed instructions for fine-tuning, generation, training, and evaluation, can be found in the [GitHub repository](https://github.com/Windsrain/ZeroStereo).
## Pre-Trained Models
The following pre-trained models related to this project are available:
| Model | Link |
|---|---|
| SDv2I | Download 🤗 |
| StereoGen | Download 🤗 |
| Zero-RAFT-Stereo | Download 🤗 |
| Zero-IGEV-Stereo | Download 🤗 |
## Usage
You can load the StereoGen model using the `diffusers` library. Note that full inference involves additional pre-processing (input image, depth maps, masks) and may take multiple steps, so refer to the official GitHub repository for the complete pipeline.
First, ensure you have the diffusers library and its dependencies installed:
```bash
pip install diffusers transformers accelerate torch
```
Here's a basic example to load the pipeline:
```python
import torch
from diffusers import DiffusionPipeline
from PIL import Image

# Load the StereoGen pipeline.
# This model synthesizes a right stereo image from a single left input image.
pipeline = DiffusionPipeline.from_pretrained(
    "Windsrain/ZeroStereo", torch_dtype=torch.float16
)

# Move the pipeline to GPU if available.
if torch.cuda.is_available():
    pipeline.to("cuda")

# Placeholder input image; replace with your actual left image.
# For full usage (e.g., generating the required depth maps and masks),
# refer to the project's GitHub repository.
# input_image = Image.open("path/to/your/left_image.png").convert("RGB")
input_image = Image.new("RGB", (512, 512), color="blue")  # dummy image for demonstration

print("Model loaded successfully. For detailed inference and generation scripts,")
print("refer to the official GitHub repository: https://github.com/Windsrain/ZeroStereo")

# The actual inference call may require specific inputs (e.g., `image`,
# `depth_map`, `mask_image`) depending on the pipeline's internal
# implementation, as shown in the project's GitHub demo/generation scripts:
# generated_right_image = pipeline(image=input_image, depth_map=some_depth, mask_image=some_mask).images[0]
# generated_right_image.save("generated_stereo_right.png")
```
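For reference, one hedged way to obtain the pseudo disparity such a call would need is to run an off-the-shelf monocular depth model through the `transformers` depth-estimation pipeline. The model id, the normalization, and `max_disp` below are illustrative assumptions, not values from the paper; the actual pre-processing ZeroStereo uses lives in the GitHub generation scripts.

```python
from transformers import pipeline as hf_pipeline

# Hypothetical model choice: Depth Anything V2 is one of the projects this
# work builds on, but the exact model and pre-processing are defined in the
# official generation scripts.
depth_estimator = hf_pipeline(
    "depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf"
)
rel = depth_estimator(input_image)["predicted_depth"]  # relative, model-dependent scale

# Normalize to [0, 1] and rescale to a target disparity range. This assumes a
# disparity-like (inverse-depth) prediction, as is common for relative
# monocular models; in the paper the range itself is chosen by
# Adaptive Disparity Selection.
rel = (rel - rel.min()) / (rel.max() - rel.min() + 1e-6)
max_disp = 64.0  # illustrative value, not from the paper
pseudo_disparity = (rel * max_disp).squeeze().numpy()
```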
## Acknowledgement
This project is based on MfS-Stereo, Depth Anything V2, Marigold, RAFT-Stereo, and IGEV-Stereo. We thank the original authors for their excellent works.
## Citation
If you find this work helpful, please cite the paper:
```bibtex
@article{wang2025zerostereo,
  title={ZeroStereo: Zero-shot Stereo Matching from Single Images},
  author={Wang, Xianqi and Yang, Hao and Xu, Gangwei and Cheng, Junda and Lin, Min and Deng, Yong and Zang, Jinliang and Chen, Yurui and Yang, Xin},
  journal={arXiv preprint arXiv:2501.08654},
  year={2025},
}
```
