Upload folder using huggingface_hub
- README.md +142 -3
- config.json +60 -0
- model.safetensors +3 -0
README.md CHANGED

@@ -1,3 +1,142 @@
- ---
- license:
-

---
license: apache-2.0
tags:
- depth-estimation
- computer-vision
- monocular-depth
- multi-view-geometry
- pose-estimation
library_name: depth-anything-3
pipeline_tag: depth-estimation
---

# Depth Anything 3: DA3-LARGE

<div align="center">

[Project Page](https://depth-anything-3.github.io)
[Paper](https://arxiv.org/abs/)
[Demo](https://huggingface.co/spaces/depth-anything/Depth-Anything-3)

</div>

## Model Description

DA3-Large is a foundation model for multi-view depth estimation and camera pose estimation, built on a unified depth-ray representation.

| Property | Value |
|----------|-------|
| **Model Series** | Any-view Model |
| **Parameters** | 0.35B |
| **License** | Apache 2.0 |

## Capabilities

- ✅ Relative Depth
- ✅ Pose Estimation
- ✅ Pose Conditioning

## Quick Start

### Installation

```bash
git clone https://github.com/ByteDance-Seed/depth-anything-3
cd depth-anything-3
pip install -e .
```

### Basic Example

```python
import torch
from depth_anything_3.api import DepthAnything3

# Load model from Hugging Face Hub
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DepthAnything3.from_pretrained("depth-anything/da3-large")
model = model.to(device=device)

# Run inference on images
images = ["image1.jpg", "image2.jpg"]  # List of image paths, PIL Images, or numpy arrays
prediction = model.inference(
    images,
    export_dir="output",
    export_format="glb",  # Options: glb, npz, ply, mini_npz, gs_ply, gs_video
)

# Access results
print(prediction.depth.shape)       # Depth maps: [N, H, W] float32
print(prediction.conf.shape)        # Confidence maps: [N, H, W] float32
print(prediction.extrinsics.shape)  # Camera poses (w2c): [N, 3, 4] float32
print(prediction.intrinsics.shape)  # Camera intrinsics: [N, 3, 3] float32
```
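The depth maps, intrinsics, and world-to-camera extrinsics returned above are enough to lift every view into a shared world frame. Below is a minimal numpy sketch of that unprojection; it is not part of the `depth_anything_3` API (the helper name is hypothetical), and it assumes a standard pinhole model plus the shapes and w2c convention documented in the comments above. Convert torch tensors to numpy first if needed.

```python
# Minimal sketch: unproject each depth map into world coordinates.
# Assumptions: pinhole intrinsics K, extrinsics [R | t] mapping world -> camera.
import numpy as np

def depth_to_world_points(depth, intrinsics, extrinsics):
    """depth: [N, H, W], intrinsics: [N, 3, 3], extrinsics: [N, 3, 4] (w2c)."""
    N, H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))                  # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # [3, H*W]
    points = []
    for i in range(N):
        rays = np.linalg.inv(intrinsics[i]) @ pix          # camera rays at z = 1
        cam_pts = rays * depth[i].reshape(1, -1)           # scale rays by depth
        R, t = extrinsics[i][:, :3], extrinsics[i][:, 3:]  # w2c rotation, translation
        world_pts = R.T @ (cam_pts - t)                    # invert: X_w = R^T (X_c - t)
        points.append(world_pts.T)                         # [H*W, 3]
    return np.concatenate(points, axis=0)

# Usage with the prediction above (hypothetical; arrays assumed convertible to numpy):
# cloud = depth_to_world_points(np.asarray(prediction.depth),
#                               np.asarray(prediction.intrinsics),
#                               np.asarray(prediction.extrinsics))
```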
### Command Line Interface

```bash
# Process images with auto mode
da3 auto path/to/images \
    --export-format glb \
    --export-dir output \
    --model-dir depth-anything/da3-large

# Use backend for faster repeated inference
da3 backend --model-dir depth-anything/da3-large
da3 auto path/to/images --export-format glb --use-backend
```

## Model Details

- **Developed by:** ByteDance Seed Team
- **Model Type:** Vision Transformer for Visual Geometry
- **Architecture:** Plain transformer with unified depth-ray representation
- **Training Data:** Public academic datasets only

### Key Insights

💎 A **single plain transformer** (e.g., a vanilla DINO encoder) is sufficient as a backbone, without architectural specialization.

✨ A single **depth-ray representation** obviates the need for complex multi-task learning.

## Performance

🏆 Depth Anything 3 significantly outperforms:
- **Depth Anything 2** for monocular depth estimation
- **VGGT** for multi-view depth estimation and pose estimation

For detailed benchmarks, please refer to our [paper](https://depth-anything-3.github.io).

## Limitations

- The model is trained only on public academic datasets and may struggle on certain domain-specific imagery
- Performance can vary with image quality, lighting conditions, and scene complexity

## Citation

If you find Depth Anything 3 useful in your research or projects, please cite:

```bibtex
@article{depthanything3,
  title={Depth Anything 3: Recovering the visual space from any views},
  author={Haotong Lin and Sili Chen and Jun Hao Liew and Donny Y. Chen and Zhenyu Li and Guang Shi and Jiashi Feng and Bingyi Kang},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}
```

## Links

- 🏠 [Project Page](https://depth-anything-3.github.io)
- 📄 [Paper](https://arxiv.org/abs/)
- 💻 [GitHub Repository](https://github.com/ByteDance-Seed/depth-anything-3)
- 🤗 [Hugging Face Demo](https://huggingface.co/spaces/depth-anything/Depth-Anything-3)
- 📚 [Documentation](https://github.com/ByteDance-Seed/depth-anything-3#-useful-documentation)

## Authors

[Haotong Lin](https://haotongl.github.io/) · [Sili Chen](https://github.com/SiliChen321) · [Junhao Liew](https://liewjunhao.github.io/) · [Donny Y. Chen](https://donydchen.github.io) · [Zhenyu Li](https://zhyever.github.io/) · [Guang Shi](https://scholar.google.com/citations?user=MjXxWbUAAAAJ&hl=en) · [Jiashi Feng](https://scholar.google.com.sg/citations?user=Q8iay0gAAAAJ&hl=en) · [Bingyi Kang](https://bingykang.github.io/)
config.json ADDED

@@ -0,0 +1,60 @@
```json
{
  "model_name": "da3-large",
  "config": {
    "__object__": {
      "path": "depth_anything_3.model.da3",
      "name": "DepthAnything3Net",
      "args": "as_params"
    },
    "net": {
      "__object__": {
        "path": "depth_anything_3.model.dinov2.dinov2",
        "name": "DinoV2",
        "args": "as_params"
      },
      "name": "vitl",
      "out_layers": [11, 15, 19, 23],
      "alt_start": 8,
      "qknorm_start": 8,
      "rope_start": 8,
      "cat_token": true
    },
    "head": {
      "__object__": {
        "path": "depth_anything_3.model.dualdpt",
        "name": "DualDPT",
        "args": "as_params"
      },
      "dim_in": 2048,
      "output_dim": 2,
      "features": 256,
      "out_channels": [256, 512, 1024, 1024]
    },
    "cam_enc": {
      "__object__": {
        "path": "depth_anything_3.model.cam_enc",
        "name": "CameraEnc",
        "args": "as_params"
      },
      "dim_out": 1024
    },
    "cam_dec": {
      "__object__": {
        "path": "depth_anything_3.model.cam_dec",
        "name": "CameraDec",
        "args": "as_params"
      },
      "dim_in": 2048
    }
  }
}
```
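For reference, here is a small stdlib-only sketch for inspecting this file. It rests on an inference from the structure rather than documented behavior: each `__object__` block appears to name a class via its `path` and `name` fields, and `"args": "as_params"` presumably means the sibling keys are passed as constructor parameters.

```python
# Hypothetical inspection helper: prints which class each "__object__" block refers to
# and which sibling keys would (presumably) be passed to it as parameters.
import json

def walk(node, prefix="config"):
    if isinstance(node, dict):
        obj = node.get("__object__")
        if obj is not None:
            params = [k for k in node if k != "__object__"]
            print(f"{prefix}: {obj['path']}.{obj['name']}({', '.join(params)})")
        for key, value in node.items():
            if key != "__object__":
                walk(value, f"{prefix}.{key}")

with open("config.json") as f:
    walk(json.load(f)["config"])

# For this config the output lists DepthAnything3Net at the top level, with
# net = DinoV2 ("vitl"), head = DualDPT, cam_enc = CameraEnc, cam_dec = CameraDec.
```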
model.safetensors ADDED

@@ -0,0 +1,3 @@
```
version https://git-lfs.github.com/spec/v1
oid sha256:eaf2ae06df55889ad23eb245c82e2dd2a30c0cbf7e3d873a118fa5ed27a3e421
size 1643843860
```
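This is a Git LFS pointer rather than the weights themselves: per the LFS spec, the `oid` is the SHA-256 of the resolved file (about 1.6 GB here). A quick way to check that a downloaded `model.safetensors` matches the pointer, using only the standard library:

```python
# Verify a downloaded model.safetensors against the LFS pointer's sha256 oid.
import hashlib

expected = "eaf2ae06df55889ad23eb245c82e2dd2a30c0cbf7e3d873a118fa5ed27a3e421"

h = hashlib.sha256()
with open("model.safetensors", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
        h.update(chunk)

print("OK" if h.hexdigest() == expected else "MISMATCH")
```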