---
title: Cross-Modal Object Comparison Tool
emoji: π
colorFrom: green
colorTo: yellow
sdk: docker
pinned: true
short_description: Demo of Image <-> 3D <-> Text retrieval tool for AI Challenge
license: mit
---
# Cross-Modal 3D Asset Retrieval & Comparison Tool
[MIT License](https://opensource.org/licenses/MIT) · [React](https://react.dev/) · [FastAPI](https://fastapi.tiangolo.com/) · [PyTorch](https://pytorch.org/)
A full-stack application for managing and analyzing multi-modal datasets of 3D models, images, and text descriptions. The tool uses deep learning models to compute an embedding for each asset and compares embeddings across modalities, which enables cross-modal search and retrieval.
The interface allows users to upload their own datasets, explore a pre-loaded shared dataset, and perform detailed comparisons to find the most similar assets, regardless of their original format.
---
## ✨ Key Features
- **Multi-Modal Dataset Management**: Upload `.zip` archives containing images (`.png`), text (`.txt`), and 3D models (`.stl`). The system automatically processes and indexes them.
- **Cloud & Local Datasets**: Seamlessly switch between a large, pre-processed shared dataset hosted on the server and local datasets stored securely in your browser's IndexedDB.
- **Interactive Content Viewer**:
  - A high-performance 3D viewer for `.stl` models with zoom/pan/rotate controls, powered by **Three.js**.
  - Integrated image and text viewers.
  - Fullscreen mode for detailed inspection of any asset.
- **🧠 Powerful Cross-Modal Comparison**:
  - **Dataset Item Search**: Select any item within a dataset to instantly see its top matches across all other modalities based on semantic similarity.
  - **Ad-Hoc Search**: Upload a new, external image, 3D model, or text snippet to find the most similar items within a selected dataset.
- **Full Analysis Export**: Download the complete, pre-computed similarity matrix for any processed dataset as a `.json` or `.csv` file for offline analysis and reporting.
- **⚡ Responsive & Modern UI**: A clean, fast, and intuitive user interface built with **React**, **TypeScript**, and **TailwindCSS**.
- **High-Performance Backend**: Powered by **FastAPI** and **PyTorch**, the backend is optimized for asynchronous operations and efficient deep learning inference.
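
As a rough illustration of the export feature, the sketch below serializes a labelled similarity matrix to JSON and CSV using only the Python standard library. The function names, the `labels`/`matrix` layout, and the sample values are assumptions made for this example; the export format the tool actually produces may differ.

```python
import csv
import io
import json


def matrix_to_json(labels, matrix):
    """Serialize a square similarity matrix together with its item labels."""
    return json.dumps({"labels": labels, "matrix": matrix}, indent=2)


def matrix_to_csv(labels, matrix):
    """Write a header row of labels, then one labelled row per item."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["item"] + labels)
    for label, row in zip(labels, matrix):
        writer.writerow([label] + [f"{value:.4f}" for value in row])
    return buffer.getvalue()


# Hypothetical three-item dataset; file names and scores are made up.
labels = ["chair.png", "chair.txt", "chair.stl"]
matrix = [
    [1.0, 0.82, 0.77],
    [0.82, 1.0, 0.69],
    [0.77, 0.69, 1.0],
]

csv_text = matrix_to_csv(labels, matrix)
json_text = matrix_to_json(labels, matrix)
```

The CSV form keeps one row per item so it opens directly in a spreadsheet, while the JSON form keeps labels and scores together for programmatic use.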
---
## 🛠️ Technical Stack
| Area | Technology |
| :-------- | :---------------------------------------------------------------------------------------------------------- |
| **Frontend** | [React 19](https://react.dev/), [TypeScript](https://www.typescriptlang.org/), [TailwindCSS](https://tailwindcss.com/), [Three.js](https://threejs.org/), [IndexedDB](https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API) |
| **Backend** | [Python 3.10](https://www.python.org/), [FastAPI](https://fastapi.tiangolo.com/), [PyTorch](https://pytorch.org/), [Uvicorn](https://www.uvicorn.org/), [scikit-learn](https://scikit-learn.org/) |
| **Deployment**| [Docker](https://www.docker.com/), [Hugging Face Spaces](https://huggingface.co/spaces) (or any container-based platform) |
---
## Project Architecture
The application is architected as a modern monorepo with a clear separation between the frontend and backend services, designed for containerization and easy deployment.
### Frontend (`/frontend`)
A standalone Single-Page Application (SPA) built with React.
- **`components/`**: Contains reusable UI components, organized by feature (e.g., `DatasetManager`, `ComparisonTool`, `common/`).
- **`services/`**: Handles all side effects and external communication.
  - `apiService.ts`: Manages all HTTP requests to the backend API.
  - `dbService.ts`: Provides a simple interface for interacting with the browser's IndexedDB for local dataset persistence.
  - `comparisonService.ts`: Logic for handling client-side interactions with pre-computed similarity data.
- **`types.ts`**: Centralized TypeScript type definitions for robust data modeling.
- **`App.tsx`**: The main application component that orchestrates state and views.
### Backend (`/backend`)
A high-performance API server built with FastAPI.
- **`main.py`**: The main entry point for the FastAPI application. It defines all API endpoints, manages application lifecycle events (like model loading on startup), and serves the static frontend files.
- **`inference_utils.py`**: The core of the AI logic. It handles ZIP file processing, asset parsing, embedding generation using the PyTorch models, and similarity calculation (cosine similarity). It also manages an in-memory cache for embeddings to ensure fast retrieval.
- **`download_utils.py`**: A utility module for downloading model weights and shared datasets from external storage (e.g., Yandex.Disk) during the startup phase.
- **`cad_retrieval_utils/`**: A proprietary library containing the core model definitions, data loaders, and training/inference configurations for the cross-modal retrieval task.
- **`ReConV2/`**: A dependency containing model architectures and potentially C++ extensions for efficient 3D point cloud processing.
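
To make the ZIP-processing step described for `inference_utils.py` more concrete, here is a minimal sketch that sorts archive members into modalities by file extension. It is not the project's code: `group_archive_members` and `MODALITY_BY_EXTENSION` are hypothetical names, and only the three extensions mentioned in this README are assumed to be supported.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Assumed mapping, based on the formats this README lists.
MODALITY_BY_EXTENSION = {".png": "image", ".txt": "text", ".stl": "mesh"}


def group_archive_members(zip_bytes):
    """Return {modality: [member names]} for supported files in a ZIP archive."""
    groups = {modality: [] for modality in MODALITY_BY_EXTENSION.values()}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries
            suffix = PurePosixPath(info.filename).suffix.lower()
            modality = MODALITY_BY_EXTENSION.get(suffix)
            if modality is not None:
                groups[modality].append(info.filename)
    for names in groups.values():
        names.sort()  # deterministic order for indexing
    return groups


# Build a tiny in-memory archive to exercise the function.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("assets/chair.png", b"fake image bytes")
    archive.writestr("assets/chair.txt", "a wooden chair")
    archive.writestr("assets/chair.stl", b"fake mesh bytes")
    archive.writestr("README.md", "ignored")

groups = group_archive_members(buffer.getvalue())
```

Unsupported files are skipped rather than rejected, which is one possible policy; a stricter service could report them back to the user instead.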
---
## How It Works
The core workflow for processing a new dataset is as follows:
1. **Upload**: The user uploads a `.zip` file via the React frontend.
2. **API Request**: The frontend sends the file to the `/api/process-dataset` endpoint on the FastAPI backend.
3. **Unpacking & Preprocessing**: The backend saves the archive to a temporary directory and extracts all image, text, and mesh files.
4. **Embedding Generation**: For each file, a specialized PyTorch model generates a high-dimensional vector embedding:
   - An **Image Encoder** processes `.png` files.
   - A **Text Encoder** processes `.txt` files.
   - A **Point Cloud (PC) Encoder** processes `.stl` files after converting them to point clouds.
5. **Caching**: The generated embeddings and asset metadata are stored in an in-memory cache on the server for instant access.
6. **Full Comparison**: The backend pre-computes a full N x N similarity matrix by calculating the cosine similarity between every pair of embeddings.
7. **Response & Client-Side Storage**: The fully processed dataset object, including the comparison matrix, is sent back to the client. The frontend then saves this complete dataset to IndexedDB, making it available for future sessions without needing to re-upload.
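
The full-comparison and search steps above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the backend's implementation (which uses PyTorch models): the helper names, the toy 2-D embeddings, and the modality labels are all made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(embeddings):
    """Build the symmetric N x N matrix of pairwise cosine similarities."""
    n = len(embeddings)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            score = cosine_similarity(embeddings[i], embeddings[j])
            matrix[i][j] = score
            matrix[j][i] = score  # similarity is symmetric
    return matrix


def top_matches(matrix, modalities, query_index, k=3):
    """Rank items from other modalities by similarity to the query item."""
    candidates = [
        (matrix[query_index][j], j)
        for j in range(len(modalities))
        if modalities[j] != modalities[query_index]
    ]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [(j, score) for score, j in candidates[:k]]


# Toy data: real embeddings are high-dimensional model outputs.
embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
modalities = ["image", "text", "mesh", "text"]

matrix = similarity_matrix(embeddings)
matches = top_matches(matrix, modalities, query_index=0, k=2)
```

Because the matrix is pre-computed once per dataset, each later lookup reduces to reading one row and sorting it, which is why the client can serve searches from the stored dataset without calling the backend again.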
---
## Getting Started
You can run this project locally using Docker, which encapsulates both the frontend and backend services.
### Prerequisites
- [Docker](https://www.docker.com/get-started) installed on your machine.
### Local Installation & Startup
1. **Clone the repository:**
```bash
git clone <your-repository-url>
cd <repository-name>
```
2. **Check Model & Data URLs:**
The application is configured to download pre-trained models and a shared dataset from public URLs. Please verify the links inside `backend/main.py` and replace them with your own if necessary.
3. **Build and run with Docker:**
The provided `Dockerfile` is a multi-stage build that compiles the frontend and sets up the Python backend in a single, optimized image.
```bash
# Build the Docker image
docker build -t cross-modal-retrieval .
# Run the container
docker run -p 7860:7860 cross-modal-retrieval
```
4. **Access the application:**
Open your browser and navigate to [http://localhost:7860](http://localhost:7860).
---
## 💡 Future Improvements
- **Support for More Formats**: Extend file support to `.obj`/`.glb` for 3D models and `.jpeg`/`.webp` for images.
- **Advanced Search**: Implement more complex filtering and search options within the dataset viewer (e.g., by similarity score, item count).
- **Embedding Visualization**: Add a new section to visualize the high-dimensional embedding space using techniques like t-SNE or UMAP.
- **User Authentication**: Introduce user accounts to manage private datasets and share them with collaborators.
- **Model Fine-tuning**: Allow users to fine-tune the retrieval models on their own datasets to improve domain-specific accuracy.
---
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.