---
title: Cross-Modal Object Comparison Tool
emoji: 🔍
colorFrom: green
colorTo: yellow
sdk: docker
pinned: true
short_description: Demo of Image <-> 3D <-> Text retrieval tool for AI Challenge
license: mit
---
# Cross-Modal 3D Asset Retrieval & Comparison Tool
An advanced, full-stack application designed to manage and analyze multi-modal datasets containing 3D models, images, and text descriptions. The tool leverages deep learning models to compute and compare embeddings across different modalities, enabling powerful cross-modal search and retrieval.
The interface allows users to upload their own datasets, explore a pre-loaded shared dataset, and perform detailed comparisons to find the most similar assets, regardless of their original format.
## Key Features

- **Multi-Modal Dataset Management**: Upload `.zip` archives containing images (`.png`), text (`.txt`), and 3D models (`.stl`). The system automatically processes and indexes them.
- **Cloud & Local Datasets**: Seamlessly switch between a large, pre-processed shared dataset hosted on the server and local datasets stored securely in your browser's IndexedDB.
- **Interactive Content Viewer**:
  - A high-performance 3D viewer for `.stl` models with zoom/pan/rotate controls, powered by Three.js.
  - Integrated image and text viewers.
  - Fullscreen mode for detailed inspection of any asset.
- **Powerful Cross-Modal Comparison**:
  - **Dataset Item Search**: Select any item within a dataset to instantly see its top matches across all other modalities based on semantic similarity.
  - **Ad-Hoc Search**: Upload a new, external image, 3D model, or text snippet to find the most similar items within a selected dataset.
- **Full Analysis Export**: Download the complete, pre-computed similarity matrix for any processed dataset as a `.json` or `.csv` file for offline analysis and reporting.
- **Responsive & Modern UI**: A clean, fast, and intuitive user interface built with React, TypeScript, and TailwindCSS.
- **High-Performance Backend**: Powered by FastAPI and PyTorch, the backend is optimized for asynchronous operations and efficient deep learning inference.
## Technical Stack
| Area | Technology |
|---|---|
| Frontend | React 19, TypeScript, TailwindCSS, Three.js, IndexedDB |
| Backend | Python 3.10, FastAPI, PyTorch, Uvicorn, scikit-learn |
| Deployment | Docker, Hugging Face Spaces (or any container-based platform) |
## Project Architecture
The application is architected as a modern monorepo with a clear separation between the frontend and backend services, designed for containerization and easy deployment.
### Frontend (`/frontend`)

A standalone Single-Page Application (SPA) built with React.

- `components/`: Contains reusable UI components, organized by feature (e.g., `DatasetManager`, `ComparisonTool`, `common/`).
- `services/`: Handles all side effects and external communication.
  - `apiService.ts`: Manages all HTTP requests to the backend API.
  - `dbService.ts`: Provides a simple interface for interacting with the browser's IndexedDB for local dataset persistence.
  - `comparisonService.ts`: Logic for handling client-side interactions with pre-computed similarity data.
- `types.ts`: Centralized TypeScript type definitions for robust data modeling.
- `App.tsx`: The main application component that orchestrates state and views.
### Backend (`/backend`)

A high-performance API server built with FastAPI.

- `main.py`: The main entry point for the FastAPI application. It defines all API endpoints, manages application lifecycle events (like model loading on startup), and serves the static frontend files (a sketch of this wiring follows the list).
- `inference_utils.py`: The core of the AI logic. It handles ZIP file processing, asset parsing, embedding generation using the PyTorch models, and similarity calculation (cosine similarity). It also manages an in-memory cache for embeddings to ensure fast retrieval.
- `download_utils.py`: A utility module for downloading model weights and shared datasets from external storage (e.g., Yandex.Disk) during the startup phase.
- `cad_retrieval_utils/`: A proprietary library containing the core model definitions, data loaders, and training/inference configurations for the cross-modal retrieval task.
- `ReConV2/`: A dependency containing model architectures and, potentially, C++ extensions for efficient 3D point cloud processing.
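The sketch below shows, in heavily simplified form, how `main.py` might wire the upload endpoint to the inference utilities and an in-memory embedding cache. Only the `/api/process-dataset` route name comes from this project; `embed_assets()` is a stand-in for the real encoders in `inference_utils.py`, so treat this as an illustration of the wiring, not the actual implementation.

```python
import tempfile
import zipfile
from pathlib import Path

import numpy as np
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

# In-memory cache: dataset name -> {"embeddings": ..., "similarity": ...}
EMBEDDING_CACHE: dict[str, dict] = {}

def embed_assets(root: str) -> tuple[np.ndarray, list[dict]]:
    """Stand-in for inference_utils.py: one random vector per supported file."""
    paths = sorted(p for p in Path(root).rglob("*") if p.suffix in {".png", ".txt", ".stl"})
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(len(paths), 512)).astype(np.float32)
    metadata = [{"name": p.name, "modality": p.suffix.lstrip(".")} for p in paths]
    return embeddings, metadata

@app.post("/api/process-dataset")
async def process_dataset(file: UploadFile = File(...)):
    # Save the uploaded archive to a temporary directory and extract it.
    with tempfile.TemporaryDirectory() as tmp_dir:
        archive_path = Path(tmp_dir) / "upload.zip"
        archive_path.write_bytes(await file.read())
        with zipfile.ZipFile(archive_path) as zf:
            zf.extractall(tmp_dir)

        # Embed every asset, then pre-compute the full cosine-similarity matrix.
        embeddings, metadata = embed_assets(tmp_dir)
        unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        similarity = unit @ unit.T

    # Cache server-side and return everything the client needs to persist
    # the dataset in IndexedDB.
    EMBEDDING_CACHE[file.filename] = {"embeddings": embeddings, "similarity": similarity}
    return {"id": file.filename, "items": metadata, "similarity": similarity.tolist()}
```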
## How It Works
The core workflow for processing a new dataset is as follows:
1. **Upload**: The user uploads a `.zip` file via the React frontend.
2. **API Request**: The frontend sends the file to the `/api/process-dataset` endpoint on the FastAPI backend.
3. **Unpacking & Preprocessing**: The backend saves the archive to a temporary directory and extracts all image, text, and mesh files.
4. **Embedding Generation**: For each file, a specialized PyTorch model generates a high-dimensional vector embedding:
   - An Image Encoder processes `.png` files.
   - A Text Encoder processes `.txt` files.
   - A Point Cloud (PC) Encoder processes `.stl` files after converting them to point clouds.
5. **Caching**: The generated embeddings and asset metadata are stored in an in-memory cache on the server for instant access.
6. **Full Comparison**: The backend pre-computes a full N x N similarity matrix by calculating the cosine similarity between every pair of embeddings (see the sketch after this list).
7. **Response & Client-Side Storage**: The fully processed dataset object, including the comparison matrix, is sent back to the client. The frontend then saves this complete dataset to IndexedDB, making it available for future sessions without needing to re-upload.
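To make steps 4-6 concrete, here is a minimal, self-contained sketch of the underlying math, assuming `trimesh` for mesh sampling and plain NumPy for the similarity computation. The PyTorch encoders themselves are omitted (dummy embeddings stand in for their output), and `top_matches()` is an illustrative helper in the spirit of the "Dataset Item Search" feature, not code from this repository.

```python
import numpy as np
import trimesh

def stl_to_point_cloud(path: str, n_points: int = 1024) -> np.ndarray:
    """Convert an .stl mesh to a fixed-size point cloud by sampling its surface."""
    mesh = trimesh.load(path, force="mesh")
    points, _ = trimesh.sample.sample_surface(mesh, n_points)
    return np.asarray(points, dtype=np.float32)

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """similarity[i, j] = cosine of the angle between embeddings i and j."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T

def top_matches(similarity: np.ndarray, index: int, k: int = 5) -> list[int]:
    """Indices of the k items most similar to the given item (itself excluded)."""
    order = np.argsort(-similarity[index])
    return [int(i) for i in order if i != index][:k]

# Usage with dummy embeddings in place of the real encoder outputs:
embeddings = np.random.default_rng(0).normal(size=(10, 512)).astype(np.float32)
similarity = cosine_similarity_matrix(embeddings)   # the full N x N matrix
print(top_matches(similarity, index=0))             # top-5 matches for item 0
```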
## Getting Started
You can run this project locally using Docker, which encapsulates both the frontend and backend services.
### Prerequisites
- Docker installed on your machine.
### Local Installation & Startup

1. **Clone the repository:**

   ```bash
   git clone <your-repository-url>
   cd <repository-name>
   ```

2. **Check Model & Data URLs:** The application is configured to download pre-trained models and a shared dataset from public URLs. Verify the links inside `backend/main.py` and replace them with your own if necessary.

3. **Build and run with Docker:** The provided `Dockerfile` is a multi-stage build that compiles the frontend and sets up the Python backend in a single, optimized image.

   ```bash
   # Build the Docker image
   docker build -t cross-modal-retrieval .

   # Run the container
   docker run -p 7860:7860 cross-modal-retrieval
   ```

4. **Access the application:** Open your browser and navigate to http://localhost:7860, or call the API directly as shown below.
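For scripted use, the processing endpoint can be exercised without the UI. A minimal sketch with the `requests` library, assuming the endpoint accepts a multipart file upload (check `backend/main.py` for the exact request and response contract):

```python
import requests

# Upload a dataset archive to the running container; the multipart field
# name "file" and the response shape below are assumptions, not guarantees.
with open("my_dataset.zip", "rb") as f:
    response = requests.post(
        "http://localhost:7860/api/process-dataset",
        files={"file": ("my_dataset.zip", f, "application/zip")},
    )
response.raise_for_status()
dataset = response.json()
print(f"Received {len(dataset.get('items', []))} processed items")
```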
## Future Improvements
- **Support for More Formats**: Extend file support to `.obj`/`.glb` for 3D models and `.jpeg`/`.webp` for images.
- **Advanced Search**: Implement more complex filtering and search options within the dataset viewer (e.g., by similarity score or item count).
- **Embedding Visualization**: Add a new section to visualize the high-dimensional embedding space using techniques like t-SNE or UMAP.
- **User Authentication**: Introduce user accounts to manage private datasets and share them with collaborators.
- **Model Fine-Tuning**: Allow users to fine-tune the retrieval models on their own datasets to improve domain-specific accuracy.
## License
This project is licensed under the MIT License. See the LICENSE file for details.