---
title: Cross-Modal Object Comparison Tool
emoji: 🔍
colorFrom: green
colorTo: yellow
sdk: docker
pinned: true
short_description: Demo of Image <-> 3D <-> Text retrieval tool for AI Challenge
license: mit
---
# Cross-Modal 3D Asset Retrieval & Comparison Tool
An advanced, full-stack application designed to manage and analyze multi-modal datasets containing 3D models, images, and text descriptions. The tool leverages deep learning models to compute and compare embeddings across different modalities, enabling powerful cross-modal search and retrieval.
The interface allows users to upload their own datasets, explore a pre-loaded shared dataset, and perform detailed comparisons to find the most similar assets, regardless of their original format.
## Key Features

- **Multi-Modal Dataset Management**: Upload `.zip` archives containing images (`.png`), text (`.txt`), and 3D models (`.stl`). The system automatically processes and indexes them.
- **Cloud & Local Datasets**: Seamlessly switch between a large, pre-processed shared dataset hosted on the server and local datasets stored securely in your browser's IndexedDB.
- **Interactive Content Viewer**:
  - A high-performance 3D viewer for `.stl` models with zoom/pan/rotate controls, powered by Three.js.
  - Integrated image and text viewers.
  - Fullscreen mode for detailed inspection of any asset.
- **Powerful Cross-Modal Comparison**:
  - **Dataset Item Search**: Select any item within a dataset to instantly see its top matches across all other modalities based on semantic similarity.
  - **Ad-Hoc Search**: Upload a new, external image, 3D model, or text snippet to find the most similar items within a selected dataset.
- **Full Analysis Export**: Download the complete, pre-computed similarity matrix for any processed dataset as a `.json` or `.csv` file for offline analysis and reporting.
- **Responsive & Modern UI**: A clean, fast, and intuitive user interface built with React, TypeScript, and TailwindCSS.
- **High-Performance Backend**: Powered by FastAPI and PyTorch, the backend is optimized for asynchronous operations and efficient deep learning inference.
## Technical Stack
| Area | Technology |
|---|---|
| Frontend | React 19, TypeScript, TailwindCSS, Three.js, IndexedDB |
| Backend | Python 3.10, FastAPI, PyTorch, Uvicorn, scikit-learn |
| Deployment | Docker, Hugging Face Spaces (or any container-based platform) |
## Project Architecture
The application is architected as a modern monorepo with a clear separation between the frontend and backend services, designed for containerization and easy deployment.
### Frontend (`/frontend`)

A standalone Single-Page Application (SPA) built with React.

- `components/`: Contains reusable UI components, organized by feature (e.g., `DatasetManager`, `ComparisonTool`, `common/`).
- `services/`: Handles all side effects and external communication.
  - `apiService.ts`: Manages all HTTP requests to the backend API.
  - `dbService.ts`: Provides a simple interface for interacting with the browser's IndexedDB for local dataset persistence.
  - `comparisonService.ts`: Logic for handling client-side interactions with pre-computed similarity data.
- `types.ts`: Centralized TypeScript type definitions for robust data modeling.
- `App.tsx`: The main application component that orchestrates state and views.
### Backend (`/backend`)

A high-performance API server built with FastAPI.

- `main.py`: The main entry point for the FastAPI application. It defines all API endpoints, manages application lifecycle events (like model loading on startup), and serves the static frontend files (a sketch of this wiring follows the list).
- `inference_utils.py`: The core of the AI logic. It handles ZIP file processing, asset parsing, embedding generation using the PyTorch models, and similarity calculation (cosine similarity). It also manages an in-memory cache for embeddings to ensure fast retrieval.
- `download_utils.py`: A utility module for downloading model weights and shared datasets from external storage (e.g., Yandex.Disk) during the startup phase.
- `cad_retrieval_utils/`: A proprietary library containing the core model definitions, data loaders, and training/inference configurations for the cross-modal retrieval task.
- `ReConV2/`: A dependency containing model architectures and, potentially, C++ extensions for efficient 3D point cloud processing.
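The sketch below shows, in heavily simplified form, how `main.py` might wire the upload endpoint to the inference utilities and an in-memory embedding cache. Only the `/api/process-dataset` route name comes from this project; `embed_assets()` is a stand-in for the real encoders in `inference_utils.py`, so treat this as an illustration of the wiring, not the actual implementation.

```python
import tempfile
import zipfile
from pathlib import Path

import numpy as np
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

# In-memory cache: dataset name -> {"embeddings": ..., "similarity": ...}
EMBEDDING_CACHE: dict[str, dict] = {}

def embed_assets(root: str) -> tuple[np.ndarray, list[dict]]:
    """Stand-in for inference_utils.py: one random vector per supported file."""
    paths = sorted(p for p in Path(root).rglob("*") if p.suffix in {".png", ".txt", ".stl"})
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(len(paths), 512)).astype(np.float32)
    metadata = [{"name": p.name, "modality": p.suffix.lstrip(".")} for p in paths]
    return embeddings, metadata

@app.post("/api/process-dataset")
async def process_dataset(file: UploadFile = File(...)):
    # Save the uploaded archive to a temporary directory and extract it.
    with tempfile.TemporaryDirectory() as tmp_dir:
        archive_path = Path(tmp_dir) / "upload.zip"
        archive_path.write_bytes(await file.read())
        with zipfile.ZipFile(archive_path) as zf:
            zf.extractall(tmp_dir)

        # Embed every asset, then pre-compute the full cosine-similarity matrix.
        embeddings, metadata = embed_assets(tmp_dir)
        unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        similarity = unit @ unit.T

    # Cache server-side and return everything the client needs to persist
    # the dataset in IndexedDB.
    EMBEDDING_CACHE[file.filename] = {"embeddings": embeddings, "similarity": similarity}
    return {"id": file.filename, "items": metadata, "similarity": similarity.tolist()}
```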
## How It Works
The core workflow for processing a new dataset is as follows:
1. **Upload**: The user uploads a `.zip` file via the React frontend.
2. **API Request**: The frontend sends the file to the `/api/process-dataset` endpoint on the FastAPI backend.
3. **Unpacking & Preprocessing**: The backend saves the archive to a temporary directory and extracts all image, text, and mesh files.
4. **Embedding Generation**: For each file, a specialized PyTorch model generates a high-dimensional vector embedding:
   - An Image Encoder processes `.png` files.
   - A Text Encoder processes `.txt` files.
   - A Point Cloud (PC) Encoder processes `.stl` files after converting them to point clouds.
5. **Caching**: The generated embeddings and asset metadata are stored in an in-memory cache on the server for instant access.
6. **Full Comparison**: The backend pre-computes a full N x N similarity matrix by calculating the cosine similarity between every pair of embeddings (see the sketch after this list).
7. **Response & Client-Side Storage**: The fully processed dataset object, including the comparison matrix, is sent back to the client. The frontend then saves this complete dataset to IndexedDB, making it available for future sessions without needing to re-upload.
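To make steps 4-6 concrete, here is a minimal, self-contained sketch of the underlying math, assuming `trimesh` for mesh sampling and plain NumPy for the similarity computation. The PyTorch encoders themselves are omitted (dummy embeddings stand in for their output), and `top_matches()` is an illustrative helper in the spirit of the "Dataset Item Search" feature, not code from this repository.

```python
import numpy as np
import trimesh

def stl_to_point_cloud(path: str, n_points: int = 1024) -> np.ndarray:
    """Convert an .stl mesh to a fixed-size point cloud by sampling its surface."""
    mesh = trimesh.load(path, force="mesh")
    points, _ = trimesh.sample.sample_surface(mesh, n_points)
    return np.asarray(points, dtype=np.float32)

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """similarity[i, j] = cosine of the angle between embeddings i and j."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T

def top_matches(similarity: np.ndarray, index: int, k: int = 5) -> list[int]:
    """Indices of the k items most similar to the given item (itself excluded)."""
    order = np.argsort(-similarity[index])
    return [int(i) for i in order if i != index][:k]

# Usage with dummy embeddings in place of the real encoder outputs:
embeddings = np.random.default_rng(0).normal(size=(10, 512)).astype(np.float32)
similarity = cosine_similarity_matrix(embeddings)   # the full N x N matrix
print(top_matches(similarity, index=0))             # top-5 matches for item 0
```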
## Getting Started
You can run this project locally using Docker, which encapsulates both the frontend and backend services.
### Prerequisites
- Docker installed on your machine.
### Local Installation & Startup

1. **Clone the repository:**

   ```bash
   git clone <your-repository-url>
   cd <repository-name>
   ```

2. **Check Model & Data URLs:** The application is configured to download pre-trained models and a shared dataset from public URLs. Verify the links inside `backend/main.py` and replace them with your own if necessary.

3. **Build and run with Docker:** The provided `Dockerfile` is a multi-stage build that compiles the frontend and sets up the Python backend in a single, optimized image.

   ```bash
   # Build the Docker image
   docker build -t cross-modal-retrieval .

   # Run the container
   docker run -p 7860:7860 cross-modal-retrieval
   ```

4. **Access the application:** Open your browser and navigate to http://localhost:7860, or call the API directly as shown below.
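For scripted use, the processing endpoint can be exercised without the UI. A minimal sketch with the `requests` library, assuming the endpoint accepts a multipart file upload (check `backend/main.py` for the exact request and response contract):

```python
import requests

# Upload a dataset archive to the running container; the multipart field
# name "file" and the response shape below are assumptions, not guarantees.
with open("my_dataset.zip", "rb") as f:
    response = requests.post(
        "http://localhost:7860/api/process-dataset",
        files={"file": ("my_dataset.zip", f, "application/zip")},
    )
response.raise_for_status()
dataset = response.json()
print(f"Received {len(dataset.get('items', []))} processed items")
```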
## Future Improvements
- **Support for More Formats**: Extend file support to `.obj`/`.glb` for 3D models and `.jpeg`/`.webp` for images.
- **Advanced Search**: Implement more complex filtering and search options within the dataset viewer (e.g., by similarity score or item count).
- **Embedding Visualization**: Add a new section to visualize the high-dimensional embedding space using techniques like t-SNE or UMAP.
- **User Authentication**: Introduce user accounts to manage private datasets and share them with collaborators.
- **Model Fine-Tuning**: Allow users to fine-tune the retrieval models on their own datasets to improve domain-specific accuracy.
## License
This project is licensed under the MIT License. See the LICENSE file for details.