amaye15 committed
Commit 4b2d0b0 · 1 Parent(s): fab7fd4

Initial Commit
.gitattributes CHANGED
@@ -33,3 +33,12 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ chat_template.json filter=lfs diff=lfs merge=lfs -text
+ generation_config.json filter=lfs diff=lfs merge=lfs -text
+ preprocessor_config.json filter=lfs diff=lfs merge=lfs -text
+ tokenizer_config.json filter=lfs diff=lfs merge=lfs -text
+ adapter_config.json filter=lfs diff=lfs merge=lfs -text
+ added_tokens.json filter=lfs diff=lfs merge=lfs -text
+ special_tokens_map.json filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
+ vocab.json filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,4 @@
+ *.DS*
+ *__pycache__*
+ *.pdf
+ *.ipynb
README.md CHANGED
@@ -1,3 +1,117 @@
  ---
  license: mit
  ---
+
+
+ # EndpointHandler
+
+ `EndpointHandler` is a Python class that processes image and text data to generate embeddings and similarity scores using the ColQwen2 model, a visual retriever based on Qwen2-VL-2B-Instruct that uses the ColBERT strategy. The handler is optimized for retrieving documents and visual information by their visual and textual features.
+
+ ## Overview
+
+ - **Efficient Document Retrieval**: Uses the ColQwen2 model to produce embeddings for images and text, enabling accurate document retrieval.
+ - **Multi-vector Representation**: Generates ColBERT-style multi-vector embeddings for improved similarity search.
+ - **Flexible Image Resolution**: Supports dynamic image resolution without altering the aspect ratio, capped at 768 patches for memory efficiency.
+ - **Device Compatibility**: Automatically uses available CUDA devices and falls back to the CPU otherwise.
+
+ ## Model Details
+
+ The **ColQwen2** model extends Qwen2-VL-2B with a focus on vision-language tasks, making it suitable for content indexing and retrieval. Key features include:
+ - **Training**: Pre-trained with a batch size of 256 over 5 epochs, with a modified pad token.
+ - **Input Flexibility**: Handles various image resolutions without resizing, ensuring accurate multi-vector representation.
+ - **Similarity Scoring**: Uses a ColBERT-style late-interaction scoring approach for efficient retrieval across image and text modalities (see the sketch below).
+
+ This base version is untrained, providing deterministic initialization of the projection layer for further customization.
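+
+ The ColBERT-style scoring mentioned above can be sketched in a few lines. This is a minimal illustration of the late-interaction (MaxSim) idea, not the exact `score_multi_vector` implementation shipped with `colpali-engine`:
+
+ ```python
+ import torch
+
+ def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
+     """Late-interaction (MaxSim) score between one query and one document.
+
+     query_emb: (num_query_tokens, dim) multi-vector query embedding
+     doc_emb:   (num_doc_tokens, dim) multi-vector document embedding
+     """
+     sim = query_emb @ doc_emb.T          # token-to-token similarity matrix
+     return sim.max(dim=1).values.sum()   # best doc token per query token, summed
+ ```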
+
+ ## How to Use
+
+ The following example demonstrates how to use `EndpointHandler` to process PDF documents and text. PDF pages are converted to base64-encoded images, which are then passed as input alongside text data to the handler.
+
+ ### Example Script
+
+ ```python
+ from pdf2image import convert_from_path
+ import base64
+ from io import BytesIO
+ import requests
+
+
+ # Function to convert a PIL Image to a base64 string
+ def pil_image_to_base64(image):
+     """Converts a PIL Image to a base64-encoded string."""
+     buffer = BytesIO()
+     image.save(buffer, format="PNG")
+     return base64.b64encode(buffer.getvalue()).decode()
+
+
+ # Function to convert PDF pages to base64 images
+ def convert_pdf_to_base64_images(pdf_path):
+     """Converts PDF pages to base64-encoded images."""
+     pages = convert_from_path(pdf_path)
+     return [pil_image_to_base64(page) for page in pages]
+
+
+ # Function to send a payload to the API and retrieve the response
+ def query_api(payload, api_url, headers):
+     """Sends a POST request to the API and returns the JSON response."""
+     response = requests.post(api_url, headers=headers, json=payload)
+     return response.json()
+
+
+ # Main execution
+ if __name__ == "__main__":
+     # Convert PDF pages to base64-encoded images
+     encoded_images = convert_pdf_to_base64_images("document.pdf")
+
+     # Prepare the payload
+     payload = {
+         "inputs": [],
+         "image": encoded_images,
+         "text": ["example query text"],
+     }
+
+     # API configuration
+     API_URL = "https://your-api-url"
+     headers = {
+         "Accept": "application/json",
+         "Authorization": "Bearer your_access_token",
+         "Content-Type": "application/json",
+     }
+
+     # Query the API and print the output
+     output = query_api(payload=payload, api_url=API_URL, headers=headers)
+     print(output)
+ ```
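+
+ Note that `pdf2image` requires the Poppler utilities to be installed on the system (e.g., `apt-get install poppler-utils`). The empty `"inputs"` field is presumably included to satisfy the default Hugging Face Inference Endpoints payload schema; the handler itself only reads the `"image"`, `"text"`, and optional `"batch_size"` keys.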
+
+ ## Inputs and Outputs
+
+ ### Input Format
+ The `EndpointHandler` expects a dictionary containing:
+ - **image**: A list of base64-encoded strings for images (e.g., PDF pages converted to images).
+ - **text**: A list of text strings representing queries or document contents.
+ - **batch_size** (optional): The batch size for processing images and text. Defaults to `4`.
+
+ Example payload:
+ ```json
+ {
+   "image": ["base64_image_string_1", "base64_image_string_2"],
+   "text": ["sample text 1", "sample text 2"],
+   "batch_size": 4
+ }
+ ```
+
+ ### Output Format
+ The handler returns a dictionary with the following keys:
+ - **image**: List of embeddings for each image.
+ - **text**: List of embeddings for each text entry.
+ - **scores**: List of similarity scores between the image and text embeddings.
+
+ Example output:
+ ```json
+ {
+   "image": [[0.12, 0.34, ...], [0.56, 0.78, ...]],
+   "text": [[0.11, 0.22, ...], [0.33, 0.44, ...]],
+   "scores": [[0.87, 0.45], [0.23, 0.67]]
+ }
+ ```
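+
+ Each row of `scores` corresponds to one text query and each column to one image, so ranking pages for a query amounts to sorting one row. A minimal client-side sketch, reusing the `output` returned by the example script above:
+
+ ```python
+ # Rank images (e.g., PDF pages) for the first query, best match first
+ query_scores = output["scores"][0]
+ ranked = sorted(range(len(query_scores)), key=lambda i: query_scores[i], reverse=True)
+ print(ranked)  # with the example scores above: [0, 1]
+ ```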
114
+
115
+ ### Error Handling
116
+ If any issues occur during processing (e.g., decoding images or model inference), the handler logs the error and returns an error message in the output dictionary.
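+
+ Client code can therefore branch on the presence of an `"error"` key before using the embeddings. A minimal check, reusing `query_api` from the example script:
+
+ ```python
+ output = query_api(payload=payload, api_url=API_URL, headers=headers)
+ if "error" in output:
+     raise RuntimeError(f"Endpoint error: {output['error']}")
+ scores = output["scores"]
+ ```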
adapter_config.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c88fb289a155188a09737629830dc32e753bb679d6bddd5f94ddf9daa1921114
+ size 727
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fc856312174dc99a4c7f88a2c54d9590a3b3f5b5a86e2728d7138c7f4758c6d5
+ size 74018232
added_tokens.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7fa54985b58718a8fdb4f4d97484c4bd908db114847675e4bf3afe3e1d5d7bd4
+ size 392
chat_template.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:94174d7176c52a7192f96fc34eb2cf23c7c2059d63cdbfadca1586ba89731fb7
+ size 1049
generation_config.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f31bc5c808ee15908986654279dd054f3e6bd65d52f8ca7b18a2a80552e2d35b
+ size 215
handler.py ADDED
@@ -0,0 +1,168 @@
+ import torch
+ from typing import Dict, Any, List
+ from PIL import Image
+ import base64
+ from io import BytesIO
+ import logging
+
+
+ class EndpointHandler:
+     """
+     A handler class for processing image and text data, generating embeddings using a specified model and processor.
+
+     Attributes:
+         model: The pre-trained model used for generating embeddings.
+         processor: The pre-trained processor used to process images and text before model inference.
+         device: The device (CPU or CUDA) used to run model inference.
+         default_batch_size: The default batch size for processing images and text in batches.
+     """
+
+     def __init__(self, path: str = "", default_batch_size: int = 4):
+         """
+         Initializes the EndpointHandler with a specified model path and default batch size.
+
+         Args:
+             path (str): Path to the pre-trained model and processor.
+             default_batch_size (int): Default batch size for processing images and text data.
+         """
+         # Initialize logging
+         logging.basicConfig(level=logging.INFO)
+         self.logger = logging.getLogger(__name__)
+
+         from colpali_engine.models import ColQwen2, ColQwen2Processor
+
+         self.logger.info("Initializing model and processor.")
+         try:
+             self.model = ColQwen2.from_pretrained(
+                 path,
+                 torch_dtype=torch.bfloat16,
+                 device_map="auto",
+             ).eval()
+             self.processor = ColQwen2Processor.from_pretrained(path)
+             self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+             self.model.to(self.device)
+             self.default_batch_size = default_batch_size
+             self.logger.info("Initialization complete.")
+         except Exception as e:
+             self.logger.error(f"Failed to initialize model or processor: {e}")
+             raise
+
+     def _process_image_batch(self, images: List[Image.Image]) -> List[List[float]]:
+         """
+         Processes a batch of images and generates embeddings.
+
+         Args:
+             images (List[Image.Image]): List of images to process.
+
+         Returns:
+             List[List[float]]: List of embeddings for each image.
+         """
+         self.logger.debug(f"Processing batch of {len(images)} images.")
+         try:
+             batch_images = self.processor.process_images(images).to(self.device)
+             with torch.no_grad():
+                 image_embeddings = self.model(**batch_images)
+             self.logger.debug("Image batch processing complete.")
+             return image_embeddings.cpu().tolist()
+         except Exception as e:
+             self.logger.error(f"Error processing image batch: {e}")
+             raise
+
+     def _process_text_batch(self, texts: List[str]) -> List[List[float]]:
+         """
+         Processes a batch of text queries and generates embeddings.
+
+         Args:
+             texts (List[str]): List of text queries to process.
+
+         Returns:
+             List[List[float]]: List of embeddings for each text query.
+         """
+         self.logger.debug(f"Processing batch of {len(texts)} text queries.")
+         try:
+             batch_queries = self.processor.process_queries(texts).to(self.device)
+             with torch.no_grad():
+                 query_embeddings = self.model(**batch_queries)
+             self.logger.debug("Text batch processing complete.")
+             return query_embeddings.cpu().tolist()
+         except Exception as e:
+             self.logger.error(f"Error processing text batch: {e}")
+             raise
+
+     def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
+         """
+         Processes input data containing base64-encoded images and text queries, decodes them, and generates embeddings.
+
+         Args:
+             data (Dict[str, Any]): Dictionary containing input images, text queries, and optional batch size.
+
+         Returns:
+             Dict[str, Any]: Dictionary containing generated embeddings for images and text or error messages.
+         """
+         images_data = data.get("image", [])
+         text_data = data.get("text", [])
+         batch_size = data.get("batch_size", self.default_batch_size)
+
+         # Decode base64-encoded images into PIL images
+         images = []
+         if images_data:
+             self.logger.info("Decoding images from base64.")
+             for img_data in images_data:
+                 if isinstance(img_data, str):
+                     try:
+                         image_bytes = base64.b64decode(img_data)
+                         image = Image.open(BytesIO(image_bytes)).convert("RGB")
+                         images.append(image)
+                     except Exception as e:
+                         self.logger.error(f"Invalid image data: {e}")
+                         return {"error": f"Invalid image data: {e}"}
+                 else:
+                     self.logger.error("Images should be base64-encoded strings.")
+                     return {"error": "Images should be base64-encoded strings."}
+
+         # Generate image embeddings in batches
+         image_embeddings = []
+         if images:
+             self.logger.info("Processing image embeddings.")
+             try:
+                 for i in range(0, len(images), batch_size):
+                     batch_images = images[i : i + batch_size]
+                     batch_embeddings = self._process_image_batch(batch_images)
+                     image_embeddings.extend(batch_embeddings)
+             except Exception as e:
+                 self.logger.error(f"Error generating image embeddings: {e}")
+                 return {"error": f"Error generating image embeddings: {e}"}
+
+         # Generate text embeddings in batches
+         text_embeddings = []
+         if text_data:
+             self.logger.info("Processing text embeddings.")
+             try:
+                 for i in range(0, len(text_data), batch_size):
+                     batch_texts = text_data[i : i + batch_size]
+                     batch_text_embeddings = self._process_text_batch(batch_texts)
+                     text_embeddings.extend(batch_text_embeddings)
+             except Exception as e:
+                 self.logger.error(f"Error generating text embeddings: {e}")
+                 return {"error": f"Error generating text embeddings: {e}"}
+
+         # Compute similarity scores if both image and text embeddings are available
+         scores = []
+         if image_embeddings and text_embeddings:
+             self.logger.info("Computing similarity scores.")
+             try:
+                 image_embeddings_tensor = torch.tensor(image_embeddings).to(self.device)
+                 text_embeddings_tensor = torch.tensor(text_embeddings).to(self.device)
+                 with torch.no_grad():
+                     scores = (
+                         self.processor.score_multi_vector(
+                             text_embeddings_tensor, image_embeddings_tensor
+                         )
+                         .cpu()
+                         .tolist()
+                     )
+                 self.logger.info("Similarity scoring complete.")
+             except Exception as e:
+                 self.logger.error(f"Error computing similarity scores: {e}")
+                 return {"error": f"Error computing similarity scores: {e}"}
+
+         return {"image": image_embeddings, "text": text_embeddings, "scores": scores}
merges.txt ADDED
The diff for this file is too large to render.
preprocessor_config.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5dd5968b65af7e090e399f39ae94734e400d9d71a3c82fca2720c5ee514034f3
+ size 568
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ colpali-engine==0.3.3
+ pdf2image
+ GPUtil
+ accelerate==0.30.1
special_tokens_map.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:76862e765266b85aa9459767e33cbaf13970f327a0e88d1c65846c2ddd3a1ecd
+ size 613
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:091aa7594dc2fcfbfa06b9e3c22a5f0562ac14f30375c13af7309407a0e67b8a
+ size 11420371
tokenizer_config.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:955409fb4dab09a71b957ce69f8a8185bbbd3416b9ab5a47e01221545be39c6f
+ size 4298
vocab.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ca10d7e9fb3ed18575dd1e277a2579c16d108e32f27439684afa0e10b1440910
+ size 2776833