{"cells":[{"cell_type":"markdown","metadata":{"id":"header"},"source":["# DeepSeek-OCR on Google Colab\n","\n","This notebook sets up and runs the DeepSeek-OCR model for optical character recognition.\n","\n","**Requirements:**\n","- GPU Runtime (T4 or better recommended)\n","- ~15-20 minutes setup time\n","\n","**Based on:** https://github.com/deepseek-ai/DeepSeek-OCR"]},{"cell_type":"markdown","metadata":{"id":"setup-header"},"source":["## 1. Environment Setup and GPU Check"]},{"cell_type":"code","execution_count":6,"metadata":{"id":"gpu-check","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1761052135367,"user_tz":-180,"elapsed":209,"user":{"displayName":"hg ahcz","userId":"17954928916181846033"}},"outputId":"b7b51fa6-6b80-47bb-9b9d-bd8c45de233f"},"outputs":[{"output_type":"stream","name":"stdout","text":["Tue Oct 21 13:08:54 2025 \n","+-----------------------------------------------------------------------------------------+\n","| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |\n","|-----------------------------------------+------------------------+----------------------+\n","| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |\n","| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |\n","| | | MIG M. |\n","|=========================================+========================+======================|\n","| 0 NVIDIA L4 Off | 00000000:00:03.0 Off | 0 |\n","| N/A 41C P8 16W / 72W | 3MiB / 23034MiB | 0% Default |\n","| | | N/A |\n","+-----------------------------------------+------------------------+----------------------+\n"," \n","+-----------------------------------------------------------------------------------------+\n","| Processes: |\n","| GPU GI CI PID Type Process name GPU Memory |\n","| ID ID Usage |\n","|=========================================================================================|\n","| No running processes found |\n","+-----------------------------------------------------------------------------------------+\n","\n","PyTorch version: 2.8.0+cu126\n","CUDA available: True\n","CUDA version: 12.6\n","GPU: NVIDIA L4\n","GPU Memory: 22.16 GB\n"]}],"source":["# Check GPU availability\n","!nvidia-smi\n","\n","import torch\n","print(f\"\\nPyTorch version: {torch.__version__}\")\n","print(f\"CUDA available: {torch.cuda.is_available()}\")\n","if torch.cuda.is_available():\n"," print(f\"CUDA version: {torch.version.cuda}\")\n"," print(f\"GPU: {torch.cuda.get_device_name(0)}\")\n"," print(f\"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB\")"]},{"cell_type":"markdown","metadata":{"id":"clone-header"},"source":["## 2. 
Clone Repository"]},{"cell_type":"code","execution_count":2,"metadata":{"id":"clone-repo","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1761051945593,"user_tz":-180,"elapsed":1715,"user":{"displayName":"hg ahcz","userId":"17954928916181846033"}},"outputId":"04242349-7ab5-44a3-f54b-0e0838d77d1e"},"outputs":[{"output_type":"stream","name":"stdout","text":["Cloning into 'DeepSeek-OCR'...\n","remote: Enumerating objects: 34, done.\u001b[K\n","remote: Counting objects: 100% (4/4), done.\u001b[K\n","remote: Compressing objects: 100% (4/4), done.\u001b[K\n","remote: Total 34 (delta 0), reused 3 (delta 0), pack-reused 30 (from 1)\u001b[K\n","Receiving objects: 100% (34/34), 7.78 MiB | 17.63 MiB/s, done.\n","Resolving deltas: 100% (1/1), done.\n","/content/DeepSeek-OCR\n"]}],"source":["# Clone the DeepSeek-OCR repository\n","!git clone https://github.com/deepseek-ai/DeepSeek-OCR.git\n","%cd DeepSeek-OCR"]},{"cell_type":"markdown","metadata":{"id":"install-header"},"source":["## 3. Install Dependencies\n","\n","Installing PyTorch, transformers, and other required packages."]},{"cell_type":"code","execution_count":3,"metadata":{"id":"install-pytorch","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1761052048126,"user_tz":-180,"elapsed":6830,"user":{"displayName":"hg ahcz","userId":"17954928916181846033"}},"outputId":"7df6bde0-d8e3-4296-8f32-4055b680235a"},"outputs":[{"output_type":"stream","name":"stdout","text":["Looking in indexes: https://download.pytorch.org/whl/cu118\n","Requirement already satisfied: torch in /usr/local/lib/python3.12/dist-packages (2.8.0+cu126)\n","Requirement already satisfied: torchvision in /usr/local/lib/python3.12/dist-packages (0.23.0+cu126)\n","Requirement already satisfied: torchaudio in /usr/local/lib/python3.12/dist-packages (2.8.0+cu126)\n","Requirement already satisfied: filelock in /usr/local/lib/python3.12/dist-packages (from torch) (3.20.0)\n","Requirement already satisfied: typing-extensions>=4.10.0 in /usr/local/lib/python3.12/dist-packages (from torch) (4.15.0)\n","Requirement already satisfied: setuptools in /usr/local/lib/python3.12/dist-packages (from torch) (75.2.0)\n","Requirement already satisfied: sympy>=1.13.3 in /usr/local/lib/python3.12/dist-packages (from torch) (1.13.3)\n","Requirement already satisfied: networkx in /usr/local/lib/python3.12/dist-packages (from torch) (3.5)\n","Requirement already satisfied: jinja2 in /usr/local/lib/python3.12/dist-packages (from torch) (3.1.6)\n","Requirement already satisfied: fsspec in /usr/local/lib/python3.12/dist-packages (from torch) (2025.3.0)\n","Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.6.77 in /usr/local/lib/python3.12/dist-packages (from torch) (12.6.77)\n","Requirement already satisfied: nvidia-cuda-runtime-cu12==12.6.77 in /usr/local/lib/python3.12/dist-packages (from torch) (12.6.77)\n","Requirement already satisfied: nvidia-cuda-cupti-cu12==12.6.80 in /usr/local/lib/python3.12/dist-packages (from torch) (12.6.80)\n","Requirement already satisfied: nvidia-cudnn-cu12==9.10.2.21 in /usr/local/lib/python3.12/dist-packages (from torch) (9.10.2.21)\n","Requirement already satisfied: nvidia-cublas-cu12==12.6.4.1 in /usr/local/lib/python3.12/dist-packages (from torch) (12.6.4.1)\n","Requirement already satisfied: nvidia-cufft-cu12==11.3.0.4 in /usr/local/lib/python3.12/dist-packages (from torch) (11.3.0.4)\n","Requirement already satisfied: nvidia-curand-cu12==10.3.7.77 in 
/usr/local/lib/python3.12/dist-packages (from torch) (10.3.7.77)\n","Requirement already satisfied: nvidia-cusolver-cu12==11.7.1.2 in /usr/local/lib/python3.12/dist-packages (from torch) (11.7.1.2)\n","Requirement already satisfied: nvidia-cusparse-cu12==12.5.4.2 in /usr/local/lib/python3.12/dist-packages (from torch) (12.5.4.2)\n","Requirement already satisfied: nvidia-cusparselt-cu12==0.7.1 in /usr/local/lib/python3.12/dist-packages (from torch) (0.7.1)\n","Requirement already satisfied: nvidia-nccl-cu12==2.27.3 in /usr/local/lib/python3.12/dist-packages (from torch) (2.27.3)\n","Requirement already satisfied: nvidia-nvtx-cu12==12.6.77 in /usr/local/lib/python3.12/dist-packages (from torch) (12.6.77)\n","Requirement already satisfied: nvidia-nvjitlink-cu12==12.6.85 in /usr/local/lib/python3.12/dist-packages (from torch) (12.6.85)\n","Requirement already satisfied: nvidia-cufile-cu12==1.11.1.6 in /usr/local/lib/python3.12/dist-packages (from torch) (1.11.1.6)\n","Requirement already satisfied: triton==3.4.0 in /usr/local/lib/python3.12/dist-packages (from torch) (3.4.0)\n","Requirement already satisfied: numpy in /usr/local/lib/python3.12/dist-packages (from torchvision) (2.0.2)\n","Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /usr/local/lib/python3.12/dist-packages (from torchvision) (11.3.0)\n","Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.12/dist-packages (from sympy>=1.13.3->torch) (1.3.0)\n","Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.12/dist-packages (from jinja2->torch) (3.0.3)\n"]}],"source":["# Install PyTorch with CUDA support (Colab typically has CUDA 11.8 or 12.1)\n","# Note: Colab may already have PyTorch installed, but we ensure compatible version\n","!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118"]},{"cell_type":"code","execution_count":4,"metadata":{"id":"install-requirements","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1761052066852,"user_tz":-180,"elapsed":18724,"user":{"displayName":"hg ahcz","userId":"17954928916181846033"}},"outputId":"f5b37d6d-0019-4e44-d7dd-869270dc31bb"},"outputs":[{"output_type":"stream","name":"stdout","text":["Collecting transformers==4.46.3 (from -r requirements.txt (line 1))\n"," Downloading transformers-4.46.3-py3-none-any.whl.metadata (44 kB)\n","\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/44.1 kB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m44.1/44.1 kB\u001b[0m \u001b[31m3.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n","\u001b[?25hCollecting tokenizers==0.20.3 (from -r requirements.txt (line 2))\n"," Downloading tokenizers-0.20.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)\n","Collecting PyMuPDF (from -r requirements.txt (line 3))\n"," Downloading pymupdf-1.26.5-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)\n","Collecting img2pdf (from -r requirements.txt (line 4))\n"," Downloading img2pdf-0.6.1.tar.gz (106 kB)\n","\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m106.5/106.5 kB\u001b[0m \u001b[31m11.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n","\u001b[?25h Preparing metadata (setup.py) ... 
\u001b[?25l\u001b[?25hdone\n","Requirement already satisfied: einops in /usr/local/lib/python3.12/dist-packages (from -r requirements.txt (line 5)) (0.8.1)\n","Requirement already satisfied: easydict in /usr/local/lib/python3.12/dist-packages (from -r requirements.txt (line 6)) (1.13)\n","Collecting addict (from -r requirements.txt (line 7))\n"," Downloading addict-2.4.0-py3-none-any.whl.metadata (1.0 kB)\n","Requirement already satisfied: Pillow in /usr/local/lib/python3.12/dist-packages (from -r requirements.txt (line 8)) (11.3.0)\n","Requirement already satisfied: numpy in /usr/local/lib/python3.12/dist-packages (from -r requirements.txt (line 9)) (2.0.2)\n","Requirement already satisfied: filelock in /usr/local/lib/python3.12/dist-packages (from transformers==4.46.3->-r requirements.txt (line 1)) (3.20.0)\n","Requirement already satisfied: huggingface-hub<1.0,>=0.23.2 in /usr/local/lib/python3.12/dist-packages (from transformers==4.46.3->-r requirements.txt (line 1)) (0.35.3)\n","Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.12/dist-packages (from transformers==4.46.3->-r requirements.txt (line 1)) (25.0)\n","Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.12/dist-packages (from transformers==4.46.3->-r requirements.txt (line 1)) (6.0.3)\n","Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.12/dist-packages (from transformers==4.46.3->-r requirements.txt (line 1)) (2024.11.6)\n","Requirement already satisfied: requests in /usr/local/lib/python3.12/dist-packages (from transformers==4.46.3->-r requirements.txt (line 1)) (2.32.4)\n","Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.12/dist-packages (from transformers==4.46.3->-r requirements.txt (line 1)) (0.6.2)\n","Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.12/dist-packages (from transformers==4.46.3->-r requirements.txt (line 1)) (4.67.1)\n","Collecting pikepdf (from img2pdf->-r requirements.txt (line 4))\n"," Downloading pikepdf-9.11.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (8.2 kB)\n","Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<1.0,>=0.23.2->transformers==4.46.3->-r requirements.txt (line 1)) (2025.3.0)\n","Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<1.0,>=0.23.2->transformers==4.46.3->-r requirements.txt (line 1)) (4.15.0)\n","Requirement already satisfied: hf-xet<2.0.0,>=1.1.3 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<1.0,>=0.23.2->transformers==4.46.3->-r requirements.txt (line 1)) (1.1.10)\n","Collecting Deprecated (from pikepdf->img2pdf->-r requirements.txt (line 4))\n"," Downloading Deprecated-1.2.18-py2.py3-none-any.whl.metadata (5.7 kB)\n","Requirement already satisfied: lxml>=4.8 in /usr/local/lib/python3.12/dist-packages (from pikepdf->img2pdf->-r requirements.txt (line 4)) (5.4.0)\n","Requirement already satisfied: charset_normalizer<4,>=2 in /usr/local/lib/python3.12/dist-packages (from requests->transformers==4.46.3->-r requirements.txt (line 1)) (3.4.4)\n","Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.12/dist-packages (from requests->transformers==4.46.3->-r requirements.txt (line 1)) (3.11)\n","Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.12/dist-packages (from requests->transformers==4.46.3->-r requirements.txt (line 1)) 
(2.5.0)\n","Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.12/dist-packages (from requests->transformers==4.46.3->-r requirements.txt (line 1)) (2025.10.5)\n","Requirement already satisfied: wrapt<2,>=1.10 in /usr/local/lib/python3.12/dist-packages (from Deprecated->pikepdf->img2pdf->-r requirements.txt (line 4)) (1.17.3)\n","Downloading transformers-4.46.3-py3-none-any.whl (10.0 MB)\n","\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m10.0/10.0 MB\u001b[0m \u001b[31m129.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n","\u001b[?25hDownloading tokenizers-0.20.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)\n","\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.0/3.0 MB\u001b[0m \u001b[31m67.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n","\u001b[?25hDownloading pymupdf-1.26.5-cp39-abi3-manylinux_2_28_x86_64.whl (24.1 MB)\n","\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m24.1/24.1 MB\u001b[0m \u001b[31m101.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n","\u001b[?25hDownloading addict-2.4.0-py3-none-any.whl (3.8 kB)\n","Downloading pikepdf-9.11.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (2.6 MB)\n","\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.6/2.6 MB\u001b[0m \u001b[31m95.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n","\u001b[?25hDownloading Deprecated-1.2.18-py2.py3-none-any.whl (10.0 kB)\n","Building wheels for collected packages: img2pdf\n"," Building wheel for img2pdf (setup.py) ... \u001b[?25l\u001b[?25hdone\n"," Created wheel for img2pdf: filename=img2pdf-0.6.1-py3-none-any.whl size=51001 sha256=5898148565b9e8f8d7e2709de7f9503033d37bbc2b7ea4b749ef723e627f5c8f\n"," Stored in directory: /root/.cache/pip/wheels/a5/05/56/c05447973db749cd2178b8f95e36f007f0af5f5dce2c6197a5\n","Successfully built img2pdf\n","Installing collected packages: addict, PyMuPDF, Deprecated, pikepdf, tokenizers, img2pdf, transformers\n"," Attempting uninstall: tokenizers\n"," Found existing installation: tokenizers 0.22.1\n"," Uninstalling tokenizers-0.22.1:\n"," Successfully uninstalled tokenizers-0.22.1\n"," Attempting uninstall: transformers\n"," Found existing installation: transformers 4.57.1\n"," Uninstalling transformers-4.57.1:\n"," Successfully uninstalled transformers-4.57.1\n","Successfully installed Deprecated-1.2.18 PyMuPDF-1.26.5 addict-2.4.0 img2pdf-0.6.1 pikepdf-9.11.0 tokenizers-0.20.3 transformers-4.46.3\n"]}],"source":["# Install requirements from the repository\n","!pip install -r requirements.txt"]},{"cell_type":"code","execution_count":9,"metadata":{"id":"install-flash-attn","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1761052247111,"user_tz":-180,"elapsed":30354,"user":{"displayName":"hg ahcz","userId":"17954928916181846033"}},"outputId":"6d507296-90f0-44b6-a06d-baba0ad73f89"},"outputs":[{"output_type":"stream","name":"stdout","text":["Collecting flash-attn==2.7.3\n"," Downloading flash_attn-2.7.3.tar.gz (3.2 MB)\n","\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/3.2 MB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m \u001b[32m3.1/3.2 MB\u001b[0m \u001b[31m96.2 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.2/3.2 
MB\u001b[0m \u001b[31m50.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n","\u001b[?25h Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n","Requirement already satisfied: torch in /usr/local/lib/python3.12/dist-packages (from flash-attn==2.7.3) (2.8.0+cu126)\n","Requirement already satisfied: einops in /usr/local/lib/python3.12/dist-packages (from flash-attn==2.7.3) (0.8.1)\n","Requirement already satisfied: filelock in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (3.20.0)\n","Requirement already satisfied: typing-extensions>=4.10.0 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (4.15.0)\n","Requirement already satisfied: setuptools in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (75.2.0)\n","Requirement already satisfied: sympy>=1.13.3 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (1.13.3)\n","Requirement already satisfied: networkx in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (3.5)\n","Requirement already satisfied: jinja2 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (3.1.6)\n","Requirement already satisfied: fsspec in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (2025.3.0)\n","Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.6.77 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.6.77)\n","Requirement already satisfied: nvidia-cuda-runtime-cu12==12.6.77 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.6.77)\n","Requirement already satisfied: nvidia-cuda-cupti-cu12==12.6.80 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.6.80)\n","Requirement already satisfied: nvidia-cudnn-cu12==9.10.2.21 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (9.10.2.21)\n","Requirement already satisfied: nvidia-cublas-cu12==12.6.4.1 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.6.4.1)\n","Requirement already satisfied: nvidia-cufft-cu12==11.3.0.4 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (11.3.0.4)\n","Requirement already satisfied: nvidia-curand-cu12==10.3.7.77 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (10.3.7.77)\n","Requirement already satisfied: nvidia-cusolver-cu12==11.7.1.2 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (11.7.1.2)\n","Requirement already satisfied: nvidia-cusparse-cu12==12.5.4.2 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.5.4.2)\n","Requirement already satisfied: nvidia-cusparselt-cu12==0.7.1 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (0.7.1)\n","Requirement already satisfied: nvidia-nccl-cu12==2.27.3 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (2.27.3)\n","Requirement already satisfied: nvidia-nvtx-cu12==12.6.77 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.6.77)\n","Requirement already satisfied: nvidia-nvjitlink-cu12==12.6.85 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.6.85)\n","Requirement already satisfied: nvidia-cufile-cu12==1.11.1.6 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (1.11.1.6)\n","Requirement already satisfied: triton==3.4.0 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (3.4.0)\n","Requirement 
already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.12/dist-packages (from sympy>=1.13.3->torch->flash-attn==2.7.3) (1.3.0)\n","Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.12/dist-packages (from jinja2->torch->flash-attn==2.7.3) (3.0.3)\n","Building wheels for collected packages: flash-attn\n"," Building wheel for flash-attn (setup.py) ... \u001b[?25l\u001b[?25hdone\n"," Created wheel for flash-attn: filename=flash_attn-2.7.3-cp312-cp312-linux_x86_64.whl size=414494788 sha256=567bddcae6f7c133fd964bed9988926fe7aabaddb58bf62a744b2f782a7d4269\n"," Stored in directory: /root/.cache/pip/wheels/f6/ba/3a/e5622e4a21e0735b65d5f7a0aca41c83467aaf2122031d214e\n","Successfully built flash-attn\n","Installing collected packages: flash-attn\n","Successfully installed flash-attn-2.7.3\n"]}],"source":["# Install flash-attention (this may take 5-10 minutes to compile)\n","!pip install flash-attn==2.7.3 --no-build-isolation"]},{"cell_type":"markdown","metadata":{"id":"upload-header"},"source":["## 4. Upload Test Image\n","\n","Upload your Capture.PNG file here."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"upload-image"},"outputs":[],"source":["from google.colab import files\n","from IPython.display import Image, display\n","import os\n","\n","# Upload the image\n","print(\"Please upload your Capture.PNG file:\")\n","uploaded = files.upload()\n","\n","# Get the uploaded filename\n","image_path = list(uploaded.keys())[0]\n","print(f\"\\nUploaded file: {image_path}\")\n","\n","# Display the uploaded image\n","print(\"\\nPreview of uploaded image:\")\n","display(Image(filename=image_path))"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"91486119","executionInfo":{"status":"ok","timestamp":1761052289487,"user_tz":-180,"elapsed":4318,"user":{"displayName":"hg ahcz","userId":"17954928916181846033"}},"outputId":"fd63dddd-f81e-4600-e1a4-d16c643e28d4"},"source":["# Reinstall flash-attention with specific CUDA version\n","# Check your CUDA version with !nvidia-smi and adjust cu121 if necessary\n","!pip install flash-attn==2.7.3 --no-build-isolation --index-url https://download.pytorch.org/whl/cu121"],"execution_count":12,"outputs":[{"output_type":"stream","name":"stdout","text":["Looking in indexes: https://download.pytorch.org/whl/cu121\n","Requirement already satisfied: flash-attn==2.7.3 in /usr/local/lib/python3.12/dist-packages (2.7.3)\n","Requirement already satisfied: torch in /usr/local/lib/python3.12/dist-packages (from flash-attn==2.7.3) (2.8.0+cu126)\n","Requirement already satisfied: einops in /usr/local/lib/python3.12/dist-packages (from flash-attn==2.7.3) (0.8.1)\n","Requirement already satisfied: filelock in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (3.20.0)\n","Requirement already satisfied: typing-extensions>=4.10.0 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (4.15.0)\n","Requirement already satisfied: setuptools in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (75.2.0)\n","Requirement already satisfied: sympy>=1.13.3 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (1.13.3)\n","Requirement already satisfied: networkx in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (3.5)\n","Requirement already satisfied: jinja2 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (3.1.6)\n","Requirement already satisfied: fsspec in 
/usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (2025.3.0)\n","Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.6.77 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.6.77)\n","Requirement already satisfied: nvidia-cuda-runtime-cu12==12.6.77 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.6.77)\n","Requirement already satisfied: nvidia-cuda-cupti-cu12==12.6.80 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.6.80)\n","Requirement already satisfied: nvidia-cudnn-cu12==9.10.2.21 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (9.10.2.21)\n","Requirement already satisfied: nvidia-cublas-cu12==12.6.4.1 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.6.4.1)\n","Requirement already satisfied: nvidia-cufft-cu12==11.3.0.4 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (11.3.0.4)\n","Requirement already satisfied: nvidia-curand-cu12==10.3.7.77 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (10.3.7.77)\n","Requirement already satisfied: nvidia-cusolver-cu12==11.7.1.2 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (11.7.1.2)\n","Requirement already satisfied: nvidia-cusparse-cu12==12.5.4.2 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.5.4.2)\n","Requirement already satisfied: nvidia-cusparselt-cu12==0.7.1 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (0.7.1)\n","Requirement already satisfied: nvidia-nccl-cu12==2.27.3 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (2.27.3)\n","Requirement already satisfied: nvidia-nvtx-cu12==12.6.77 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.6.77)\n","Requirement already satisfied: nvidia-nvjitlink-cu12==12.6.85 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.6.85)\n","Requirement already satisfied: nvidia-cufile-cu12==1.11.1.6 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (1.11.1.6)\n","Requirement already satisfied: triton==3.4.0 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (3.4.0)\n","Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.12/dist-packages (from sympy>=1.13.3->torch->flash-attn==2.7.3) (1.3.0)\n","Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.12/dist-packages (from jinja2->torch->flash-attn==2.7.3) (3.0.3)\n"]}]},{"cell_type":"markdown","metadata":{"id":"model-header"},"source":["## 5. Load DeepSeek-OCR Model\n","\n","This will download the model from HuggingFace (may take a few minutes)."]},{"cell_type":"code","execution_count":13,"metadata":{"id":"load-model","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1761052308534,"user_tz":-180,"elapsed":12795,"user":{"displayName":"hg ahcz","userId":"17954928916181846033"}},"outputId":"512486e6-1a87-42f2-82f7-4a04093703a4"},"outputs":[{"output_type":"stream","name":"stdout","text":["Loading DeepSeek-OCR model...\n","This may take several minutes on first run...\n","\n"]},{"output_type":"stream","name":"stderr","text":["You are using a model of type deepseek_vl_v2 to instantiate a model of type DeepseekOCR. 
This is not supported for all configurations of models and can yield errors.\n","Some weights of DeepseekOCRForCausalLM were not initialized from the model checkpoint at deepseek-ai/DeepSeek-OCR and are newly initialized: ['model.vision_model.embeddings.position_ids']\n","You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"]},{"output_type":"stream","name":"stdout","text":["Model loaded successfully!\n","Model device: cuda:0\n","Model dtype: torch.bfloat16\n"]}],"source":["from transformers import AutoModel, AutoTokenizer\n","import torch\n","import os\n","\n","print(\"Loading DeepSeek-OCR model...\")\n","print(\"This may take several minutes on first run...\\n\")\n","\n","os.environ[\"CUDA_VISIBLE_DEVICES\"] = '0'\n","model_name = 'deepseek-ai/DeepSeek-OCR'\n","\n","tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)\n","# Removing attn_implementation='flash_attention_2' as a troubleshooting step\n","model = AutoModel.from_pretrained(model_name, trust_remote_code=True, use_safetensors=True)\n","model = model.eval().cuda().to(torch.bfloat16)\n","\n","print(\"Model loaded successfully!\")\n","print(f\"Model device: {next(model.parameters()).device}\")\n","print(f\"Model dtype: {next(model.parameters()).dtype}\")"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"cabad6a9","executionInfo":{"status":"ok","timestamp":1761052382571,"user_tz":-180,"elapsed":10192,"user":{"displayName":"hg ahcz","userId":"17954928916181846033"}},"outputId":"787ea67c-7d5a-4793-e662-7081561440b2"},"source":["from transformers import AutoModel, AutoTokenizer\n","import torch\n","import os\n","\n","print(\"Loading DeepSeek-OCR model...\")\n","print(\"This may take several minutes on first run...\\n\")\n","\n","os.environ[\"CUDA_VISIBLE_DEVICES\"] = '0'\n","model_name = 'deepseek-ai/DeepSeek-OCR'\n","\n","tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)\n","# Removing attn_implementation='flash_attention_2' as a troubleshooting step\n","model = AutoModel.from_pretrained(model_name, trust_remote_code=True, use_safetensors=True)\n","model = model.eval().cuda().to(torch.bfloat16)\n","\n","print(\"Model loaded successfully!\")\n","print(f\"Model device: {next(model.parameters()).device}\")\n","print(f\"Model dtype: {next(model.parameters()).dtype}\")"],"execution_count":16,"outputs":[{"output_type":"stream","name":"stdout","text":["Loading DeepSeek-OCR model...\n","This may take several minutes on first run...\n","\n"]},{"output_type":"stream","name":"stderr","text":["You are using a model of type deepseek_vl_v2 to instantiate a model of type DeepseekOCR. This is not supported for all configurations of models and can yield errors.\n","Some weights of DeepseekOCRForCausalLM were not initialized from the model checkpoint at deepseek-ai/DeepSeek-OCR and are newly initialized: ['model.vision_model.embeddings.position_ids']\n","You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"]},{"output_type":"stream","name":"stdout","text":["Model loaded successfully!\n","Model device: cuda:0\n","Model dtype: torch.bfloat16\n"]}]},{"cell_type":"markdown","metadata":{"id":"inference-header"},"source":["## 6. 
Run OCR Inference"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"b902dbdc","executionInfo":{"status":"ok","timestamp":1761052448331,"user_tz":-180,"elapsed":44254,"user":{"displayName":"hg ahcz","userId":"17954928916181846033"}},"outputId":"fc940e42-39d2-4cbb-d565-63afa61a505a"},"source":["from PIL import Image\n","import time\n","import os\n","import torch\n","\n","# Load the image (already loaded in a previous cell, but keeping this for clarity)\n","# img = Image.open(image_path)\n","# print(f\"Image size: {img.size}\")\n","# print(f\"Image mode: {img.mode}\\n\")\n","\n","# Set CUDA device (already set in model loading, but keeping for clarity)\n","# os.environ[\"CUDA_VISIBLE_DEVICES\"] = '0'\n","\n","print(\"Running OCR inference using model.infer...\\n\")\n","start_time = time.time()\n","\n","# Define prompt and output path\n","# prompt = \"\\nFree OCR. \"\n","prompt = \"\\n<|grounding|>Convert the document to markdown. \"\n","output_path = '/content/ocr_output' # Define an output directory\n","\n","# Create output directory if it doesn't exist\n","if not os.path.exists(output_path):\n"," os.makedirs(output_path)\n","\n","# Run inference using the infer method\n","with torch.no_grad():\n"," # infer(self, tokenizer, prompt='', image_file='', output_path = ' ', base_size = 1024, image_size = 640, crop_mode = True, test_compress = False, save_results = False):\n","\n"," # Tiny: base_size = 512, image_size = 512, crop_mode = False\n"," # Small: base_size = 640, image_size = 640, crop_mode = False\n"," # Base: base_size = 1024, image_size = 1024, crop_mode = False\n"," # Large: base_size = 1280, image_size = 1280, crop_mode = False\n","\n"," # Gundam: base_size = 1024, image_size = 640, crop_mode = True\n","\n"," res = model.infer(tokenizer,\n"," prompt=prompt,\n"," image_file=image_path, # Use the uploaded image path\n"," output_path=output_path,\n"," base_size=1024,\n"," image_size=640,\n"," crop_mode=True,\n"," save_results=True,\n"," test_compress=True)\n","\n","end_time = time.time()\n","\n","print(f\"Inference completed in {end_time - start_time:.2f} seconds\\n\")\n","print(\"=\" * 80)\n","print(\"OCR RESULT:\")\n","print(\"=\" * 80)\n","# The infer method might return different formats,\n","# we will assume it returns the text directly or in a structure we can access.\n","# You might need to adjust this based on the actual output format of model.infer\n","print(res)\n","print(\"=\" * 80)\n","\n","# Note: The infer method with save_results=True should save the output to output_path\n","# You might need to adjust the saving and downloading logic in the next cell\n","# depending on how model.infer saves the results."],"execution_count":19,"outputs":[{"output_type":"stream","name":"stderr","text":["/usr/local/lib/python3.12/dist-packages/transformers/generation/configuration_utils.py:590: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.\n"," warnings.warn(\n","The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. 
Please pass your input's `attention_mask` to obtain reliable results.\n","Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n"]},{"output_type":"stream","name":"stdout","text":["Running OCR inference using model.infer...\n","\n","=====================\n","BASE: torch.Size([1, 256, 1280])\n","PATCHES: torch.Size([4, 100, 1280])\n","=====================\n","<|ref|>text<|/ref|><|det|>[[62, 31, 483, 171]]<|/det|>\n","·We assess a wide range of state-of-the-art LLMs for the first time and empirically show that they exhibit significant patterns of bias related to non-binary gender representations, leaving room for future improvement. \n","\n","<|ref|>sub_title<|/ref|><|det|>[[62, 198, 235, 225]]<|/det|>\n","## 2 Related Work \n","\n","<|ref|>sub_title<|/ref|><|det|>[[62, 246, 373, 272]]<|/det|>\n","### 2.1 Binary Gender Bias in LLMs \n","\n","<|ref|>text<|/ref|><|det|>[[62, 283, 485, 992]]<|/det|>\n","Research on gender bias in artificial intelligence, especially in large language models (LLMs), has predominantly centered on binary gender categories, often reinforcing conventional stereotypes while overlooking the complexities of gender diversity (Blodgett et al., 2020; Nadeem et al., 2021; Schramowski et al., 2022; Stanovsky et al., 2019). Studies such as Bolukbasi et al. (2016) revealed that word embeddings trained in large corpora encode harmful gender stereotypes, associating men with technical roles and women with nurturing roles. Further research has demonstrated that LLMs often exhibit occupational gender bias, reinforcing male-dominated professions and associating women with domestic tasks (Zhao et al., 2018; Brown et al., 2020a; Wan et al., 2023; Ghosh and Caliskan, 2023; Chen et al., 2022). For example, Brown et al. (2020b) examined binary gender bias in GPT- 3 by prompting the model with phrases such as \"[He] was very\" and \"[She] was very\" and analyzing whether the adjectives and adverbs reflected gender stereotypes (e.g., \"handsome\" for men and \"beautiful\" for women). Chen et al. (2022) proposed a framework for measuring how LLMs reinforce gender stereotypes through role-based \n","\n","<|ref|>table<|/ref|><|det|>[[515, 24, 933, 355]]<|/det|>\n","\n","
<table><tr><td>Pronoun Type</td><td>Nom.</td><td>Acc.</td><td>Possessive</td><td></td><td>Ref.</td></tr>
<tr><td></td><td></td><td></td><td>Dep.</td><td>Indep.</td><td></td></tr>
<tr><td>Binary</td><td>he</td><td>him</td><td>his</td><td>his</td><td>himself</td></tr>
<tr><td></td><td>she</td><td>her</td><td>her</td><td>hers</td><td>herself</td></tr>
<tr><td>Neutral</td><td>they</td><td>them</td><td>their</td><td>theirs</td><td>themself</td></tr>
<tr><td>Neo</td><td>thon</td><td>thon</td><td>thons</td><td>thons</td><td>thonself</td></tr>
<tr><td></td><td>e</td><td>em</td><td>es</td><td>ems</td><td>emself</td></tr>
<tr><td></td><td>ae</td><td>aer</td><td>aer</td><td>aers</td><td>aerself</td></tr>
<tr><td></td><td>co</td><td>co</td><td>cos</td><td>cos</td><td>coself</td></tr>
<tr><td></td><td>vi</td><td>vir</td><td>vis</td><td>virs</td><td>virself</td></tr>
<tr><td></td><td>xe</td><td>xem</td><td>xyr</td><td>xyr</td><td>xemself</td></tr>
<tr><td></td><td>ey</td><td>em</td><td>eir</td><td>eirs</td><td>emself</td></tr>
<tr><td></td><td>ze</td><td>zir</td><td>zir</td><td>zirs</td><td>zirself</td></tr></table>
\n","\n","<|ref|>table_footnote<|/ref|><|det|>[[512, 375, 936, 421]]<|/det|>\n","Table 1: List of binary, gender-neutral, and neopronouns (Lauscher et al., 2022; Hossain et al., 2023). \n","\n","<|ref|>text<|/ref|><|det|>[[512, 448, 936, 985]]<|/det|>\n","communities. Blodgett et al. (2020) argued that many studies assessing bias in NLP systems lack grounding in real- world harms and do not adequately consider \"to whom\" these biases are harmful, particularly overlooking non- binary identities. Although datasets like StereoSet (Nadeem et al., 2021) and CrowS- Pairs (Nangia et al., 2020) have made progress in measuring stereotypical biases, they do not specifically address non- binary representation or experiences. Recent work has begun addressing this gap. You et al. (2024) explored name- based gender prediction with a \"neutral\" gender category. Hossain et al. (2023) introduced the MISGENDERED framework, evaluating LLMs on their use of gender- neutral pronouns and neopronouns. Similarly, Ovalle et al. (2023) examined how LLMs misgender transgender and non- binary (TGNB) individuals, revealing that binary norms dominate AI behavior and showing LLMs are less\n","==================================================\n","image size: (871, 784)\n","valid image tokens: 630\n","output texts tokens (valid): 1038\n","compression ratio: 1.65\n","==================================================\n","===============save results:===============\n"]},{"output_type":"stream","name":"stderr","text":["image: 0it [00:00, ?it/s]\n","other: 100%|██████████| 7/7 [00:00<00:00, 41352.29it/s]"]},{"output_type":"stream","name":"stdout","text":["Inference completed in 44.24 seconds\n","\n","================================================================================\n","OCR RESULT:\n","================================================================================\n","None\n","================================================================================\n"]},{"output_type":"stream","name":"stderr","text":["\n"]}]},{"cell_type":"markdown","metadata":{"id":"batch-header"},"source":["## 8. Batch Processing (Optional)\n","\n","Process multiple images at once."]},{"cell_type":"code","execution_count":27,"metadata":{"id":"batch-process","colab":{"base_uri":"https://localhost:8080/","height":1000},"executionInfo":{"status":"ok","timestamp":1761053029367,"user_tz":-180,"elapsed":99783,"user":{"displayName":"hg ahcz","userId":"17954928916181846033"}},"outputId":"0b15fc8e-e16f-4dcb-cf3a-3a63a5901b03"},"outputs":[{"output_type":"stream","name":"stdout","text":["Upload multiple images for batch processing:\n"]},{"output_type":"display_data","data":{"text/plain":[""],"text/html":["\n"," \n"," \n"," Upload widget is only available when the cell has been executed in the\n"," current browser session. Please rerun this cell to enable.\n"," \n"," "]},"metadata":{}},{"output_type":"stream","name":"stderr","text":["/usr/local/lib/python3.12/dist-packages/transformers/generation/configuration_utils.py:590: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.\n"," warnings.warn(\n","The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. 
Please pass your input's `attention_mask` to obtain reliable results.\n","Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n"]},{"output_type":"stream","name":"stdout","text":["Saving Capture.jpg to Capture (5).jpg\n","Saving Capture1.jpg to Capture1 (1).jpg\n","\n","Processing Capture (5).jpg...\n","=====================\n","BASE: torch.Size([1, 256, 1280])\n","PATCHES: torch.Size([4, 100, 1280])\n","=====================\n","<|ref|>text<|/ref|><|det|>[[62, 31, 483, 171]]<|/det|>\n","·We assess a wide range of state-of-the-art LLMs for the first time and empirically show that they exhibit significant patterns of bias related to non-binary gender representations, leaving room for future improvement. \n","\n","<|ref|>sub_title<|/ref|><|det|>[[62, 198, 235, 225]]<|/det|>\n","## 2 Related Work \n","\n","<|ref|>sub_title<|/ref|><|det|>[[62, 246, 373, 272]]<|/det|>\n","### 2.1 Binary Gender Bias in LLMs \n","\n","<|ref|>text<|/ref|><|det|>[[62, 283, 485, 992]]<|/det|>\n","Research on gender bias in artificial intelligence, especially in large language models (LLMs), has predominantly centered on binary gender categories, often reinforcing conventional stereotypes while overlooking the complexities of gender diversity (Blodgett et al., 2020; Nadeem et al., 2021; Schramowski et al., 2022; Stanovsky et al., 2019). Studies such as Bolukbasi et al. (2016) revealed that word embeddings trained in large corpora encode harmful gender stereotypes, associating men with technical roles and women with nurturing roles. Further research has demonstrated that LLMs often exhibit occupational gender bias, reinforcing male-dominated professions and associating women with domestic tasks (Zhao et al., 2018; Brown et al., 2020a; Wan et al., 2023; Ghosh and Caliskan, 2023; Chen et al., 2022). For example, Brown et al. (2020b) examined binary gender bias in GPT- 3 by prompting the model with phrases such as \"[He] was very\" and \"[She] was very\" and analyzing whether the adjectives and adverbs reflected gender stereotypes (e.g., \"handsome\" for men and \"beautiful\" for women). Chen et al. (2022) proposed a framework for measuring how LLMs reinforce gender stereotypes through role-based \n","\n","<|ref|>table<|/ref|><|det|>[[515, 24, 933, 355]]<|/det|>\n","\n","
<table><tr><td>Pronoun Type</td><td>Nom.</td><td>Acc.</td><td>Possessive</td><td></td><td>Ref.</td></tr>
<tr><td></td><td></td><td></td><td>Dep.</td><td>Indep.</td><td></td></tr>
<tr><td>Binary</td><td>he</td><td>him</td><td>his</td><td>his</td><td>himself</td></tr>
<tr><td></td><td>she</td><td>her</td><td>her</td><td>hers</td><td>herself</td></tr>
<tr><td>Neutral</td><td>they</td><td>them</td><td>their</td><td>theirs</td><td>themself</td></tr>
<tr><td>Neo</td><td>thon</td><td>thon</td><td>thons</td><td>thons</td><td>thonself</td></tr>
<tr><td></td><td>e</td><td>em</td><td>es</td><td>ems</td><td>emself</td></tr>
<tr><td></td><td>ae</td><td>aer</td><td>aer</td><td>aers</td><td>aerself</td></tr>
<tr><td></td><td>co</td><td>co</td><td>cos</td><td>cos</td><td>coself</td></tr>
<tr><td></td><td>vi</td><td>vir</td><td>vis</td><td>virs</td><td>virself</td></tr>
<tr><td></td><td>xe</td><td>xem</td><td>xyr</td><td>xyr</td><td>xemself</td></tr>
<tr><td></td><td>ey</td><td>em</td><td>eir</td><td>eirs</td><td>emself</td></tr>
<tr><td></td><td>ze</td><td>zir</td><td>zir</td><td>zirs</td><td>zirself</td></tr></table>
\n","\n","<|ref|>table_footnote<|/ref|><|det|>[[512, 375, 936, 421]]<|/det|>\n","Table 1: List of binary, gender-neutral, and neopronouns (Lauscher et al., 2022; Hossain et al., 2023). \n","\n","<|ref|>text<|/ref|><|det|>[[512, 448, 936, 985]]<|/det|>\n","communities. Blodgett et al. (2020) argued that many studies assessing bias in NLP systems lack grounding in real- world harms and do not adequately consider \"to whom\" these biases are harmful, particularly overlooking non- binary identities. Although datasets like StereoSet (Nadeem et al., 2021) and CrowS- Pairs (Nangia et al., 2020) have made progress in measuring stereotypical biases, they do not specifically address non- binary representation or experiences. Recent work has begun addressing this gap. You et al. (2024) explored name- based gender prediction with a \"neutral\" gender category. Hossain et al. (2023) introduced the MISGENDERED framework, evaluating LLMs on their use of gender- neutral pronouns and neopronouns. Similarly, Ovalle et al. (2023) examined how LLMs misgender transgender and non- binary (TGNB) individuals, revealing that binary norms dominate AI behavior and showing LLMs are less\n","==================================================\n","image size: (871, 784)\n","valid image tokens: 630\n","output texts tokens (valid): 1038\n","compression ratio: 1.65\n","==================================================\n","===============save results:===============\n"]},{"output_type":"stream","name":"stderr","text":["image: 0it [00:00, ?it/s]\n","other: 100%|██████████| 7/7 [00:00<00:00, 68279.37it/s]"]},{"output_type":"stream","name":"stdout","text":["✓ Capture (5).jpg processed successfully. Output saved to /content/batch_ocr_output\n","\n","Processing Capture1 (1).jpg...\n"]},{"output_type":"stream","name":"stderr","text":["\n","The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n","Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n"]},{"output_type":"stream","name":"stdout","text":["=====================\n","BASE: torch.Size([1, 256, 1280])\n","PATCHES: torch.Size([4, 100, 1280])\n","=====================\n","<|ref|>text<|/ref|><|det|>[[20, 0, 475, 338]]<|/det|>\n","Retrieval Augmented Generation system for LLM agents. SCMRAG introduces a novel paradigm that moves beyond the static retrieval methods of traditional RAG systems by integrating a dynamic, LLM- assisted knowledge graph for information retrieval. This knowledge graph evolves with the system, updating and refining itself based on the SCMRAG's agent driven interactions and query- answer pair generations. Crucially, SCMRAG also includes a self- corrective mechanism, enabling it to identify when information is missing or inadequate and autonomously retrieves it from external sources (e.g. web, enterprise information sources, or any other available information resources) by generating a new retrieval query without relying on predefined algorithms. This self- corrective step ensures that up- to- date and accurate information is always accessible. \n","\n","<|ref|>text<|/ref|><|det|>[[20, 338, 475, 500]]<|/det|>\n","Another key feature of SCMRAG is its LLM agent driven internal reasoning agent. It gives the system the decision- making capability to determine whether the knowledge graph contains sufficient information to answer a query or whether a corrective step is necessary to enhance the retrieval process. 
It enables SCMRAG to adapt to a wide range of tasks and domains while minimizing hallucinations. \n","\n","<|ref|>text<|/ref|><|det|>[[20, 504, 475, 602]]<|/det|>\n","SCMRAG ensures that only the most relevant content is retrieved from available data sources, even when the knowledge base is incomplete or outdated. The key contributions of our proposed method are as follows: \n","\n","<|ref|>text<|/ref|><|det|>[[42, 618, 475, 978]]<|/det|>\n","1. We introduce a novel RAG paradigm that employs a dynamic, self-updating knowledge graph to guide multihop retrieval, allowing for more context-aware and accurate information retrieval. \n","2. We propose a self-corrective, agent-driven mechanism that enables SCMRAG to autonomously update missing or outdated information by fetching data from external sources. \n","3. We achieve state-of-the-art performance on four datasets, even when using a quantized LLM with significantly fewer parameters. Notably, these results are obtained without any LLM fine-tuning. \n","4. We demonstrate that SCMRAG's advanced reasoning capabilities significantly reduce hallucinations by ensuring that only the most relevant and accurate information is provided to the LLM for generation. \n","\n","<|ref|>text<|/ref|><|det|>[[520, 0, 828, 20]]<|/det|>\n","pretraining with vast amounts of knowledge. \n","\n","<|ref|>sub_title<|/ref|><|det|>[[518, 55, 680, 80]]<|/det|>\n","### 2.1 Initial Work \n","\n","<|ref|>text<|/ref|><|det|>[[518, 85, 972, 384]]<|/det|>\n","Works such as the RAG model proposed by Lewis et al. [13] were instrumental in showing that augmenting a generation model with a retrieval step could greatly improve the factual correctness of AI- generated text. Lewis et al. introduced the two- stage RAG process, where a retriever is responsible for fetching relevant documents based on a query, and a generator produces text conditioned on these retrieved documents. This approach was proven to outperform purely generative or purely extractive models in tasks such as knowledge- based QA and passage generation. This dual system ensures that the model's output is grounded in real- world data. It highlighted the importance of coupling retrieval systems with LLMs to enhance performance in open- domain tasks. \n","\n","<|ref|>sub_title<|/ref|><|det|>[[518, 410, 849, 436]]<|/det|>\n","### 2.2 Advances in RAG Architecture \n","\n","<|ref|>text<|/ref|><|det|>[[518, 441, 972, 616]]<|/det|>\n","Several models have introduced further innovations to improve the efficiency and accuracy of retrieval mechanisms. Karpukhin et al. developed Dense Passage Retrieval (DPR) [9], a technique that leverages dense vector representations for more accurate retrieval of semantically relevant passages. DPR became foundational in improving the retriever's ability to return highly relevant documents from vast corpora. \n","\n","<|ref|>text<|/ref|><|det|>[[518, 618, 972, 789]]<|/det|>\n","Later advancements in RAG systems sought to optimize both the retrieval and generation phases. Fusion- in- Decoder [8] integrated multiple retrieved documents simultaneously within the decoder, allowing the model to generate answers that more holistically synthesized information from various sources. This method allowed for more contextual outputs, and was effective in handling multi- hop questions requiring reasoning across multiple documents. 
\n","\n","<|ref|>text<|/ref|><|det|>[[518, 792, 972, 960]]<|/det|>\n","A critical issue with these approaches is the reliance on static retrieval corpora, which limits the system's ability to access up- to- date information, leading to outdated or incomplete responses in rapidly evolving domains. Moreover, the retriever and generator components in transformer based RAG models are generally trained separately. This often leads to mismatches between retrieved documents and generated content.\n","==================================================\n","image size: (887, 719)\n","valid image tokens: 607\n","output texts tokens (valid): 1018\n","compression ratio: 1.68\n","==================================================\n","===============save results:===============\n"]},{"output_type":"stream","name":"stderr","text":["image: 0it [00:00, ?it/s]\n","other: 100%|██████████| 11/11 [00:00<00:00, 72657.23it/s]"]},{"output_type":"stream","name":"stdout","text":["✓ Capture1 (1).jpg processed successfully. Output saved to /content/batch_ocr_output\n","\n","================================================================================\n","BATCH PROCESSING SUMMARY\n","================================================================================\n","\n","--- Capture (5).jpg ---\n","Processed. Output saved to /content/batch_ocr_output\n","\n","\n","--- Capture1 (1).jpg ---\n","Processed. Output saved to /content/batch_ocr_output\n","\n","\n","Detailed results are saved in the directory: /content/batch_ocr_output\n"]},{"output_type":"stream","name":"stderr","text":["\n"]}],"source":["from PIL import Image\n","import time\n","import os\n","import torch\n","\n","# Upload multiple images\n","print(\"Upload multiple images for batch processing:\")\n","uploaded_files = files.upload()\n","\n","results = {}\n","output_path = '/content/batch_ocr_output' # Define a directory for batch output\n","\n","# Create output directory if it doesn't exist\n","if not os.path.exists(output_path):\n"," os.makedirs(output_path)\n","\n","for filename in uploaded_files.keys():\n"," print(f\"\\nProcessing {filename}...\")\n","\n"," try:\n"," # Construct the full image path in the current working directory\n"," image_path = os.path.join(os.getcwd(), filename)\n","\n"," # Define prompt (adjust based on DeepSeek-OCR's expected format)\n"," prompt = \"\\n<|grounding|>Convert the document to markdown. \"\n","\n"," with torch.no_grad():\n"," # Use the infer method for batch processing\n"," res = model.infer(tokenizer,\n"," prompt=prompt,\n"," image_file=image_path, # Use the uploaded image path\n"," output_path=output_path,\n"," base_size=1024,\n"," image_size=640,\n"," crop_mode=True,\n"," save_results=True,\n"," test_compress=True)\n","\n"," # The infer method with save_results=True saves the output to output_path\n"," # You might need to adjust how to retrieve or confirm the saved result\n"," # For this example, we'll just note that it was processed.\n"," results[filename] = f\"Processed. Output saved to {output_path}\"\n"," print(f\"✓ {filename} processed successfully. 
Output saved to {output_path}\")\n","\n"," except Exception as e:\n"," print(f\"✗ Error processing {filename}: {str(e)}\")\n"," results[filename] = f\"Error: {str(e)}\"\n","\n","# Display all results (or confirmation of processing)\n","print(\"\\n\" + \"=\" * 80)\n","print(\"BATCH PROCESSING SUMMARY\")\n","print(\"=\" * 80)\n","\n","for filename, result in results.items():\n"," print(f\"\\n--- {filename} ---\")\n"," print(result)\n"," print()\n","\n","print(f\"\\nDetailed results are saved in the directory: {output_path}\")\n","\n","# Note: Downloading the batch results as a single file might require\n","# zipping the output directory or iterating through saved files.\n","# This part is commented out as model.infer handles saving.\n","# with open('batch_results.txt', 'w', encoding='utf-8') as f:\n","# for filename, result in results.items():\n","# f.write(f\"{'='*80}\\n\")\n","# f.write(f\"File: {filename}\\n\")\n","# f.write(f\"{'='*80}\\n\")\n","# f.write(result)\n","# f.write(f\"\\n\\n\")\n","#\n","# files.download('batch_results.txt')"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"b533ce66","executionInfo":{"status":"ok","timestamp":1761052768625,"user_tz":-180,"elapsed":10037,"user":{"displayName":"hg ahcz","userId":"17954928916181846033"}},"outputId":"ad79b0f6-cbc4-46cd-fce1-9d542a0d3af4"},"source":["from transformers import AutoModel, AutoTokenizer\n","import torch\n","import os\n","\n","print(\"Loading DeepSeek-OCR model for batch processing...\")\n","\n","os.environ[\"CUDA_VISIBLE_DEVICES\"] = '0'\n","model_name = 'deepseek-ai/DeepSeek-OCR'\n","\n","tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)\n","# Removing attn_implementation='flash_attention_2' as a troubleshooting step\n","model = AutoModel.from_pretrained(model_name, trust_remote_code=True, use_safetensors=True)\n","model = model.eval().cuda().to(torch.bfloat16)\n","\n","print(\"Model loaded successfully for batch processing!\")"],"execution_count":23,"outputs":[{"output_type":"stream","name":"stdout","text":["Loading DeepSeek-OCR model for batch processing...\n"]},{"output_type":"stream","name":"stderr","text":["You are using a model of type deepseek_vl_v2 to instantiate a model of type DeepseekOCR. This is not supported for all configurations of models and can yield errors.\n","Some weights of DeepseekOCRForCausalLM were not initialized from the model checkpoint at deepseek-ai/DeepSeek-OCR and are newly initialized: ['model.vision_model.embeddings.position_ids']\n","You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"]},{"output_type":"stream","name":"stdout","text":["Model loaded successfully for batch processing!\n"]}]},{"cell_type":"markdown","metadata":{"id":"troubleshooting-header"},"source":["## Troubleshooting\n","\n","### Common Issues:\n","\n","1. **Out of Memory (OOM):**\n"," - Use a higher-tier GPU (A100, V100)\n"," - Reduce image resolution before processing\n"," - Enable gradient checkpointing\n","\n","2. **Flash Attention Installation Fails:**\n"," - Try removing `attn_implementation='flash_attention_2'` parameter\n"," - Fallback to standard attention mechanism\n","\n","3. **Model Download Slow:**\n"," - This is normal for large models (may take 10-15 minutes)\n"," - Model is cached after first download\n","\n","4. 
**Image Format Issues:**\n"," - Ensure image is in RGB format\n"," - Convert: `img = img.convert('RGB')`\n","\n","### Performance Tips:\n","\n","- Use images close to native resolutions: 512×512, 640×640, 1024×1024, 1280×1280\n","- For faster inference, use half precision such as `torch.bfloat16` (already enabled in the model-loading cell)\n","- Batch processing is more efficient for multiple images"]},{"cell_type":"markdown","metadata":{"id":"cleanup-header"},"source":["## Cleanup (Optional)\n","\n","Free up GPU memory when done."]},{"cell_type":"code","execution_count":21,"metadata":{"id":"cleanup","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1761052484637,"user_tz":-180,"elapsed":178,"user":{"displayName":"hg ahcz","userId":"17954928916181846033"}},"outputId":"80686745-04cb-4a6f-f2a6-f0ef2b5afa66"},"outputs":[{"output_type":"stream","name":"stdout","text":["GPU memory cleared\n"]}],"source":["# Clear GPU memory\n","import gc\n","\n","del model\n","del tokenizer\n","gc.collect()\n","torch.cuda.empty_cache()\n","\n","print(\"GPU memory cleared\")"]}],"metadata":{"accelerator":"GPU","colab":{"gpuType":"L4","provenance":[]},"kernelspec":{"display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.10.0"}},"nbformat":4,"nbformat_minor":0}