{"cells":[{"cell_type":"markdown","metadata":{"id":"header"},"source":["# DeepSeek-OCR on Google Colab\n","\n","This notebook sets up and runs the DeepSeek-OCR model for optical character recognition.\n","\n","**Requirements:**\n","- GPU Runtime (T4 or better recommended)\n","- ~15-20 minutes setup time\n","\n","**Based on:** https://github.com/deepseek-ai/DeepSeek-OCR"]},{"cell_type":"markdown","metadata":{"id":"setup-header"},"source":["## 1. Environment Setup and GPU Check"]},{"cell_type":"code","execution_count":6,"metadata":{"id":"gpu-check","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1761052135367,"user_tz":-180,"elapsed":209,"user":{"displayName":"hg ahcz","userId":"17954928916181846033"}},"outputId":"b7b51fa6-6b80-47bb-9b9d-bd8c45de233f"},"outputs":[{"output_type":"stream","name":"stdout","text":["Tue Oct 21 13:08:54 2025 \n","+-----------------------------------------------------------------------------------------+\n","| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |\n","|-----------------------------------------+------------------------+----------------------+\n","| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |\n","| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |\n","| | | MIG M. |\n","|=========================================+========================+======================|\n","| 0 NVIDIA L4 Off | 00000000:00:03.0 Off | 0 |\n","| N/A 41C P8 16W / 72W | 3MiB / 23034MiB | 0% Default |\n","| | | N/A |\n","+-----------------------------------------+------------------------+----------------------+\n"," \n","+-----------------------------------------------------------------------------------------+\n","| Processes: |\n","| GPU GI CI PID Type Process name GPU Memory |\n","| ID ID Usage |\n","|=========================================================================================|\n","| No running processes found |\n","+-----------------------------------------------------------------------------------------+\n","\n","PyTorch version: 2.8.0+cu126\n","CUDA available: True\n","CUDA version: 12.6\n","GPU: NVIDIA L4\n","GPU Memory: 22.16 GB\n"]}],"source":["# Check GPU availability\n","!nvidia-smi\n","\n","import torch\n","print(f\"\\nPyTorch version: {torch.__version__}\")\n","print(f\"CUDA available: {torch.cuda.is_available()}\")\n","if torch.cuda.is_available():\n"," print(f\"CUDA version: {torch.version.cuda}\")\n"," print(f\"GPU: {torch.cuda.get_device_name(0)}\")\n"," print(f\"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB\")"]},{"cell_type":"markdown","metadata":{"id":"clone-header"},"source":["## 2. 
Clone Repository"]},{"cell_type":"code","execution_count":2,"metadata":{"id":"clone-repo","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1761051945593,"user_tz":-180,"elapsed":1715,"user":{"displayName":"hg ahcz","userId":"17954928916181846033"}},"outputId":"04242349-7ab5-44a3-f54b-0e0838d77d1e"},"outputs":[{"output_type":"stream","name":"stdout","text":["Cloning into 'DeepSeek-OCR'...\n","remote: Enumerating objects: 34, done.\u001b[K\n","remote: Counting objects: 100% (4/4), done.\u001b[K\n","remote: Compressing objects: 100% (4/4), done.\u001b[K\n","remote: Total 34 (delta 0), reused 3 (delta 0), pack-reused 30 (from 1)\u001b[K\n","Receiving objects: 100% (34/34), 7.78 MiB | 17.63 MiB/s, done.\n","Resolving deltas: 100% (1/1), done.\n","/content/DeepSeek-OCR\n"]}],"source":["# Clone the DeepSeek-OCR repository\n","!git clone https://github.com/deepseek-ai/DeepSeek-OCR.git\n","%cd DeepSeek-OCR"]},{"cell_type":"markdown","metadata":{"id":"install-header"},"source":["## 3. Install Dependencies\n","\n","Installing PyTorch, transformers, and other required packages."]},{"cell_type":"code","execution_count":3,"metadata":{"id":"install-pytorch","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1761052048126,"user_tz":-180,"elapsed":6830,"user":{"displayName":"hg ahcz","userId":"17954928916181846033"}},"outputId":"7df6bde0-d8e3-4296-8f32-4055b680235a"},"outputs":[{"output_type":"stream","name":"stdout","text":["Looking in indexes: https://download.pytorch.org/whl/cu118\n","Requirement already satisfied: torch in /usr/local/lib/python3.12/dist-packages (2.8.0+cu126)\n","Requirement already satisfied: torchvision in /usr/local/lib/python3.12/dist-packages (0.23.0+cu126)\n","Requirement already satisfied: torchaudio in /usr/local/lib/python3.12/dist-packages (2.8.0+cu126)\n","Requirement already satisfied: filelock in /usr/local/lib/python3.12/dist-packages (from torch) (3.20.0)\n","Requirement already satisfied: typing-extensions>=4.10.0 in /usr/local/lib/python3.12/dist-packages (from torch) (4.15.0)\n","Requirement already satisfied: setuptools in /usr/local/lib/python3.12/dist-packages (from torch) (75.2.0)\n","Requirement already satisfied: sympy>=1.13.3 in /usr/local/lib/python3.12/dist-packages (from torch) (1.13.3)\n","Requirement already satisfied: networkx in /usr/local/lib/python3.12/dist-packages (from torch) (3.5)\n","Requirement already satisfied: jinja2 in /usr/local/lib/python3.12/dist-packages (from torch) (3.1.6)\n","Requirement already satisfied: fsspec in /usr/local/lib/python3.12/dist-packages (from torch) (2025.3.0)\n","Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.6.77 in /usr/local/lib/python3.12/dist-packages (from torch) (12.6.77)\n","Requirement already satisfied: nvidia-cuda-runtime-cu12==12.6.77 in /usr/local/lib/python3.12/dist-packages (from torch) (12.6.77)\n","Requirement already satisfied: nvidia-cuda-cupti-cu12==12.6.80 in /usr/local/lib/python3.12/dist-packages (from torch) (12.6.80)\n","Requirement already satisfied: nvidia-cudnn-cu12==9.10.2.21 in /usr/local/lib/python3.12/dist-packages (from torch) (9.10.2.21)\n","Requirement already satisfied: nvidia-cublas-cu12==12.6.4.1 in /usr/local/lib/python3.12/dist-packages (from torch) (12.6.4.1)\n","Requirement already satisfied: nvidia-cufft-cu12==11.3.0.4 in /usr/local/lib/python3.12/dist-packages (from torch) (11.3.0.4)\n","Requirement already satisfied: nvidia-curand-cu12==10.3.7.77 in 
/usr/local/lib/python3.12/dist-packages (from torch) (10.3.7.77)\n","Requirement already satisfied: nvidia-cusolver-cu12==11.7.1.2 in /usr/local/lib/python3.12/dist-packages (from torch) (11.7.1.2)\n","Requirement already satisfied: nvidia-cusparse-cu12==12.5.4.2 in /usr/local/lib/python3.12/dist-packages (from torch) (12.5.4.2)\n","Requirement already satisfied: nvidia-cusparselt-cu12==0.7.1 in /usr/local/lib/python3.12/dist-packages (from torch) (0.7.1)\n","Requirement already satisfied: nvidia-nccl-cu12==2.27.3 in /usr/local/lib/python3.12/dist-packages (from torch) (2.27.3)\n","Requirement already satisfied: nvidia-nvtx-cu12==12.6.77 in /usr/local/lib/python3.12/dist-packages (from torch) (12.6.77)\n","Requirement already satisfied: nvidia-nvjitlink-cu12==12.6.85 in /usr/local/lib/python3.12/dist-packages (from torch) (12.6.85)\n","Requirement already satisfied: nvidia-cufile-cu12==1.11.1.6 in /usr/local/lib/python3.12/dist-packages (from torch) (1.11.1.6)\n","Requirement already satisfied: triton==3.4.0 in /usr/local/lib/python3.12/dist-packages (from torch) (3.4.0)\n","Requirement already satisfied: numpy in /usr/local/lib/python3.12/dist-packages (from torchvision) (2.0.2)\n","Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /usr/local/lib/python3.12/dist-packages (from torchvision) (11.3.0)\n","Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.12/dist-packages (from sympy>=1.13.3->torch) (1.3.0)\n","Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.12/dist-packages (from jinja2->torch) (3.0.3)\n"]}],"source":["# Install PyTorch with CUDA support (Colab typically has CUDA 11.8 or 12.1)\n","# Note: Colab may already have PyTorch installed, but we ensure compatible version\n","!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118"]},{"cell_type":"code","execution_count":4,"metadata":{"id":"install-requirements","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1761052066852,"user_tz":-180,"elapsed":18724,"user":{"displayName":"hg ahcz","userId":"17954928916181846033"}},"outputId":"f5b37d6d-0019-4e44-d7dd-869270dc31bb"},"outputs":[{"output_type":"stream","name":"stdout","text":["Collecting transformers==4.46.3 (from -r requirements.txt (line 1))\n"," Downloading transformers-4.46.3-py3-none-any.whl.metadata (44 kB)\n","\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/44.1 kB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m44.1/44.1 kB\u001b[0m \u001b[31m3.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n","\u001b[?25hCollecting tokenizers==0.20.3 (from -r requirements.txt (line 2))\n"," Downloading tokenizers-0.20.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)\n","Collecting PyMuPDF (from -r requirements.txt (line 3))\n"," Downloading pymupdf-1.26.5-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)\n","Collecting img2pdf (from -r requirements.txt (line 4))\n"," Downloading img2pdf-0.6.1.tar.gz (106 kB)\n","\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m106.5/106.5 kB\u001b[0m \u001b[31m11.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n","\u001b[?25h Preparing metadata (setup.py) ... 
\u001b[?25l\u001b[?25hdone\n","Requirement already satisfied: einops in /usr/local/lib/python3.12/dist-packages (from -r requirements.txt (line 5)) (0.8.1)\n","Requirement already satisfied: easydict in /usr/local/lib/python3.12/dist-packages (from -r requirements.txt (line 6)) (1.13)\n","Collecting addict (from -r requirements.txt (line 7))\n"," Downloading addict-2.4.0-py3-none-any.whl.metadata (1.0 kB)\n","Requirement already satisfied: Pillow in /usr/local/lib/python3.12/dist-packages (from -r requirements.txt (line 8)) (11.3.0)\n","Requirement already satisfied: numpy in /usr/local/lib/python3.12/dist-packages (from -r requirements.txt (line 9)) (2.0.2)\n","Requirement already satisfied: filelock in /usr/local/lib/python3.12/dist-packages (from transformers==4.46.3->-r requirements.txt (line 1)) (3.20.0)\n","Requirement already satisfied: huggingface-hub<1.0,>=0.23.2 in /usr/local/lib/python3.12/dist-packages (from transformers==4.46.3->-r requirements.txt (line 1)) (0.35.3)\n","Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.12/dist-packages (from transformers==4.46.3->-r requirements.txt (line 1)) (25.0)\n","Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.12/dist-packages (from transformers==4.46.3->-r requirements.txt (line 1)) (6.0.3)\n","Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.12/dist-packages (from transformers==4.46.3->-r requirements.txt (line 1)) (2024.11.6)\n","Requirement already satisfied: requests in /usr/local/lib/python3.12/dist-packages (from transformers==4.46.3->-r requirements.txt (line 1)) (2.32.4)\n","Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.12/dist-packages (from transformers==4.46.3->-r requirements.txt (line 1)) (0.6.2)\n","Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.12/dist-packages (from transformers==4.46.3->-r requirements.txt (line 1)) (4.67.1)\n","Collecting pikepdf (from img2pdf->-r requirements.txt (line 4))\n"," Downloading pikepdf-9.11.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (8.2 kB)\n","Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<1.0,>=0.23.2->transformers==4.46.3->-r requirements.txt (line 1)) (2025.3.0)\n","Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<1.0,>=0.23.2->transformers==4.46.3->-r requirements.txt (line 1)) (4.15.0)\n","Requirement already satisfied: hf-xet<2.0.0,>=1.1.3 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<1.0,>=0.23.2->transformers==4.46.3->-r requirements.txt (line 1)) (1.1.10)\n","Collecting Deprecated (from pikepdf->img2pdf->-r requirements.txt (line 4))\n"," Downloading Deprecated-1.2.18-py2.py3-none-any.whl.metadata (5.7 kB)\n","Requirement already satisfied: lxml>=4.8 in /usr/local/lib/python3.12/dist-packages (from pikepdf->img2pdf->-r requirements.txt (line 4)) (5.4.0)\n","Requirement already satisfied: charset_normalizer<4,>=2 in /usr/local/lib/python3.12/dist-packages (from requests->transformers==4.46.3->-r requirements.txt (line 1)) (3.4.4)\n","Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.12/dist-packages (from requests->transformers==4.46.3->-r requirements.txt (line 1)) (3.11)\n","Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.12/dist-packages (from requests->transformers==4.46.3->-r requirements.txt (line 1)) 
(2.5.0)\n","Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.12/dist-packages (from requests->transformers==4.46.3->-r requirements.txt (line 1)) (2025.10.5)\n","Requirement already satisfied: wrapt<2,>=1.10 in /usr/local/lib/python3.12/dist-packages (from Deprecated->pikepdf->img2pdf->-r requirements.txt (line 4)) (1.17.3)\n","Downloading transformers-4.46.3-py3-none-any.whl (10.0 MB)\n","\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m10.0/10.0 MB\u001b[0m \u001b[31m129.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n","\u001b[?25hDownloading tokenizers-0.20.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)\n","\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.0/3.0 MB\u001b[0m \u001b[31m67.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n","\u001b[?25hDownloading pymupdf-1.26.5-cp39-abi3-manylinux_2_28_x86_64.whl (24.1 MB)\n","\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m24.1/24.1 MB\u001b[0m \u001b[31m101.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n","\u001b[?25hDownloading addict-2.4.0-py3-none-any.whl (3.8 kB)\n","Downloading pikepdf-9.11.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (2.6 MB)\n","\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.6/2.6 MB\u001b[0m \u001b[31m95.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n","\u001b[?25hDownloading Deprecated-1.2.18-py2.py3-none-any.whl (10.0 kB)\n","Building wheels for collected packages: img2pdf\n"," Building wheel for img2pdf (setup.py) ... \u001b[?25l\u001b[?25hdone\n"," Created wheel for img2pdf: filename=img2pdf-0.6.1-py3-none-any.whl size=51001 sha256=5898148565b9e8f8d7e2709de7f9503033d37bbc2b7ea4b749ef723e627f5c8f\n"," Stored in directory: /root/.cache/pip/wheels/a5/05/56/c05447973db749cd2178b8f95e36f007f0af5f5dce2c6197a5\n","Successfully built img2pdf\n","Installing collected packages: addict, PyMuPDF, Deprecated, pikepdf, tokenizers, img2pdf, transformers\n"," Attempting uninstall: tokenizers\n"," Found existing installation: tokenizers 0.22.1\n"," Uninstalling tokenizers-0.22.1:\n"," Successfully uninstalled tokenizers-0.22.1\n"," Attempting uninstall: transformers\n"," Found existing installation: transformers 4.57.1\n"," Uninstalling transformers-4.57.1:\n"," Successfully uninstalled transformers-4.57.1\n","Successfully installed Deprecated-1.2.18 PyMuPDF-1.26.5 addict-2.4.0 img2pdf-0.6.1 pikepdf-9.11.0 tokenizers-0.20.3 transformers-4.46.3\n"]}],"source":["# Install requirements from the repository\n","!pip install -r requirements.txt"]},{"cell_type":"code","execution_count":9,"metadata":{"id":"install-flash-attn","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1761052247111,"user_tz":-180,"elapsed":30354,"user":{"displayName":"hg ahcz","userId":"17954928916181846033"}},"outputId":"6d507296-90f0-44b6-a06d-baba0ad73f89"},"outputs":[{"output_type":"stream","name":"stdout","text":["Collecting flash-attn==2.7.3\n"," Downloading flash_attn-2.7.3.tar.gz (3.2 MB)\n","\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/3.2 MB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m \u001b[32m3.1/3.2 MB\u001b[0m \u001b[31m96.2 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.2/3.2 
MB\u001b[0m \u001b[31m50.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n","\u001b[?25h Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n","Requirement already satisfied: torch in /usr/local/lib/python3.12/dist-packages (from flash-attn==2.7.3) (2.8.0+cu126)\n","Requirement already satisfied: einops in /usr/local/lib/python3.12/dist-packages (from flash-attn==2.7.3) (0.8.1)\n","Requirement already satisfied: filelock in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (3.20.0)\n","Requirement already satisfied: typing-extensions>=4.10.0 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (4.15.0)\n","Requirement already satisfied: setuptools in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (75.2.0)\n","Requirement already satisfied: sympy>=1.13.3 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (1.13.3)\n","Requirement already satisfied: networkx in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (3.5)\n","Requirement already satisfied: jinja2 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (3.1.6)\n","Requirement already satisfied: fsspec in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (2025.3.0)\n","Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.6.77 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.6.77)\n","Requirement already satisfied: nvidia-cuda-runtime-cu12==12.6.77 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.6.77)\n","Requirement already satisfied: nvidia-cuda-cupti-cu12==12.6.80 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.6.80)\n","Requirement already satisfied: nvidia-cudnn-cu12==9.10.2.21 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (9.10.2.21)\n","Requirement already satisfied: nvidia-cublas-cu12==12.6.4.1 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.6.4.1)\n","Requirement already satisfied: nvidia-cufft-cu12==11.3.0.4 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (11.3.0.4)\n","Requirement already satisfied: nvidia-curand-cu12==10.3.7.77 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (10.3.7.77)\n","Requirement already satisfied: nvidia-cusolver-cu12==11.7.1.2 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (11.7.1.2)\n","Requirement already satisfied: nvidia-cusparse-cu12==12.5.4.2 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.5.4.2)\n","Requirement already satisfied: nvidia-cusparselt-cu12==0.7.1 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (0.7.1)\n","Requirement already satisfied: nvidia-nccl-cu12==2.27.3 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (2.27.3)\n","Requirement already satisfied: nvidia-nvtx-cu12==12.6.77 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.6.77)\n","Requirement already satisfied: nvidia-nvjitlink-cu12==12.6.85 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.6.85)\n","Requirement already satisfied: nvidia-cufile-cu12==1.11.1.6 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (1.11.1.6)\n","Requirement already satisfied: triton==3.4.0 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (3.4.0)\n","Requirement 
already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.12/dist-packages (from sympy>=1.13.3->torch->flash-attn==2.7.3) (1.3.0)\n","Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.12/dist-packages (from jinja2->torch->flash-attn==2.7.3) (3.0.3)\n","Building wheels for collected packages: flash-attn\n"," Building wheel for flash-attn (setup.py) ... \u001b[?25l\u001b[?25hdone\n"," Created wheel for flash-attn: filename=flash_attn-2.7.3-cp312-cp312-linux_x86_64.whl size=414494788 sha256=567bddcae6f7c133fd964bed9988926fe7aabaddb58bf62a744b2f782a7d4269\n"," Stored in directory: /root/.cache/pip/wheels/f6/ba/3a/e5622e4a21e0735b65d5f7a0aca41c83467aaf2122031d214e\n","Successfully built flash-attn\n","Installing collected packages: flash-attn\n","Successfully installed flash-attn-2.7.3\n"]}],"source":["# Install flash-attention (this may take 5-10 minutes to compile)\n","!pip install flash-attn==2.7.3 --no-build-isolation"]},{"cell_type":"markdown","metadata":{"id":"upload-header"},"source":["## 4. Upload Test Image\n","\n","Upload your Capture.PNG file here."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"upload-image"},"outputs":[],"source":["from google.colab import files\n","from IPython.display import Image, display\n","import os\n","\n","# Upload the image\n","print(\"Please upload your Capture.PNG file:\")\n","uploaded = files.upload()\n","\n","# Get the uploaded filename\n","image_path = list(uploaded.keys())[0]\n","print(f\"\\nUploaded file: {image_path}\")\n","\n","# Display the uploaded image\n","print(\"\\nPreview of uploaded image:\")\n","display(Image(filename=image_path))"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"91486119","executionInfo":{"status":"ok","timestamp":1761052289487,"user_tz":-180,"elapsed":4318,"user":{"displayName":"hg ahcz","userId":"17954928916181846033"}},"outputId":"fd63dddd-f81e-4600-e1a4-d16c643e28d4"},"source":["# Reinstall flash-attention with specific CUDA version\n","# Check your CUDA version with !nvidia-smi and adjust cu121 if necessary\n","!pip install flash-attn==2.7.3 --no-build-isolation --index-url https://download.pytorch.org/whl/cu121"],"execution_count":12,"outputs":[{"output_type":"stream","name":"stdout","text":["Looking in indexes: https://download.pytorch.org/whl/cu121\n","Requirement already satisfied: flash-attn==2.7.3 in /usr/local/lib/python3.12/dist-packages (2.7.3)\n","Requirement already satisfied: torch in /usr/local/lib/python3.12/dist-packages (from flash-attn==2.7.3) (2.8.0+cu126)\n","Requirement already satisfied: einops in /usr/local/lib/python3.12/dist-packages (from flash-attn==2.7.3) (0.8.1)\n","Requirement already satisfied: filelock in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (3.20.0)\n","Requirement already satisfied: typing-extensions>=4.10.0 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (4.15.0)\n","Requirement already satisfied: setuptools in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (75.2.0)\n","Requirement already satisfied: sympy>=1.13.3 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (1.13.3)\n","Requirement already satisfied: networkx in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (3.5)\n","Requirement already satisfied: jinja2 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (3.1.6)\n","Requirement already satisfied: fsspec in 
/usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (2025.3.0)\n","Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.6.77 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.6.77)\n","Requirement already satisfied: nvidia-cuda-runtime-cu12==12.6.77 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.6.77)\n","Requirement already satisfied: nvidia-cuda-cupti-cu12==12.6.80 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.6.80)\n","Requirement already satisfied: nvidia-cudnn-cu12==9.10.2.21 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (9.10.2.21)\n","Requirement already satisfied: nvidia-cublas-cu12==12.6.4.1 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.6.4.1)\n","Requirement already satisfied: nvidia-cufft-cu12==11.3.0.4 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (11.3.0.4)\n","Requirement already satisfied: nvidia-curand-cu12==10.3.7.77 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (10.3.7.77)\n","Requirement already satisfied: nvidia-cusolver-cu12==11.7.1.2 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (11.7.1.2)\n","Requirement already satisfied: nvidia-cusparse-cu12==12.5.4.2 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.5.4.2)\n","Requirement already satisfied: nvidia-cusparselt-cu12==0.7.1 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (0.7.1)\n","Requirement already satisfied: nvidia-nccl-cu12==2.27.3 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (2.27.3)\n","Requirement already satisfied: nvidia-nvtx-cu12==12.6.77 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.6.77)\n","Requirement already satisfied: nvidia-nvjitlink-cu12==12.6.85 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (12.6.85)\n","Requirement already satisfied: nvidia-cufile-cu12==1.11.1.6 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (1.11.1.6)\n","Requirement already satisfied: triton==3.4.0 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn==2.7.3) (3.4.0)\n","Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.12/dist-packages (from sympy>=1.13.3->torch->flash-attn==2.7.3) (1.3.0)\n","Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.12/dist-packages (from jinja2->torch->flash-attn==2.7.3) (3.0.3)\n"]}]},{"cell_type":"markdown","metadata":{"id":"model-header"},"source":["## 5. Load DeepSeek-OCR Model\n","\n","This will download the model from HuggingFace (may take a few minutes)."]},{"cell_type":"code","execution_count":13,"metadata":{"id":"load-model","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1761052308534,"user_tz":-180,"elapsed":12795,"user":{"displayName":"hg ahcz","userId":"17954928916181846033"}},"outputId":"512486e6-1a87-42f2-82f7-4a04093703a4"},"outputs":[{"output_type":"stream","name":"stdout","text":["Loading DeepSeek-OCR model...\n","This may take several minutes on first run...\n","\n"]},{"output_type":"stream","name":"stderr","text":["You are using a model of type deepseek_vl_v2 to instantiate a model of type DeepseekOCR. 
This is not supported for all configurations of models and can yield errors.\n","Some weights of DeepseekOCRForCausalLM were not initialized from the model checkpoint at deepseek-ai/DeepSeek-OCR and are newly initialized: ['model.vision_model.embeddings.position_ids']\n","You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"]},{"output_type":"stream","name":"stdout","text":["Model loaded successfully!\n","Model device: cuda:0\n","Model dtype: torch.bfloat16\n"]}],"source":["from transformers import AutoModel, AutoTokenizer\n","import torch\n","import os\n","\n","print(\"Loading DeepSeek-OCR model...\")\n","print(\"This may take several minutes on first run...\\n\")\n","\n","os.environ[\"CUDA_VISIBLE_DEVICES\"] = '0'\n","model_name = 'deepseek-ai/DeepSeek-OCR'\n","\n","tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)\n","# Removing attn_implementation='flash_attention_2' as a troubleshooting step\n","model = AutoModel.from_pretrained(model_name, trust_remote_code=True, use_safetensors=True)\n","model = model.eval().cuda().to(torch.bfloat16)\n","\n","print(\"Model loaded successfully!\")\n","print(f\"Model device: {next(model.parameters()).device}\")\n","print(f\"Model dtype: {next(model.parameters()).dtype}\")"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"cabad6a9","executionInfo":{"status":"ok","timestamp":1761052382571,"user_tz":-180,"elapsed":10192,"user":{"displayName":"hg ahcz","userId":"17954928916181846033"}},"outputId":"787ea67c-7d5a-4793-e662-7081561440b2"},"source":["from transformers import AutoModel, AutoTokenizer\n","import torch\n","import os\n","\n","print(\"Loading DeepSeek-OCR model...\")\n","print(\"This may take several minutes on first run...\\n\")\n","\n","os.environ[\"CUDA_VISIBLE_DEVICES\"] = '0'\n","model_name = 'deepseek-ai/DeepSeek-OCR'\n","\n","tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)\n","# Removing attn_implementation='flash_attention_2' as a troubleshooting step\n","model = AutoModel.from_pretrained(model_name, trust_remote_code=True, use_safetensors=True)\n","model = model.eval().cuda().to(torch.bfloat16)\n","\n","print(\"Model loaded successfully!\")\n","print(f\"Model device: {next(model.parameters()).device}\")\n","print(f\"Model dtype: {next(model.parameters()).dtype}\")"],"execution_count":16,"outputs":[{"output_type":"stream","name":"stdout","text":["Loading DeepSeek-OCR model...\n","This may take several minutes on first run...\n","\n"]},{"output_type":"stream","name":"stderr","text":["You are using a model of type deepseek_vl_v2 to instantiate a model of type DeepseekOCR. This is not supported for all configurations of models and can yield errors.\n","Some weights of DeepseekOCRForCausalLM were not initialized from the model checkpoint at deepseek-ai/DeepSeek-OCR and are newly initialized: ['model.vision_model.embeddings.position_ids']\n","You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"]},{"output_type":"stream","name":"stdout","text":["Model loaded successfully!\n","Model device: cuda:0\n","Model dtype: torch.bfloat16\n"]}]},{"cell_type":"markdown","metadata":{"id":"inference-header"},"source":["## 6. 
Run OCR Inference"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"b902dbdc","executionInfo":{"status":"ok","timestamp":1761052448331,"user_tz":-180,"elapsed":44254,"user":{"displayName":"hg ahcz","userId":"17954928916181846033"}},"outputId":"fc940e42-39d2-4cbb-d565-63afa61a505a"},"source":["from PIL import Image\n","import time\n","import os\n","import torch\n","\n","# Load the image (already loaded in a previous cell, but keeping this for clarity)\n","# img = Image.open(image_path)\n","# print(f\"Image size: {img.size}\")\n","# print(f\"Image mode: {img.mode}\\n\")\n","\n","# Set CUDA device (already set in model loading, but keeping for clarity)\n","# os.environ[\"CUDA_VISIBLE_DEVICES\"] = '0'\n","\n","print(\"Running OCR inference using model.infer...\\n\")\n","start_time = time.time()\n","\n","# Define prompt and output path\n","# prompt = \"\\nFree OCR. \"\n","prompt = \"\\n<|grounding|>Convert the document to markdown. \"\n","output_path = '/content/ocr_output' # Define an output directory\n","\n","# Create output directory if it doesn't exist\n","if not os.path.exists(output_path):\n"," os.makedirs(output_path)\n","\n","# Run inference using the infer method\n","with torch.no_grad():\n"," # infer(self, tokenizer, prompt='', image_file='', output_path = ' ', base_size = 1024, image_size = 640, crop_mode = True, test_compress = False, save_results = False):\n","\n"," # Tiny: base_size = 512, image_size = 512, crop_mode = False\n"," # Small: base_size = 640, image_size = 640, crop_mode = False\n"," # Base: base_size = 1024, image_size = 1024, crop_mode = False\n"," # Large: base_size = 1280, image_size = 1280, crop_mode = False\n","\n"," # Gundam: base_size = 1024, image_size = 640, crop_mode = True\n","\n"," res = model.infer(tokenizer,\n"," prompt=prompt,\n"," image_file=image_path, # Use the uploaded image path\n"," output_path=output_path,\n"," base_size=1024,\n"," image_size=640,\n"," crop_mode=True,\n"," save_results=True,\n"," test_compress=True)\n","\n","end_time = time.time()\n","\n","print(f\"Inference completed in {end_time - start_time:.2f} seconds\\n\")\n","print(\"=\" * 80)\n","print(\"OCR RESULT:\")\n","print(\"=\" * 80)\n","# The infer method might return different formats,\n","# we will assume it returns the text directly or in a structure we can access.\n","# You might need to adjust this based on the actual output format of model.infer\n","print(res)\n","print(\"=\" * 80)\n","\n","# Note: The infer method with save_results=True should save the output to output_path\n","# You might need to adjust the saving and downloading logic in the next cell\n","# depending on how model.infer saves the results."],"execution_count":19,"outputs":[{"output_type":"stream","name":"stderr","text":["/usr/local/lib/python3.12/dist-packages/transformers/generation/configuration_utils.py:590: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.\n"," warnings.warn(\n","The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. 
Please pass your input's `attention_mask` to obtain reliable results.\n","Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n"]},{"output_type":"stream","name":"stdout","text":["Running OCR inference using model.infer...\n","\n","=====================\n","BASE: torch.Size([1, 256, 1280])\n","PATCHES: torch.Size([4, 100, 1280])\n","=====================\n","<|ref|>text<|/ref|><|det|>[[62, 31, 483, 171]]<|/det|>\n","·We assess a wide range of state-of-the-art LLMs for the first time and empirically show that they exhibit significant patterns of bias related to non-binary gender representations, leaving room for future improvement. \n","\n","<|ref|>sub_title<|/ref|><|det|>[[62, 198, 235, 225]]<|/det|>\n","## 2 Related Work \n","\n","<|ref|>sub_title<|/ref|><|det|>[[62, 246, 373, 272]]<|/det|>\n","### 2.1 Binary Gender Bias in LLMs \n","\n","<|ref|>text<|/ref|><|det|>[[62, 283, 485, 992]]<|/det|>\n","Research on gender bias in artificial intelligence, especially in large language models (LLMs), has predominantly centered on binary gender categories, often reinforcing conventional stereotypes while overlooking the complexities of gender diversity (Blodgett et al., 2020; Nadeem et al., 2021; Schramowski et al., 2022; Stanovsky et al., 2019). Studies such as Bolukbasi et al. (2016) revealed that word embeddings trained in large corpora encode harmful gender stereotypes, associating men with technical roles and women with nurturing roles. Further research has demonstrated that LLMs often exhibit occupational gender bias, reinforcing male-dominated professions and associating women with domestic tasks (Zhao et al., 2018; Brown et al., 2020a; Wan et al., 2023; Ghosh and Caliskan, 2023; Chen et al., 2022). For example, Brown et al. (2020b) examined binary gender bias in GPT- 3 by prompting the model with phrases such as \"[He] was very\" and \"[She] was very\" and analyzing whether the adjectives and adverbs reflected gender stereotypes (e.g., \"handsome\" for men and \"beautiful\" for women). Chen et al. (2022) proposed a framework for measuring how LLMs reinforce gender stereotypes through role-based \n","\n","<|ref|>table<|/ref|><|det|>[[515, 24, 933, 355]]<|/det|>\n","\n","
<table><tr><td>Pronoun Type</td><td>Nom.</td><td>Acc.</td><td>Possessive</td><td></td><td>Ref.</td></tr>
<tr><td></td><td></td><td></td><td>Dep.</td><td>Indep.</td><td></td></tr>
<tr><td>Binary</td><td>he</td><td>him</td><td>his</td><td>his</td><td>himself</td></tr>
<tr><td></td><td>she</td><td>her</td><td>her</td><td>hers</td><td>herself</td></tr>
<tr><td>Neutral</td><td>they</td><td>them</td><td>their</td><td>theirs</td><td>themself</td></tr>
<tr><td>Neo</td><td>thon</td><td>thon</td><td>thons</td><td>thons</td><td>thonself</td></tr>
<tr><td></td><td>e</td><td>em</td><td>es</td><td>ems</td><td>emself</td></tr>
<tr><td></td><td>ae</td><td>aer</td><td>aer</td><td>aers</td><td>aerself</td></tr>
<tr><td></td><td>co</td><td>co</td><td>cos</td><td>cos</td><td>coself</td></tr>
<tr><td></td><td>vi</td><td>vir</td><td>vis</td><td>virs</td><td>virself</td></tr>
<tr><td></td><td>xe</td><td>xem</td><td>xyr</td><td>xyr</td><td>xemself</td></tr>
<tr><td></td><td>ey</td><td>em</td><td>eir</td><td>eirs</td><td>emself</td></tr>
<tr><td></td><td>ze</td><td>zir</td><td>zir</td><td>zirs</td><td>zirself</td></tr></table>
\n","\n","<|ref|>table_footnote<|/ref|><|det|>[[512, 375, 936, 421]]<|/det|>\n","Table 1: List of binary, gender-neutral, and neopronouns (Lauscher et al., 2022; Hossain et al., 2023). \n","\n","<|ref|>text<|/ref|><|det|>[[512, 448, 936, 985]]<|/det|>\n","communities. Blodgett et al. (2020) argued that many studies assessing bias in NLP systems lack grounding in real- world harms and do not adequately consider \"to whom\" these biases are harmful, particularly overlooking non- binary identities. Although datasets like StereoSet (Nadeem et al., 2021) and CrowS- Pairs (Nangia et al., 2020) have made progress in measuring stereotypical biases, they do not specifically address non- binary representation or experiences. Recent work has begun addressing this gap. You et al. (2024) explored name- based gender prediction with a \"neutral\" gender category. Hossain et al. (2023) introduced the MISGENDERED framework, evaluating LLMs on their use of gender- neutral pronouns and neopronouns. Similarly, Ovalle et al. (2023) examined how LLMs misgender transgender and non- binary (TGNB) individuals, revealing that binary norms dominate AI behavior and showing LLMs are less\n","==================================================\n","image size: (871, 784)\n","valid image tokens: 630\n","output texts tokens (valid): 1038\n","compression ratio: 1.65\n","==================================================\n","===============save results:===============\n"]},{"output_type":"stream","name":"stderr","text":["image: 0it [00:00, ?it/s]\n","other: 100%|██████████| 7/7 [00:00<00:00, 41352.29it/s]"]},{"output_type":"stream","name":"stdout","text":["Inference completed in 44.24 seconds\n","\n","================================================================================\n","OCR RESULT:\n","================================================================================\n","None\n","================================================================================\n"]},{"output_type":"stream","name":"stderr","text":["\n"]}]},{"cell_type":"markdown","metadata":{"id":"batch-header"},"source":["## 8. Batch Processing (Optional)\n","\n","Process multiple images at once."]},{"cell_type":"code","execution_count":27,"metadata":{"id":"batch-process","colab":{"base_uri":"https://localhost:8080/","height":1000},"executionInfo":{"status":"ok","timestamp":1761053029367,"user_tz":-180,"elapsed":99783,"user":{"displayName":"hg ahcz","userId":"17954928916181846033"}},"outputId":"0b15fc8e-e16f-4dcb-cf3a-3a63a5901b03"},"outputs":[{"output_type":"stream","name":"stdout","text":["Upload multiple images for batch processing:\n"]},{"output_type":"display_data","data":{"text/plain":[""],"text/html":["\n"," \n"," \n"," Upload widget is only available when the cell has been executed in the\n"," current browser session. Please rerun this cell to enable.\n"," \n"," "]},"metadata":{}},{"output_type":"stream","name":"stderr","text":["/usr/local/lib/python3.12/dist-packages/transformers/generation/configuration_utils.py:590: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.\n"," warnings.warn(\n","The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. 
Please pass your input's `attention_mask` to obtain reliable results.\n","Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n"]},{"output_type":"stream","name":"stdout","text":["Saving Capture.jpg to Capture (5).jpg\n","Saving Capture1.jpg to Capture1 (1).jpg\n","\n","Processing Capture (5).jpg...\n","=====================\n","BASE: torch.Size([1, 256, 1280])\n","PATCHES: torch.Size([4, 100, 1280])\n","=====================\n","<|ref|>text<|/ref|><|det|>[[62, 31, 483, 171]]<|/det|>\n","·We assess a wide range of state-of-the-art LLMs for the first time and empirically show that they exhibit significant patterns of bias related to non-binary gender representations, leaving room for future improvement. \n","\n","<|ref|>sub_title<|/ref|><|det|>[[62, 198, 235, 225]]<|/det|>\n","## 2 Related Work \n","\n","<|ref|>sub_title<|/ref|><|det|>[[62, 246, 373, 272]]<|/det|>\n","### 2.1 Binary Gender Bias in LLMs \n","\n","<|ref|>text<|/ref|><|det|>[[62, 283, 485, 992]]<|/det|>\n","Research on gender bias in artificial intelligence, especially in large language models (LLMs), has predominantly centered on binary gender categories, often reinforcing conventional stereotypes while overlooking the complexities of gender diversity (Blodgett et al., 2020; Nadeem et al., 2021; Schramowski et al., 2022; Stanovsky et al., 2019). Studies such as Bolukbasi et al. (2016) revealed that word embeddings trained in large corpora encode harmful gender stereotypes, associating men with technical roles and women with nurturing roles. Further research has demonstrated that LLMs often exhibit occupational gender bias, reinforcing male-dominated professions and associating women with domestic tasks (Zhao et al., 2018; Brown et al., 2020a; Wan et al., 2023; Ghosh and Caliskan, 2023; Chen et al., 2022). For example, Brown et al. (2020b) examined binary gender bias in GPT- 3 by prompting the model with phrases such as \"[He] was very\" and \"[She] was very\" and analyzing whether the adjectives and adverbs reflected gender stereotypes (e.g., \"handsome\" for men and \"beautiful\" for women). Chen et al. (2022) proposed a framework for measuring how LLMs reinforce gender stereotypes through role-based \n","\n","<|ref|>table<|/ref|><|det|>[[515, 24, 933, 355]]<|/det|>\n","\n","
<table><tr><td>Pronoun Type</td><td>Nom.</td><td>Acc.</td><td>Possessive</td><td></td><td>Ref.</td></tr>
<tr><td></td><td></td><td></td><td>Dep.</td><td>Indep.</td><td></td></tr>
<tr><td>Binary</td><td>he</td><td>him</td><td>his</td><td>his</td><td>himself</td></tr>
<tr><td></td><td>she</td><td>her</td><td>her</td><td>hers</td><td>herself</td></tr>
<tr><td>Neutral</td><td>they</td><td>them</td><td>their</td><td>theirs</td><td>themself</td></tr>
<tr><td>Neo</td><td>thon</td><td>thon</td><td>thons</td><td>thons</td><td>thonself</td></tr>
<tr><td></td><td>e</td><td>em</td><td>es</td><td>ems</td><td>emself</td></tr>
<tr><td></td><td>ae</td><td>aer</td><td>aer</td><td>aers</td><td>aerself</td></tr>
<tr><td></td><td>co</td><td>co</td><td>cos</td><td>cos</td><td>coself</td></tr>
<tr><td></td><td>vi</td><td>vir</td><td>vis</td><td>virs</td><td>virself</td></tr>
<tr><td></td><td>xe</td><td>xem</td><td>xyr</td><td>xyr</td><td>xemself</td></tr>
<tr><td></td><td>ey</td><td>em</td><td>eir</td><td>eirs</td><td>emself</td></tr>
<tr><td></td><td>ze</td><td>zir</td><td>zir</td><td>zirs</td><td>zirself</td></tr></table>
\n","\n","<|ref|>table_footnote<|/ref|><|det|>[[512, 375, 936, 421]]<|/det|>\n","Table 1: List of binary, gender-neutral, and neopronouns (Lauscher et al., 2022; Hossain et al., 2023). \n","\n","<|ref|>text<|/ref|><|det|>[[512, 448, 936, 985]]<|/det|>\n","communities. Blodgett et al. (2020) argued that many studies assessing bias in NLP systems lack grounding in real- world harms and do not adequately consider \"to whom\" these biases are harmful, particularly overlooking non- binary identities. Although datasets like StereoSet (Nadeem et al., 2021) and CrowS- Pairs (Nangia et al., 2020) have made progress in measuring stereotypical biases, they do not specifically address non- binary representation or experiences. Recent work has begun addressing this gap. You et al. (2024) explored name- based gender prediction with a \"neutral\" gender category. Hossain et al. (2023) introduced the MISGENDERED framework, evaluating LLMs on their use of gender- neutral pronouns and neopronouns. Similarly, Ovalle et al. (2023) examined how LLMs misgender transgender and non- binary (TGNB) individuals, revealing that binary norms dominate AI behavior and showing LLMs are less\n","==================================================\n","image size: (871, 784)\n","valid image tokens: 630\n","output texts tokens (valid): 1038\n","compression ratio: 1.65\n","==================================================\n","===============save results:===============\n"]},{"output_type":"stream","name":"stderr","text":["image: 0it [00:00, ?it/s]\n","other: 100%|██████████| 7/7 [00:00<00:00, 68279.37it/s]"]},{"output_type":"stream","name":"stdout","text":["✓ Capture (5).jpg processed successfully. Output saved to /content/batch_ocr_output\n","\n","Processing Capture1 (1).jpg...\n"]},{"output_type":"stream","name":"stderr","text":["\n","The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n","Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n"]},{"output_type":"stream","name":"stdout","text":["=====================\n","BASE: torch.Size([1, 256, 1280])\n","PATCHES: torch.Size([4, 100, 1280])\n","=====================\n","<|ref|>text<|/ref|><|det|>[[20, 0, 475, 338]]<|/det|>\n","Retrieval Augmented Generation system for LLM agents. SCMRAG introduces a novel paradigm that moves beyond the static retrieval methods of traditional RAG systems by integrating a dynamic, LLM- assisted knowledge graph for information retrieval. This knowledge graph evolves with the system, updating and refining itself based on the SCMRAG's agent driven interactions and query- answer pair generations. Crucially, SCMRAG also includes a self- corrective mechanism, enabling it to identify when information is missing or inadequate and autonomously retrieves it from external sources (e.g. web, enterprise information sources, or any other available information resources) by generating a new retrieval query without relying on predefined algorithms. This self- corrective step ensures that up- to- date and accurate information is always accessible. \n","\n","<|ref|>text<|/ref|><|det|>[[20, 338, 475, 500]]<|/det|>\n","Another key feature of SCMRAG is its LLM agent driven internal reasoning agent. It gives the system the decision- making capability to determine whether the knowledge graph contains sufficient information to answer a query or whether a corrective step is necessary to enhance the retrieval process. 
It enables SCMRAG to adapt to a wide range of tasks and domains while minimizing hallucinations. \n","\n","<|ref|>text<|/ref|><|det|>[[20, 504, 475, 602]]<|/det|>\n","SCMRAG ensures that only the most relevant content is retrieved from available data sources, even when the knowledge base is incomplete or outdated. The key contributions of our proposed method are as follows: \n","\n","<|ref|>text<|/ref|><|det|>[[42, 618, 475, 978]]<|/det|>\n","1. We introduce a novel RAG paradigm that employs a dynamic, self-updating knowledge graph to guide multihop retrieval, allowing for more context-aware and accurate information retrieval. \n","2. We propose a self-corrective, agent-driven mechanism that enables SCMRAG to autonomously update missing or outdated information by fetching data from external sources. \n","3. We achieve state-of-the-art performance on four datasets, even when using a quantized LLM with significantly fewer parameters. Notably, these results are obtained without any LLM fine-tuning. \n","4. We demonstrate that SCMRAG's advanced reasoning capabilities significantly reduce hallucinations by ensuring that only the most relevant and accurate information is provided to the LLM for generation. \n","\n","<|ref|>text<|/ref|><|det|>[[520, 0, 828, 20]]<|/det|>\n","pretraining with vast amounts of knowledge. \n","\n","<|ref|>sub_title<|/ref|><|det|>[[518, 55, 680, 80]]<|/det|>\n","### 2.1 Initial Work \n","\n","<|ref|>text<|/ref|><|det|>[[518, 85, 972, 384]]<|/det|>\n","Works such as the RAG model proposed by Lewis et al. [13] were instrumental in showing that augmenting a generation model with a retrieval step could greatly improve the factual correctness of AI- generated text. Lewis et al. introduced the two- stage RAG process, where a retriever is responsible for fetching relevant documents based on a query, and a generator produces text conditioned on these retrieved documents. This approach was proven to outperform purely generative or purely extractive models in tasks such as knowledge- based QA and passage generation. This dual system ensures that the model's output is grounded in real- world data. It highlighted the importance of coupling retrieval systems with LLMs to enhance performance in open- domain tasks. \n","\n","<|ref|>sub_title<|/ref|><|det|>[[518, 410, 849, 436]]<|/det|>\n","### 2.2 Advances in RAG Architecture \n","\n","<|ref|>text<|/ref|><|det|>[[518, 441, 972, 616]]<|/det|>\n","Several models have introduced further innovations to improve the efficiency and accuracy of retrieval mechanisms. Karpukhin et al. developed Dense Passage Retrieval (DPR) [9], a technique that leverages dense vector representations for more accurate retrieval of semantically relevant passages. DPR became foundational in improving the retriever's ability to return highly relevant documents from vast corpora. \n","\n","<|ref|>text<|/ref|><|det|>[[518, 618, 972, 789]]<|/det|>\n","Later advancements in RAG systems sought to optimize both the retrieval and generation phases. Fusion- in- Decoder [8] integrated multiple retrieved documents simultaneously within the decoder, allowing the model to generate answers that more holistically synthesized information from various sources. This method allowed for more contextual outputs, and was effective in handling multi- hop questions requiring reasoning across multiple documents. 
\n","\n","<|ref|>text<|/ref|><|det|>[[518, 792, 972, 960]]<|/det|>\n","A critical issue with these approaches is the reliance on static retrieval corpora, which limits the system's ability to access up- to- date information, leading to outdated or incomplete responses in rapidly evolving domains. Moreover, the retriever and generator components in transformer based RAG models are generally trained separately. This often leads to mismatches between retrieved documents and generated content.\n","==================================================\n","image size: (887, 719)\n","valid image tokens: 607\n","output texts tokens (valid): 1018\n","compression ratio: 1.68\n","==================================================\n","===============save results:===============\n"]},{"output_type":"stream","name":"stderr","text":["image: 0it [00:00, ?it/s]\n","other: 100%|██████████| 11/11 [00:00<00:00, 72657.23it/s]"]},{"output_type":"stream","name":"stdout","text":["✓ Capture1 (1).jpg processed successfully. Output saved to /content/batch_ocr_output\n","\n","================================================================================\n","BATCH PROCESSING SUMMARY\n","================================================================================\n","\n","--- Capture (5).jpg ---\n","Processed. Output saved to /content/batch_ocr_output\n","\n","\n","--- Capture1 (1).jpg ---\n","Processed. Output saved to /content/batch_ocr_output\n","\n","\n","Detailed results are saved in the directory: /content/batch_ocr_output\n"]},{"output_type":"stream","name":"stderr","text":["\n"]}],"source":["from PIL import Image\n","import time\n","import os\n","import torch\n","\n","# Upload multiple images\n","print(\"Upload multiple images for batch processing:\")\n","uploaded_files = files.upload()\n","\n","results = {}\n","output_path = '/content/batch_ocr_output' # Define a directory for batch output\n","\n","# Create output directory if it doesn't exist\n","if not os.path.exists(output_path):\n"," os.makedirs(output_path)\n","\n","for filename in uploaded_files.keys():\n"," print(f\"\\nProcessing {filename}...\")\n","\n"," try:\n"," # Construct the full image path in the current working directory\n"," image_path = os.path.join(os.getcwd(), filename)\n","\n"," # Define prompt (adjust based on DeepSeek-OCR's expected format)\n"," prompt = \"\\n<|grounding|>Convert the document to markdown. \"\n","\n"," with torch.no_grad():\n"," # Use the infer method for batch processing\n"," res = model.infer(tokenizer,\n"," prompt=prompt,\n"," image_file=image_path, # Use the uploaded image path\n"," output_path=output_path,\n"," base_size=1024,\n"," image_size=640,\n"," crop_mode=True,\n"," save_results=True,\n"," test_compress=True)\n","\n"," # The infer method with save_results=True saves the output to output_path\n"," # You might need to adjust how to retrieve or confirm the saved result\n"," # For this example, we'll just note that it was processed.\n"," results[filename] = f\"Processed. Output saved to {output_path}\"\n"," print(f\"✓ {filename} processed successfully. 
Output saved to {output_path}\")\n","\n"," except Exception as e:\n"," print(f\"✗ Error processing {filename}: {str(e)}\")\n"," results[filename] = f\"Error: {str(e)}\"\n","\n","# Display all results (or confirmation of processing)\n","print(\"\\n\" + \"=\" * 80)\n","print(\"BATCH PROCESSING SUMMARY\")\n","print(\"=\" * 80)\n","\n","for filename, result in results.items():\n"," print(f\"\\n--- {filename} ---\")\n"," print(result)\n"," print()\n","\n","print(f\"\\nDetailed results are saved in the directory: {output_path}\")\n","\n","# Note: Downloading the batch results as a single file might require\n","# zipping the output directory or iterating through saved files.\n","# This part is commented out as model.infer handles saving.\n","# with open('batch_results.txt', 'w', encoding='utf-8') as f:\n","# for filename, result in results.items():\n","# f.write(f\"{'='*80}\\n\")\n","# f.write(f\"File: {filename}\\n\")\n","# f.write(f\"{'='*80}\\n\")\n","# f.write(result)\n","# f.write(f\"\\n\\n\")\n","#\n","# files.download('batch_results.txt')"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"b533ce66","executionInfo":{"status":"ok","timestamp":1761052768625,"user_tz":-180,"elapsed":10037,"user":{"displayName":"hg ahcz","userId":"17954928916181846033"}},"outputId":"ad79b0f6-cbc4-46cd-fce1-9d542a0d3af4"},"source":["from transformers import AutoModel, AutoTokenizer\n","import torch\n","import os\n","\n","print(\"Loading DeepSeek-OCR model for batch processing...\")\n","\n","os.environ[\"CUDA_VISIBLE_DEVICES\"] = '0'\n","model_name = 'deepseek-ai/DeepSeek-OCR'\n","\n","tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)\n","# Removing attn_implementation='flash_attention_2' as a troubleshooting step\n","model = AutoModel.from_pretrained(model_name, trust_remote_code=True, use_safetensors=True)\n","model = model.eval().cuda().to(torch.bfloat16)\n","\n","print(\"Model loaded successfully for batch processing!\")"],"execution_count":23,"outputs":[{"output_type":"stream","name":"stdout","text":["Loading DeepSeek-OCR model for batch processing...\n"]},{"output_type":"stream","name":"stderr","text":["You are using a model of type deepseek_vl_v2 to instantiate a model of type DeepseekOCR. This is not supported for all configurations of models and can yield errors.\n","Some weights of DeepseekOCRForCausalLM were not initialized from the model checkpoint at deepseek-ai/DeepSeek-OCR and are newly initialized: ['model.vision_model.embeddings.position_ids']\n","You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"]},{"output_type":"stream","name":"stdout","text":["Model loaded successfully for batch processing!\n"]}]},{"cell_type":"markdown","metadata":{"id":"troubleshooting-header"},"source":["## Troubleshooting\n","\n","### Common Issues:\n","\n","1. **Out of Memory (OOM):**\n"," - Use a higher-tier GPU (A100, V100)\n"," - Reduce image resolution before processing\n"," - Enable gradient checkpointing\n","\n","2. **Flash Attention Installation Fails:**\n"," - Try removing `attn_implementation='flash_attention_2'` parameter\n"," - Fallback to standard attention mechanism\n","\n","3. **Model Download Slow:**\n"," - This is normal for large models (may take 10-15 minutes)\n"," - Model is cached after first download\n","\n","4. 
**Image Format Issues:**\n"," - Ensure image is in RGB format\n"," - Convert: `img = img.convert('RGB')`\n","\n","### Performance Tips:\n","\n","- Use images close to native resolutions: 512×512, 640×640, 1024×1024, 1280×1280\n","- For faster inference, use half precision such as `torch.bfloat16` (already enabled in the model-loading cell)\n","- Batch processing is more efficient for multiple images"]},{"cell_type":"markdown","metadata":{"id":"cleanup-header"},"source":["## Cleanup (Optional)\n","\n","Free up GPU memory when done."]},{"cell_type":"code","execution_count":21,"metadata":{"id":"cleanup","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1761052484637,"user_tz":-180,"elapsed":178,"user":{"displayName":"hg ahcz","userId":"17954928916181846033"}},"outputId":"80686745-04cb-4a6f-f2a6-f0ef2b5afa66"},"outputs":[{"output_type":"stream","name":"stdout","text":["GPU memory cleared\n"]}],"source":["# Clear GPU memory\n","import gc\n","\n","del model\n","del tokenizer\n","gc.collect()\n","torch.cuda.empty_cache()\n","\n","print(\"GPU memory cleared\")"]}],"metadata":{"accelerator":"GPU","colab":{"gpuType":"L4","provenance":[]},"kernelspec":{"display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.10.0"}},"nbformat":4,"nbformat_minor":0}