DeepSeek-OCR Google Colab Notebook
A ready-to-use Google Colab notebook for running DeepSeek-OCR, a state-of-the-art optical character recognition model that converts images and documents to markdown format with high accuracy.
Quick Start
Open in Google Colab
Click the badge below to open the notebook directly in Google Colab:
Or download the notebook from this repository and upload it to Google Colab manually.
Steps:
- Click the "Open in Colab" badge above
- Select Runtime → Change runtime type → GPU (T4 or better recommended)
- Run all cells sequentially
- Upload your image when prompted
- Get markdown-formatted text output
Features
- Easy Setup: One-click deployment on Google Colab
- GPU Acceleration: Optimized for NVIDIA GPUs (T4, L4, A100, V100)
- Flexible Processing: Single image or batch processing support
- High Quality OCR: Converts documents to markdown with text detection and grounding
- Multiple Resolution Modes: Tiny, Small, Base, Large, and Gundam (cropped) modes
- Real-time Preview: View uploaded images before processing
Requirements
For Google Colab:
- GPU Runtime (T4 or better recommended)
- ~15-20 minutes setup time
- ~23GB GPU memory (L4 or equivalent)
For Local Setup:
- NVIDIA GPU with CUDA support (12.1+)
- Python 3.8+
- PyTorch 2.0+
- 22GB+ GPU VRAM
Usage
Single Image Processing
- Upload your image in the designated cell
- Run the inference cell to process the image
- Download results from the output directory
Example prompt:
prompt = "<image>\n<|grounding|>Convert the document to markdown."
Batch Processing
Process multiple images at once with automatic iteration through uploaded files. Results are saved to the output directory.
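The iteration described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: `IMAGE_SUFFIXES`, `find_images`, `run_batch`, and the `infer_one` callback are hypothetical names, and the callback stands in for whatever inference call the notebook makes for a single image.

```python
from pathlib import Path
from typing import Callable, Dict, List

# Assumed set of extensions to treat as images; adjust as needed.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp", ".bmp"}

PROMPT = "<image>\n<|grounding|>Convert the document to markdown."


def find_images(input_dir: Path) -> List[Path]:
    """Return image files in input_dir, sorted by name for a stable order."""
    return sorted(
        p for p in input_dir.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def run_batch(
    input_dir: Path,
    output_dir: Path,
    infer_one: Callable[[Path, Path, str], None],
) -> Dict[str, str]:
    """Call infer_one(image, per_image_output_dir, prompt) for every image.

    A failure on one image is recorded and does not stop the batch.
    """
    results: Dict[str, str] = {}
    for image_path in find_images(Path(input_dir)):
        # One subdirectory per image keeps outputs from overwriting each other.
        target = Path(output_dir) / image_path.stem
        target.mkdir(parents=True, exist_ok=True)
        try:
            infer_one(image_path, target, PROMPT)
            results[image_path.name] = "ok"
        except Exception as exc:  # broad on purpose: keep the batch going
            results[image_path.name] = f"failed: {exc}"
    return results
```

Passing the inference step in as a callback keeps the loop independent of how the model is loaded, so the same loop can be tried with a stub function before a GPU runtime is available.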
Model Configuration
The notebook supports different processing modes:
| Mode | base_size | image_size | crop_mode | Use Case |
|---|---|---|---|---|
| Tiny | 512 | 512 | False | Quick processing, lower quality |
| Small | 640 | 640 | False | Balanced speed/quality |
| Base | 1024 | 1024 | False | Standard quality |
| Large | 1280 | 1280 | False | High quality, slower |
| Gundam | 1024 | 640 | True | Recommended (cropped processing) |
Default Configuration (Recommended):
base_size = 1024
image_size = 640
crop_mode = True
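One way to keep the mode table in code is a small preset lookup. The sketch below only mirrors the values listed above; `MODE_PRESETS` and `mode_kwargs` are hypothetical helper names, not part of the notebook or the model's API.

```python
from typing import Dict, Union

# Values copied from the mode table; the dictionary itself is illustrative.
MODE_PRESETS: Dict[str, Dict[str, Union[int, bool]]] = {
    "tiny": {"base_size": 512, "image_size": 512, "crop_mode": False},
    "small": {"base_size": 640, "image_size": 640, "crop_mode": False},
    "base": {"base_size": 1024, "image_size": 1024, "crop_mode": False},
    "large": {"base_size": 1280, "image_size": 1280, "crop_mode": False},
    "gundam": {"base_size": 1024, "image_size": 640, "crop_mode": True},
}


def mode_kwargs(mode: str = "gundam") -> Dict[str, Union[int, bool]]:
    """Return a copy of the settings for a mode name (case-insensitive)."""
    key = mode.strip().lower()
    if key not in MODE_PRESETS:
        valid = ", ".join(sorted(MODE_PRESETS))
        raise ValueError(f"unknown mode {mode!r}; expected one of: {valid}")
    # Copy so callers can tweak values without mutating the presets.
    return dict(MODE_PRESETS[key])
```

The returned dictionary can then be unpacked into whichever call accepts `base_size`, `image_size`, and `crop_mode`.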
Output Format
The model outputs:
- Markdown formatted text with proper heading structure
- Bounding box coordinates for detected elements (<|det|>)
- Element references (<|ref|>) for text, titles, tables, etc.
- Tables converted to markdown format
- Compression ratio metrics for analysis
Example output structure:
<|ref|>text<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>
Extracted text content here...
<|ref|>sub_title<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>
## Heading Text
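For post-processing, output in the structure shown above could be split into labelled elements along these lines. This is a sketch that assumes every element starts with a `<|ref|>…<|/ref|><|det|>…<|/det|>` header on one line and that the box list is valid JSON; `Element` and `parse_grounded_output` are hypothetical names, and the coordinates in `sample` are made up for illustration.

```python
import json
import re
from typing import List, NamedTuple

# Matches one element header: label inside <|ref|>, box list inside <|det|>.
HEADER_RE = re.compile(
    r"<\|ref\|>(?P<label>.*?)<\|/ref\|><\|det\|>(?P<boxes>.*?)<\|/det\|>"
)


class Element(NamedTuple):
    label: str              # e.g. "text", "sub_title"
    boxes: List[List[int]]  # one [x1, y1, x2, y2] list per box
    text: str               # content between this header and the next


def parse_grounded_output(raw: str) -> List[Element]:
    """Split raw model output into (label, boxes, text) elements."""
    headers = list(HEADER_RE.finditer(raw))
    elements = []
    for i, match in enumerate(headers):
        # Content runs until the next header, or to the end of the output.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(raw)
        elements.append(
            Element(
                label=match.group("label"),
                boxes=json.loads(match.group("boxes")),
                text=raw[match.end():end].strip(),
            )
        )
    return elements


sample = (
    "<|ref|>text<|/ref|><|det|>[[10, 20, 300, 40]]<|/det|>\n"
    "Extracted text content here...\n"
    "<|ref|>sub_title<|/ref|><|det|>[[10, 60, 300, 90]]<|/det|>\n"
    "## Heading Text\n"
)
elements = parse_grounded_output(sample)
```

Joining the `text` fields of the parsed elements gives the markdown without the grounding tags.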
Troubleshooting
Out of Memory (OOM)
- Use a higher-tier GPU (A100, V100)
- Reduce image resolution before processing
- Use smaller processing modes (Tiny or Small)
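For the "reduce image resolution" option, the target size can be computed so that the aspect ratio is preserved. The helper below is a sketch; `fit_within` is a hypothetical name, and the choice of `max_side` is left to the reader.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_side: int) -> Tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_side.

    Images that already fit are returned unchanged (never upscaled).
    """
    if width <= 0 or height <= 0 or max_side <= 0:
        raise ValueError("width, height and max_side must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    # Multiply before dividing to limit floating-point rounding error.
    new_width = max(1, round(width * max_side / longest))
    new_height = max(1, round(height * max_side / longest))
    return new_width, new_height
```

With Pillow, the result can be passed to `Image.resize`, for example `img.resize(fit_within(*img.size, 1024))`.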
Flash Attention Installation Fails
- The notebook removes attn_implementation='flash_attention_2' by default
- Standard attention mechanism is used as fallback
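An alternative to removing the argument is to choose the attention backend at runtime. The sketch below is not what the notebook does: `pick_attn_implementation` is a hypothetical helper, it assumes the Flash Attention package is importable as `flash_attn`, and it assumes `"eager"` is the desired fallback value.

```python
import importlib.util


def pick_attn_implementation(module_name: str = "flash_attn") -> str:
    """Return "flash_attention_2" if the module is installed, else "eager".

    Only checks that the package can be found; it does not verify that the
    installed build matches the current GPU or CUDA version.
    """
    if importlib.util.find_spec(module_name) is not None:
        return "flash_attention_2"
    return "eager"
```

The returned string could then be passed as the `attn_implementation` value when loading the model.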
Model Download Slow
- First download takes 10-15 minutes (normal)
- Model is cached after first download
- Check your Colab internet connection
Image Format Issues
# Ensure RGB format
from PIL import Image
img = Image.open('image.png').convert('RGB')
Performance Tips
- Image Resolution: Use native resolutions (512, 640, 1024, 1280) for best results
- Batch Processing: More efficient for multiple images
- GPU Selection: L4 or better recommended for faster processing
- Compression: Enable test_compress=True to see compression metrics
Repository Files
- DeepSeek_OCR_Colab.ipynb: Main Google Colab notebook
- requirements.txt: Python dependencies
- LICENSE: MIT License
- README.md: This documentation
Credits
Based on the official DeepSeek-OCR repository:
- Repository: deepseek-ai/DeepSeek-OCR
- Model: deepseek-ai/DeepSeek-OCR on HuggingFace
- Paper: DeepSeek-OCR: Contexts Optical Compression
License
This project is licensed under the MIT License - see the LICENSE file for details.
The DeepSeek-OCR model itself is subject to its own license terms from DeepSeek AI.
Links
- Hugging Face Model: https://huggingface.co/ahczhg/DeepSeek-OCR-Colab
- Hugging Face Space: https://huggingface.co/spaces/ahczhg/DeepSeek-OCR-Colab
Citation
If you use this notebook in your research, please cite the original DeepSeek-OCR paper:
@article{deepseek2025ocr,
title={DeepSeek-OCR: Contexts Optical Compression},
author={Wei, Haoran and Sun, Yaofeng and Li, Yukun},
year={2025}
}
Note: This is a community-contributed notebook wrapper for the DeepSeek-OCR model. For the official model and implementation, please visit the DeepSeek-OCR repository.