---
license: mit
tags:
  - ocr
  - document-processing
  - computer-vision
  - deepseek
  - colab
  - jupyter
  - optical-character-recognition
  - text-detection
  - document-to-markdown
  - notebook
library_name: transformers
pipeline_tag: image-to-text
---

# DeepSeek-OCR Google Colab Notebook

A ready-to-use Google Colab notebook for running DeepSeek-OCR, DeepSeek AI's optical character recognition (OCR) model, which converts images of documents into Markdown text.

## 🚀 Quick Start

### Open in Google Colab

Click the badge below to open the notebook directly in Google Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ahczhg/DeepSeek-OCR-Colab/blob/main/DeepSeek_OCR_Colab.ipynb)

**Alternatively, download the notebook from this repository and upload it to Google Colab manually.**

### Steps:

1. Click the "Open in Colab" badge above
2. Select **Runtime → Change runtime type** and choose a GPU hardware accelerator (T4 or better recommended)
3. Run all cells in order
4. Upload your image when prompted
5. Collect the Markdown-formatted text output

## ✨ Features

- **Easy Setup**: One-click launch in Google Colab
- **GPU Acceleration**: Runs on NVIDIA GPUs (T4, L4, A100, V100)
- **Flexible Processing**: Single-image or batch processing
- **High-Quality OCR**: Converts documents to Markdown, with text detection and grounding
- **Multiple Resolution Modes**: Tiny, Small, Base, Large, and Gundam (cropped) modes
- **Image Preview**: View uploaded images before processing

## 📋 Requirements

### For Google Colab:

- GPU runtime (T4 or better recommended)
- ~15-20 minutes of setup time
- ~23 GB of GPU memory (an L4 runtime or equivalent)

### For Local Setup:

- NVIDIA GPU with CUDA 12.1+ support
- Python 3.8+
- PyTorch 2.0+
- 22 GB+ of GPU VRAM
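
To check whether a runtime meets the memory figures above, you can query `nvidia-smi`. The sketch below is hypothetical (`parse_gpu_memory`, `has_enough_memory` and `query_gpus` are illustrative names, not part of the notebook); it assumes the CSV output of `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits`, which reports memory in MiB, and uses a threshold taken from the 22 GB+ local-setup requirement.

```python
import subprocess
from typing import Iterable, List, Tuple

# Hypothetical threshold from the local-setup requirement (22 GB+), in MiB as nvidia-smi reports it.
REQUIRED_MIB = 22 * 1024


def parse_gpu_memory(csv_text: str) -> List[Tuple[str, int]]:
    """Parse 'name, memory.total' lines (noheader, nounits CSV) into (name, MiB) pairs."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, total = line.rsplit(",", 1)  # split on the last comma only
        gpus.append((name.strip(), int(total)))
    return gpus


def has_enough_memory(gpus: Iterable[Tuple[str, int]], required_mib: int = REQUIRED_MIB) -> bool:
    """Return True if any listed GPU has at least required_mib of total memory."""
    return any(total >= required_mib for _, total in gpus)


def query_gpus() -> List[Tuple[str, int]]:
    """Ask nvidia-smi for each GPU's name and total memory (fails if nvidia-smi is absent)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory(result.stdout)
```

In a notebook cell, `has_enough_memory(query_gpus())` gives a quick yes/no before the model download starts; on a machine without the NVIDIA driver, `query_gpus` raises `FileNotFoundError` because `nvidia-smi` cannot be found.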

## 💡 Usage

### Single Image Processing

1. **Upload your image** in the designated cell
2. **Run the inference cell** to process the image
3. **Download the results** from the output directory

Example prompt:

```python
prompt = "<image>\n<|grounding|>Convert the document to markdown."
```
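
The example prompt has three parts: the `<image>` placeholder, the `<|grounding|>` token, and a plain-language instruction. A minimal sketch that assembles it, assuming (from the Features and Output Format sections) that `<|grounding|>` is what requests the location annotations; `build_prompt` is a hypothetical helper, and no instructions beyond the one shown above are implied.

```python
# Hypothetical helper; the layout mirrors the example prompt above.
IMAGE_TOKEN = "<image>"
GROUNDING_TOKEN = "<|grounding|>"


def build_prompt(instruction: str, grounding: bool = True) -> str:
    """Assemble: image placeholder, newline, optional grounding token, instruction."""
    prompt = IMAGE_TOKEN + "\n"
    if grounding:
        # Assumed to request <|ref|>/<|det|> annotations in the output.
        prompt += GROUNDING_TOKEN
    return prompt + instruction


prompt = build_prompt("Convert the document to markdown.")
```

Keeping the tokens in constants avoids typos in the special-token strings when you edit the instruction text.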

### Batch Processing

Process multiple images in one run: the notebook iterates through all uploaded files and saves each result to the output directory.
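
A sketch of that loop, assuming the uploaded files sit in one directory and each result is written as a Markdown file named after its image. `find_images`, `process_batch`, the extension list, and the `run_ocr` callable (standing in for the notebook's single-image inference step) are all hypothetical.

```python
from pathlib import Path
from typing import Callable, Dict, List, Union

# Extensions treated as images (an assumption; extend as needed).
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp", ".bmp", ".tif", ".tiff"}


def find_images(input_dir: Union[str, Path]) -> List[Path]:
    """Return the image files in input_dir, sorted so the processing order is stable."""
    return sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


def process_batch(
    input_dir: Union[str, Path],
    output_dir: Union[str, Path],
    run_ocr: Callable[[Path], str],
) -> Dict[str, Path]:
    """Run run_ocr on each image and save its text to <output_dir>/<image stem>.md."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    results = {}
    for image_path in find_images(input_dir):
        target = out / f"{image_path.stem}.md"
        target.write_text(run_ocr(image_path), encoding="utf-8")
        results[image_path.name] = target
    return results
```

Passing the OCR step in as a callable keeps the file handling testable without a GPU. Images that share a stem (for example `page.png` and `page.jpg`) would overwrite each other's output; use `image_path.name` in the output file name if that matters.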

## ⚙️ Model Configuration

The notebook supports the following processing modes (sizes in pixels):

| Mode | base_size | image_size | crop_mode | Use Case |
|------|-----------|------------|-----------|----------|
| Tiny | 512 | 512 | False | Quick processing, lower quality |
| Small | 640 | 640 | False | Balanced speed/quality |
| Base | 1024 | 1024 | False | Standard quality |
| Large | 1280 | 1280 | False | High quality, slower |
| Gundam | 1024 | 640 | True | Recommended (cropped processing) |

**Default Configuration (Recommended):**

```python
base_size = 1024
image_size = 640
crop_mode = True
```
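
The table maps naturally onto a lookup from mode name to the three settings. A minimal sketch; `ModeConfig`, `MODES`, and `mode_kwargs` are hypothetical names, and the keyword names simply mirror the table columns and the default configuration above.

```python
from typing import Dict, NamedTuple


class ModeConfig(NamedTuple):
    base_size: int
    image_size: int
    crop_mode: bool


# Values copied from the table above.
MODES: Dict[str, ModeConfig] = {
    "tiny": ModeConfig(512, 512, False),
    "small": ModeConfig(640, 640, False),
    "base": ModeConfig(1024, 1024, False),
    "large": ModeConfig(1280, 1280, False),
    "gundam": ModeConfig(1024, 640, True),
}


def mode_kwargs(mode: str = "gundam") -> Dict[str, object]:
    """Return base_size, image_size and crop_mode for a mode name (case-insensitive)."""
    try:
        return dict(MODES[mode.lower()]._asdict())
    except KeyError:
        raise ValueError(f"unknown mode {mode!r}; choose from {sorted(MODES)}") from None
```

Unpacking the result, e.g. `**mode_kwargs("small")`, into the notebook's inference call would switch modes in one place, assuming that call accepts these three keyword arguments.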

## 📤 Output Format

The model outputs:

- **Markdown-formatted text** with proper heading structure
- **Bounding box coordinates** for detected elements (`<|det|>`)
- **Element references** (`<|ref|>`) for text, titles, tables, etc.
- **Tables** converted to Markdown format
- **Compression ratio** metrics for analysis (when `test_compress=True`)

Example output structure:

```text
<|ref|>text<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>
Extracted text content here...
<|ref|>sub_title<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>
## Heading Text
```
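
To post-process grounded output, the tags can be split off with a regular expression. A sketch assuming the exact tag layout shown above (`<|ref|>label<|/ref|><|det|>[[...]]<|/det|>` followed by the element's text); `Element`, `parse_grounded_output`, and `strip_grounding` are hypothetical helpers, and box coordinates are returned unchanged, with no assumption about their scale.

```python
import ast
import re
from typing import List, NamedTuple


class Element(NamedTuple):
    label: str              # e.g. "text" or "sub_title"
    boxes: List[List[int]]  # one or more [x1, y1, x2, y2] boxes, as emitted
    content: str            # text between this tag and the next one


# One <|ref|>...<|/ref|><|det|>...<|/det|> pair; group 1 = label, group 2 = box list.
TAG_RE = re.compile(r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>(.*?)<\|/det\|>", re.DOTALL)


def parse_grounded_output(raw: str) -> List[Element]:
    """Split grounded output into (label, boxes, content) elements."""
    matches = list(TAG_RE.finditer(raw))
    elements = []
    for i, match in enumerate(matches):
        # An element's content runs until the next tag (or the end of the output).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(raw)
        boxes = ast.literal_eval(match.group(2).strip())  # literal parsing only, no eval()
        elements.append(Element(match.group(1), boxes, raw[match.end():end].strip()))
    return elements


def strip_grounding(raw: str) -> str:
    """Remove the ref/det tags, keeping only the Markdown text."""
    text = TAG_RE.sub("", raw)
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```

`parse_grounded_output` is useful for drawing boxes or filtering by label (for example keeping only `sub_title` elements), while `strip_grounding` yields plain Markdown for saving.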

## 🛠️ Troubleshooting

### Out of Memory (OOM)

- Switch to a GPU with more memory (e.g., A100)
- Reduce the image resolution before processing
- Use a smaller processing mode (Tiny or Small)
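
The smaller-mode tip can be automated by retrying in a smaller mode when memory runs out. A hypothetical sketch: `run` stands in for a function that processes the image in the given mode, the order starts from the recommended Gundam default and falls back to Small and Tiny, and the exception type is a parameter so the logic can be exercised without a GPU.

```python
from typing import Callable, Sequence, Tuple, Type, TypeVar

T = TypeVar("T")

# Start from the recommended default, then try the smaller modes suggested above.
FALLBACK_ORDER = ("gundam", "small", "tiny")


def run_with_fallback(
    run: Callable[[str], T],
    oom_error: Type[BaseException] = MemoryError,
    modes: Sequence[str] = FALLBACK_ORDER,
) -> Tuple[str, T]:
    """Call run(mode) for each mode in order; on oom_error, move on to the next mode.

    Returns (mode_used, result). Raises RuntimeError if every mode runs out of memory.
    """
    last_error = None
    for mode in modes:
        try:
            return mode, run(mode)
        except oom_error as err:
            last_error = err  # remember the failure and try a smaller mode
    raise RuntimeError(f"out of memory in all modes: {list(modes)}") from last_error
```

With PyTorch you would pass `oom_error=torch.cuda.OutOfMemoryError` (present in PyTorch 2.x); calling `torch.cuda.empty_cache()` between attempts may help release cached memory before the next try. Other exceptions are not caught, so genuine errors still surface.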
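
### Flash Attention Installation Fails

- By default, the notebook loads the model without `attn_implementation='flash_attention_2'`, so Flash Attention does not need to be installed
- The standard attention implementation is used instead

### Model Download Slow

- The first download takes 10-15 minutes; this is normal
- The model is cached after the first download
- Check the Colab runtime's internet connection

### Image Format Issues

Convert images to RGB before processing:

```python
# Ensure RGB format (e.g. for RGBA, grayscale, or palette images)
from PIL import Image

img = Image.open('image.png').convert('RGB')
img.save('image_rgb.png')  # process this RGB copy instead of the original
```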

## 📊 Performance Tips

1. **Image Resolution**: Use the native resolutions (512, 640, 1024, 1280) for best results
2. **Batch Processing**: More efficient when you have multiple images
3. **GPU Selection**: An L4 or better is recommended for faster processing
4. **Compression**: Enable `test_compress=True` to see compression metrics
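
For the first tip, a target size can be chosen by picking the native resolution closest to the image's longest side and scaling the other side proportionally. This is a sketch of the arithmetic only; `native_target` is a hypothetical helper, and whether pre-resizing helps depends on the notebook's own preprocessing, which may already resize images.

```python
from typing import Sequence, Tuple

NATIVE_SIZES = (512, 640, 1024, 1280)  # native resolutions listed in the tip above


def native_target(width: int, height: int, sizes: Sequence[int] = NATIVE_SIZES) -> Tuple[int, int, int]:
    """Return (native_size, new_width, new_height) for an image.

    The native size is the one closest to the longest side (ties go to the
    smaller size); both sides are scaled by the same factor, so the aspect
    ratio is preserved and the longest side becomes native_size.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    size = min(sorted(sizes), key=lambda s: abs(s - longest))
    new_width = max(1, round(width * size / longest))
    new_height = max(1, round(height * size / longest))
    return size, new_width, new_height


size, new_width, new_height = native_target(2480, 3508)
```

The returned dimensions can be passed to PIL's `Image.resize`, e.g. `img.resize((new_width, new_height))`. Picking the closest size, rather than always the largest, limits how much an image is scaled up or down.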

## 📁 Repository Files

- `DeepSeek_OCR_Colab.ipynb` - Main Google Colab notebook
- `requirements.txt` - Python dependencies
- `LICENSE` - MIT License
- `README.md` - This documentation

## 🙏 Credits

Based on the official DeepSeek-OCR repository:

- **Repository**: [deepseek-ai/DeepSeek-OCR](https://github.com/deepseek-ai/DeepSeek-OCR)
- **Model**: [deepseek-ai/DeepSeek-OCR on HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-OCR)
- **Paper**: DeepSeek-OCR: Contexts Optical Compression

## 📄 License

This project is licensed under the MIT License; see the LICENSE file for details.

The DeepSeek-OCR model itself is subject to its own license terms from DeepSeek AI.

## 🔗 Links

- **Hugging Face Model**: [https://huggingface.co/ahczhg/DeepSeek-OCR-Colab](https://huggingface.co/ahczhg/DeepSeek-OCR-Colab)
- **Hugging Face Space**: [https://huggingface.co/spaces/ahczhg/DeepSeek-OCR-Colab](https://huggingface.co/spaces/ahczhg/DeepSeek-OCR-Colab)
- **Open in Colab**: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ahczhg/DeepSeek-OCR-Colab/blob/main/DeepSeek_OCR_Colab.ipynb)

## 📖 Citation

If you use this notebook in your research, please cite the original DeepSeek-OCR paper:

```bibtex
@article{wei2025deepseek,
  title={DeepSeek-OCR: Contexts Optical Compression},
  author={Wei, Haoran and Sun, Yaofeng and Li, Yukun},
  journal={arXiv preprint arXiv:2510.18234},
  year={2025}
}
```

---

**Note**: This is a community-contributed notebook wrapper for the DeepSeek-OCR model. For the official model and implementation, please visit the [DeepSeek-OCR repository](https://github.com/deepseek-ai/DeepSeek-OCR).