ahczhg committed (verified)
Commit 721245a · 1 Parent(s): 09393ef

Upload README.md with huggingface_hub

Files changed (1): README.md (new file, +185 lines)

---
license: mit
tags:
- ocr
- document-processing
- computer-vision
- deepseek
- colab
- jupyter
- optical-character-recognition
- text-detection
- document-to-markdown
- notebook
library_name: transformers
pipeline_tag: image-to-text
---

# DeepSeek-OCR Google Colab Notebook

A ready-to-use Google Colab notebook for running DeepSeek-OCR, a state-of-the-art optical character recognition model that converts images and documents to markdown format with high accuracy.

## 🚀 Quick Start

### Open in Google Colab

Click the badge below to open the notebook directly in Google Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ahczhg/DeepSeek-OCR-Colab/blob/main/DeepSeek_OCR_Colab.ipynb)

**Or download the notebook from this repository and upload it to Google Colab manually.**

### Steps:
1. Click the "Open in Colab" badge above
2. Select **Runtime → Change runtime type → GPU** (T4 or better recommended)
3. Run all cells sequentially
4. Upload your image when prompted
5. Get markdown-formatted text output

## ✨ Features

- **Easy Setup**: One-click deployment on Google Colab
- **GPU Acceleration**: Optimized for NVIDIA GPUs (T4, L4, A100, V100)
- **Flexible Processing**: Single image or batch processing support
- **High-Quality OCR**: Converts documents to markdown with text detection and grounding
- **Multiple Resolution Modes**: Tiny, Small, Base, Large, and Gundam (cropped) modes
- **Real-time Preview**: View uploaded images before processing

## 📋 Requirements

### For Google Colab:
- GPU Runtime (T4 or better recommended)
- ~15-20 minutes setup time
- ~23GB GPU memory (L4 or equivalent)

### For Local Setup:
- NVIDIA GPU with CUDA 12.1+ support
- Python 3.8+
- PyTorch 2.0+
- 22GB+ GPU VRAM (a quick environment check is sketched below)
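
To verify that a local machine (or the assigned Colab runtime) actually meets these requirements, a quick check along the following lines can be run first. This is a minimal sketch using standard PyTorch CUDA utilities; it is not a cell from the notebook, and the 22 GB threshold simply mirrors the VRAM figure listed above.

```python
import torch

# Confirm a CUDA-capable GPU is visible to PyTorch
if not torch.cuda.is_available():
    raise SystemExit("No CUDA GPU detected - switch the Colab runtime to GPU or install CUDA drivers.")

props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB, CUDA: {torch.version.cuda}")

# The notebook expects roughly 22 GB of VRAM for the default configuration
if vram_gb < 22:
    print("Warning: less than 22 GB of VRAM; consider a smaller resolution mode.")
```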

## 💡 Usage

### Single Image Processing

1. **Upload your image** in the designated cell
2. **Run the inference cell** to process the image
3. **Download results** from the output directory

Example prompt:
```python
prompt = "<image>\n<|grounding|>Convert the document to markdown."
```
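
For orientation, the inference cell follows the usage pattern shown on the official DeepSeek-OCR model card. The sketch below approximates such a cell; `document.png` and `./output` are placeholder paths, and the exact `model.infer(...)` keyword arguments may differ slightly from what the notebook uses.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = 'deepseek-ai/DeepSeek-OCR'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# No flash_attention_2 here: the notebook falls back to standard attention (see Troubleshooting)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

prompt = "<image>\n<|grounding|>Convert the document to markdown."

# Gundam mode (base_size=1024, image_size=640, crop_mode=True) is the recommended default
result = model.infer(
    tokenizer,
    prompt=prompt,
    image_file='document.png',   # placeholder: path to the uploaded image
    output_path='./output',      # results are written here
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=True,
    test_compress=True,          # prints compression ratio metrics
)
```

The results are written to the output directory; see the Output Format section below for what they contain.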

### Batch Processing

Process multiple images at once with automatic iteration through uploaded files. Results are saved to the output directory.
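
A batch cell can simply loop the same `model.infer(...)` call over every uploaded file. The sketch below assumes the `model`, `tokenizer`, and `prompt` objects from the single-image example above, and uses `./uploads` and `./output` as placeholder directories:

```python
from pathlib import Path

upload_dir = Path('./uploads')   # placeholder: directory containing the uploaded images
output_dir = Path('./output')
output_dir.mkdir(exist_ok=True)

for image_path in sorted(upload_dir.glob('*')):
    if image_path.suffix.lower() not in {'.png', '.jpg', '.jpeg'}:
        continue  # skip non-image files
    # one sub-directory of results per input image
    result_dir = output_dir / image_path.stem
    result_dir.mkdir(exist_ok=True)
    model.infer(tokenizer, prompt=prompt, image_file=str(image_path),
                output_path=str(result_dir), base_size=1024, image_size=640,
                crop_mode=True, save_results=True)
```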

## ⚙️ Model Configuration

The notebook supports different processing modes:

| Mode | base_size | image_size | crop_mode | Use Case |
|------|-----------|------------|-----------|----------|
| Tiny | 512 | 512 | False | Quick processing, lower quality |
| Small | 640 | 640 | False | Balanced speed/quality |
| Base | 1024 | 1024 | False | Standard quality |
| Large | 1280 | 1280 | False | High quality, slower |
| Gundam | 1024 | 640 | True | Recommended (cropped processing) |

**Default Configuration (Recommended):**
```python
base_size = 1024
image_size = 640
crop_mode = True
```
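
One convenient way to switch between the modes in the table is a small lookup table passed to the inference call. This is an illustrative helper, not code from the notebook:

```python
# (base_size, image_size, crop_mode) per mode, matching the table above
MODES = {
    'tiny':   (512,  512,  False),
    'small':  (640,  640,  False),
    'base':   (1024, 1024, False),
    'large':  (1280, 1280, False),
    'gundam': (1024, 640,  True),   # recommended default
}

base_size, image_size, crop_mode = MODES['gundam']
```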

## 📤 Output Format

The model outputs:
- **Markdown formatted text** with proper heading structure
- **Bounding box coordinates** for detected elements (`<|det|>`)
- **Element references** (`<|ref|>`) for text, titles, tables, etc.
- **Tables** converted to markdown format
- **Compression ratio** metrics for analysis

Example output structure:
```
<|ref|>text<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>
Extracted text content here...

<|ref|>sub_title<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>
## Heading Text
```
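
If the grounded output needs further post-processing, the `<|ref|>`/`<|det|>` tags can be extracted with a regular expression. The sketch below assumes the tag layout shown above, with integer pixel coordinates and a single box per reference:

```python
import re

# Matches "<|ref|>LABEL<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>" pairs
TAG_RE = re.compile(r'<\|ref\|>(.*?)<\|/ref\|><\|det\|>\[\[(.*?)\]\]<\|/det\|>')

def extract_regions(raw_output: str):
    """Return a list of (label, [x1, y1, x2, y2]) tuples from grounded OCR output."""
    regions = []
    for label, coords in TAG_RE.findall(raw_output):
        box = [int(v) for v in coords.split(',')]
        regions.append((label.strip(), box))
    return regions

sample = '<|ref|>sub_title<|/ref|><|det|>[[35, 61, 574, 83]]<|/det|>\n## Heading Text'
print(extract_regions(sample))  # [('sub_title', [35, 61, 574, 83])]
```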

## 🛠️ Troubleshooting

### Out of Memory (OOM)
- Use a higher-tier GPU (A100, V100)
- Reduce image resolution before processing
- Use a smaller processing mode (Tiny or Small)

### Flash Attention Installation Fails
- The notebook removes `attn_implementation='flash_attention_2'` by default
- The standard attention mechanism is used as a fallback

### Model Download Slow
- The first download takes 10-15 minutes (this is normal)
- The model is cached after the first download
- Check your Colab internet connection

### Image Format Issues
```python
# Ensure the image is in RGB format before passing it to the model
from PIL import Image

img = Image.open('image.png').convert('RGB')
```

## 📊 Performance Tips

1. **Image Resolution**: Use native resolutions (512, 640, 1024, 1280) for best results (see the resize sketch below)
2. **Batch Processing**: More efficient for multiple images
3. **GPU Selection**: L4 or better recommended for faster processing
4. **Compression**: Enable `test_compress=True` to see compression metrics
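
For tip 1, an image can be rescaled so that its longer side matches one of the native resolutions before uploading it. A minimal sketch using Pillow (the target of 1024 is just an example):

```python
from PIL import Image

NATIVE_SIZES = (512, 640, 1024, 1280)

def resize_to_native(path: str, target: int = 1024) -> Image.Image:
    """Scale the image so its longer side equals a native resolution, keeping aspect ratio."""
    assert target in NATIVE_SIZES
    img = Image.open(path).convert('RGB')
    scale = target / max(img.size)
    new_size = (round(img.width * scale), round(img.height * scale))
    return img.resize(new_size, Image.LANCZOS)

resize_to_native('document.png').save('document_1024.png')
```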

## 📁 Repository Files

- `DeepSeek_OCR_Colab.ipynb` - Main Google Colab notebook
- `requirements.txt` - Python dependencies
- `LICENSE` - MIT License
- `README.md` - This documentation

## 🙏 Credits

Based on the official DeepSeek-OCR repository:
- **Repository**: [deepseek-ai/DeepSeek-OCR](https://github.com/deepseek-ai/DeepSeek-OCR)
- **Model**: [deepseek-ai/DeepSeek-OCR on HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-OCR)
- **Paper**: DeepSeek-OCR: High-Accuracy Document OCR

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

The DeepSeek-OCR model itself is subject to its own license terms from DeepSeek AI.

## 🔗 Links

- **Hugging Face Model**: [https://huggingface.co/ahczhg/DeepSeek-OCR-Colab](https://huggingface.co/ahczhg/DeepSeek-OCR-Colab)
- **Hugging Face Space**: [https://huggingface.co/spaces/ahczhg/DeepSeek-OCR-Colab](https://huggingface.co/spaces/ahczhg/DeepSeek-OCR-Colab)
- **Open in Colab**: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ahczhg/DeepSeek-OCR-Colab/blob/main/DeepSeek_OCR_Colab.ipynb)

## 📖 Citation

If you use this notebook in your research, please cite the original DeepSeek-OCR paper:

```bibtex
@article{deepseek2024ocr,
  title={DeepSeek-OCR: High-Accuracy Document OCR},
  author={DeepSeek AI},
  year={2024}
}
```

---

**Note**: This is a community-contributed notebook wrapper for the DeepSeek-OCR model. For the official model and implementation, please visit the [DeepSeek-OCR repository](https://github.com/deepseek-ai/DeepSeek-OCR).