Committed by Yaz Hobooti
Commit 37c62cf · 1 Parent(s): 6e39e00

Update to Gradio-based PDF comparison tool with advanced features

- Replace Flask app with modern Gradio interface
- Add comprehensive PDF analysis features:
  - Visual difference detection with bounding boxes
  - OCR and spell checking capabilities
  - Barcode/QR code detection and validation
  - CMYK color analysis for print workflows
- Update dependencies for compatibility
- Add detailed README for Hugging Face Space

ProofCheck/README.md CHANGED
@@ -1,117 +1,69 @@
  ---
  title: PDF Comparison Tool
- emoji: 📄
+ emoji: 🔍
  colorFrom: blue
  colorTo: purple
- sdk: docker
+ sdk: gradio
+ sdk_version: 5.44.1
+ app_file: pdf_comparator.py
  pinned: false
  license: mit
+ short_description: Advanced PDF comparison tool with visual differences, OCR, barcodes, and CMYK analysis
  ---

- # PDF Comparison Tool
+ # 🔍 Advanced PDF Comparison Tool

- A comprehensive web-based tool for comparing PDF documents with advanced features including OCR validation, color difference detection, spelling verification, and barcode/QR code detection.
+ Upload two PDF files to get comprehensive analysis including:
+ - **Visual differences** with bounding boxes
+ - **OCR and spell checking**
+ - **Barcode/QR code detection**
+ - **CMYK color analysis**

- ## 🚀 Live Demo
+ ## Features

- This tool is deployed on Hugging Face Spaces and available for immediate use!
+ ### Visual Analysis
+ - Pixel-level difference detection
+ - Bounding box visualization for changes
+ - Red overlay highlighting differences

- ## ✨ Features
+ ### OCR & Text Analysis
+ - Automatic text extraction from PDFs
+ - Spell checking with multi-language support
+ - Misspelling detection with visual indicators

- - **PDF Validation**: Ensures uploaded PDFs contain "50 Carroll" using OCR
- - **Color Difference Detection**: Identifies visual differences between PDFs and highlights them with red boxes
- - **Spelling Verification**: Checks text against both English and French dictionaries
- - **Barcode/QR Code Detection**: Automatically detects and reads barcodes and QR codes
- - **Visual Comparison**: Side-by-side comparison with annotated differences
- - **Modern Web Interface**: Responsive design with Bootstrap and custom styling
+ ### Barcode Detection
+ - QR code and barcode recognition
+ - Multiple symbology support (EAN, UPC, DataBar, etc.)
+ - Validation and data extraction

- ## 📋 Requirements
+ ### Print Workflow Support
+ - CMYK color analysis for print workflows
+ - Color difference quantification
+ - Print-ready color breakdowns

- - Both PDF files must contain the text "50 Carroll" for validation
- - Maximum file size: 16MB per PDF
- - Supported format: PDF only
+ ## Usage

- ## 🎯 How to Use
+ 1. Upload two PDF files using the file inputs
+ 2. Click "Compare PDF Files" to start analysis
+ 3. View results with comprehensive visualizations
+ 4. Check barcode detection results in the data tables

- 1. **Upload PDFs**: Select two PDF files for comparison
- 2. **Validation**: The tool automatically checks for "50 Carroll" in both documents
- 3. **Processing**: Wait for the analysis to complete (may take a few minutes)
- 4. **Results**: View findings in three organized tabs:
- - **Visual Comparison**: Side-by-side view with red boxes highlighting differences
- - **Spelling Issues**: Table of spelling errors with suggestions from English and French dictionaries
- - **Barcodes & QR Codes**: List of detected barcodes with their data and positions
+ ## Color Legend

- ## 🔧 Technical Details
+ - **🔴 Red boxes:** Visual differences between files
+ - **🔵 Cyan boxes:** Potential spelling errors (OCR)
+ - **🟢 Green boxes:** Detected barcodes/QR codes
+ - **📊 Side panel:** CMYK color analysis for print workflows

- ### Backend Technologies
- - **Python Flask**: Web framework
- - **OpenCV**: Image processing and comparison
- - **Tesseract OCR**: Text extraction from PDFs
- - **scikit-image**: Structural similarity analysis
- - **pyspellchecker**: Spelling verification
- - **pyzbar**: Barcode and QR code detection
+ ## Technical Details

- ### Frontend Technologies
- - **HTML5/CSS3**: Modern responsive design
- - **JavaScript**: Dynamic content and AJAX requests
- - **Bootstrap**: UI framework for professional appearance
+ Built with:
+ - Gradio for the web interface
+ - OpenCV and PIL for image processing
+ - Tesseract for OCR
+ - PyZbar for barcode detection
+ - Scikit-image for advanced image analysis

- ### Comparison Algorithms
- - **Color Difference**: Uses Structural Similarity Index (SSIM) for pixel-level comparison
- - **Text Analysis**: OCR-based text extraction with multi-language spell checking
- - **Barcode Detection**: Automatic recognition of various barcode and QR code formats
+ ## License

- ## 🛠️ Local Development
-
- If you want to run this tool locally:
-
- ```bash
- # Clone the repository
- git clone https://huggingface.co/spaces/Digitaljoint/ProofCheck
-
- # Install dependencies
- pip install -r requirements.txt
-
- # Install Tesseract OCR
- # macOS: brew install tesseract
- # Ubuntu: sudo apt-get install tesseract-ocr
-
- # Run the application
- python app.py
- ```
-
- ## 📊 Output Examples
-
- ### Visual Comparison
- - Red rectangles highlight color differences between PDFs
- - Side-by-side view for easy comparison
- - Page-by-page analysis
-
- ### Spelling Issues
- - Word-by-word analysis against English and French dictionaries
- - Spelling suggestions for both languages
- - Organized table format with original text and corrections
-
- ### Barcode/QR Code Detection
- - Automatic detection of various barcode formats
- - Extracted data display
- - Position information for each detected code
-
- ## 🔒 Privacy & Security
-
- - All processing happens locally on the server
- - No data is stored permanently
- - Files are automatically cleaned up after processing
- - No external API calls or data sharing
-
- ## 🤝 Contributing
-
- This tool is open source and contributions are welcome! Please feel free to submit issues or pull requests.
-
- ## 📄 License
-
- This project is available under the MIT License.
-
- ---
-
- **Note**: This tool is specifically designed to validate PDFs containing "50 Carroll" and will reject files that don't contain this text. This ensures that only relevant documents are processed for comparison.
+ MIT License - feel free to use and modify for your needs.

ProofCheck/pdf_comparator.py CHANGED
The diff for this file is too large to render. See raw diff
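
Because the diff for this file is not rendered, the following is only an illustrative sketch of the "pixel-level difference detection" with a bounding box that the README describes. It is not the code from this commit: the function name `diff_bounding_box`, the `threshold` parameter, and the nested-list grayscale representation are all assumptions made for the example. The README lists OpenCV, PIL, and scikit-image for image processing, so a real implementation would presumably work on arrays from those libraries instead.

```python
from typing import List, Optional, Tuple

Gray = List[List[int]]  # rows of 0-255 grayscale values (assumed representation)


def diff_bounding_box(
    a: Gray, b: Gray, threshold: int = 16
) -> Optional[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) enclosing every pixel that differs
    by more than `threshold`, or None when no pixel differs that much."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same dimensions")

    xs, ys = [], []
    for y, (row_a, row_b) in enumerate(zip(a, b)):
        for x, (pa, pb) in enumerate(zip(row_a, row_b)):
            if abs(pa - pb) > threshold:
                xs.append(x)
                ys.append(y)

    if not xs:
        return None
    # Inclusive pixel coordinates of the smallest enclosing rectangle.
    return min(xs), min(ys), max(xs), max(ys)


# Two small 5x4 "pages"; the second has a changed 2x2 patch.
page_a = [[255] * 5 for _ in range(4)]
page_b = [row[:] for row in page_a]
for y in (1, 2):
    for x in (2, 3):
        page_b[y][x] = 0

print(diff_bounding_box(page_a, page_b))  # (2, 1, 3, 2)
```

This sketch returns a single rectangle for the whole page. Reporting one box per changed region, as the README's "red boxes" suggest, would additionally need some form of connected-component grouping.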
 
ProofCheck/requirements.txt CHANGED
@@ -1,20 +1,8 @@
- Flask==2.3.3
- Werkzeug==2.3.7
- PyPDF2==3.0.1
- pdf2image==1.16.3
- Pillow==10.0.1
- opencv-python==4.8.1.78
- pytesseract==0.3.10
- pyzbar==0.1.9
- pyspellchecker==0.7.2
- nltk==3.8.1
- numpy==1.24.3
- scikit-image==0.21.0
- matplotlib==3.7.2
- pandas==2.0.3
- reportlab==4.0.4
- python-barcode==0.15.1
- zxing-cpp==2.0.0
- dbr==9.6.30
- PyMuPDF==1.23.8
- regex==2023.10.3
+ gradio>=4.0.0
+ pdf2image>=1.16.0
+ Pillow>=9.0.0
+ pytesseract>=0.3.10
+ pyzbar>=0.1.9
+ pyspellchecker>=0.7.0
+ numpy>=1.21.0
+ scikit-image>=0.19.0
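
The requirements change above replaces exact pins (`==`) with minimum bounds (`>=`). As a rough illustration of what that loosening means, the sketch below checks a version string against a `>=` or `==` requirement line. The helper names `parse_requirement`, `version_tuple`, and `satisfies` are hypothetical. The comparison handles only plain dotted numeric versions, whereas real installers follow the full PEP 440 rules.

```python
from typing import Tuple


def parse_requirement(line: str) -> Tuple[str, str, str]:
    """Split 'name>=1.2.3' or 'name==1.2.3' into (name, operator, version)."""
    for op in (">=", "=="):
        if op in line:
            name, version = line.split(op, 1)
            return name.strip(), op, version.strip()
    raise ValueError(f"unsupported requirement: {line!r}")


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '5.44.1' into (5, 44, 1); numeric dotted versions only."""
    return tuple(int(part) for part in version.split("."))


def satisfies(installed: str, requirement: str) -> bool:
    """Simplified check: tuple comparison, not full PEP 440 semantics."""
    _, op, wanted = parse_requirement(requirement)
    have, want = version_tuple(installed), version_tuple(wanted)
    return have >= want if op == ">=" else have == want


# The Space's sdk_version (5.44.1) falls inside the new minimum bound.
print(satisfies("5.44.1", "gradio>=4.0.0"))  # True
# An exact pin accepts only that one release.
print(satisfies("2.3.2", "Flask==2.3.3"))    # False
```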