kevinconka committed
Commit e23e895 · Parent: c6db24f

first commit, working app

.gitignore ADDED
@@ -0,0 +1,59 @@
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# Virtual environments
+venv/
+env/
+ENV/
+env.bak/
+venv.bak/
+
+# Environment variables
+.env
+.env.local
+.env.development.local
+.env.test.local
+.env.production.local
+
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+
+# OS
+.DS_Store
+.DS_Store?
+._*
+.Spotlight-V100
+.Trashes
+ehthumbs.db
+Thumbs.db
+
+# Logs
+*.log
+
+# Temporary files
+*.tmp
+*.temp
README.md CHANGED
@@ -1,14 +1,106 @@
 ---
-title: pdf2product
-emoji: 🏆
+title: Document Classification
+emoji: 🏷️
 colorFrom: green
-colorTo: yellow
+colorTo: blue
 sdk: gradio
 sdk_version: 5.44.1
 app_file: app.py
 pinned: false
 license: mit
-short_description: PoC for “PDF in product out”
+short_description: Classify PDF documents into product categories using AI
+---
+
+# Document Classification App
+
+A Gradio-based web application that classifies PDF documents into product categories using AI-powered analysis.
+
+## Features
+
+- 📄 **PDF Upload**: Upload any PDF document
+- 🔍 **Text Extraction**: Automatically extract text from PDFs
+- 🏷️ **AI-Powered Classification**: Classify documents into product categories
+- 🎯 **Multiple Methods**: Choose among smart semantic, semantic, keyword, and hybrid approaches
+- 📊 **Confidence Scores**: Get confidence scores for each classification
+- ⚙️ **Customizable Products**: Define your own product categories and descriptions
+
+## How to Use
+
+1. **Upload PDF**: Click the upload button and select your PDF file
+2. **Choose Method**: Select your preferred classification method (smart semantic, hybrid, semantic, or keyword)
+3. **Define Products**: Use the default product definitions or customize your own in JSON format
+4. **Classify**: Click "Classify Document" to analyze the PDF
+5. **View Results**: See the top 3 product matches with confidence scores
+
+## Setup
+
+### For Hugging Face Spaces (Production)
+
+1. Set your `OPENAI_API_KEY` in the Space settings:
+   - Go to your Space settings
+   - Add `OPENAI_API_KEY` as a secret
+   - Enter your OpenAI API key
+
+### For Local Development
+
+1. Clone this repository
+2. Install dependencies: `pip install -r requirements.txt`
+3. Optionally create a `.env` file in the project root with your API key:
+   ```
+   OPENAI_API_KEY=your-api-key-here
+   ```
+   Or set it as an environment variable: `export OPENAI_API_KEY="your-api-key"`
+4. Run the app: `python app.py`
+
+## Technical Details
+
+- **Framework**: Gradio for the web interface
+- **Embeddings**: OpenAI embeddings for semantic similarity
+- **Vector Store**: LangChain InMemoryVectorStore for efficient similarity search
+- **Classification Methods**: Semantic similarity, keyword matching, and a hybrid approach
+- **Text Processing**: PyPDF for PDF text extraction
+- **Architecture**: Simple modular design with clean separation of concerns
+
+## Project Structure
+
+```
+pdf2product/
+├── app.py                    # Main Gradio application
+├── requirements.txt          # Python dependencies
+├── .env                      # Environment variables (create this)
+└── pdf_qa/                   # Core package
+    ├── __init__.py           # Package initialization
+    ├── pdf_processor.py      # PDF text extraction and chunking
+    ├── product_classifier.py # Document classification engine
+    └── qa_engine.py          # Question answering engine
+```
+
+## Architecture
+
+Simple and clean architecture:
+
+- **Separation of Concerns**: UI logic (Gradio) is separate from business logic
+- **Modularity**: Focused components for PDF processing, classification, and Q&A
+- **Simplicity**: Minimal, focused modules that do one thing well
+
+## Example Product Categories
+
+The app includes several example product configurations:
+
+- **Invoice-Focused**: Invoice, Receipt, Quote/Estimate
+- **Travel-Focused**: Flight Ticket, Hotel Reservation, Travel Insurance
+- **Employment-Focused**: CV/Resume, Job Offer, Employment Contract
+
+Users can also define their own custom product categories in JSON format.
+
+## Limitations
+
+- Currently supports one PDF at a time
+- Requires an OpenAI API key
+- Best results with text-based PDFs (not scanned images)
+- Processing time depends on document size
+- Classification accuracy depends on document content quality
+
 ---
 
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
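
The README's "define your own" categories use the same three-field schema as the examples bundled in `app.py` (`name`, `description`, `keywords`). A minimal sketch with a hypothetical `bank_statement` category:

```python
# Hypothetical custom category; the schema mirrors the examples in app.py.
import json

custom_products = {
    "bank_statement": {
        "name": "Bank Statement",
        "description": "A periodic summary of account activity with balances, deposits, withdrawals, and transaction dates.",
        "keywords": ["statement", "balance", "deposit", "withdrawal", "account"],
    }
}

# The app's "Product definitions" textbox expects this as JSON:
print(json.dumps(custom_products, indent=2))
```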
app.py ADDED
@@ -0,0 +1,299 @@
+import gradio as gr
+import json
+from pdf_qa.pdf_processor import PDFProcessor
+from pdf_qa.product_classifier import ProductClassifier
+
+# Load .env file only in development (optional)
+try:
+    from dotenv import load_dotenv
+
+    load_dotenv()
+except ImportError:
+    pass  # dotenv not available in production
+
+# Global instances
+pdf_processor = PDFProcessor()
+# Placeholder; a classifier is created per request in classify_document
+classifier = None
+
+
+def classify_document(pdf_file, products_json, method):
+    """Classify document into product categories"""
+    if not pdf_file:
+        return "Please upload a PDF file first.", ""
+
+    if not products_json.strip():
+        return "Please provide product definitions.", ""
+
+    try:
+        # Parse products JSON
+        products = json.loads(products_json)
+
+        # Process PDF
+        pages = pdf_processor.process_pdf(pdf_file)
+
+        # Create classifier with products
+        classifier = ProductClassifier(products)
+
+        # Classify document
+        results = classifier.classify_document(pages, products, method)
+
+        # Format results for Gradio Label component
+        formatted_results = {}
+        for product_id, score in results[:3]:  # Top 3 results
+            product_name = products[product_id].get("name", product_id)
+            formatted_results[product_name] = score
+
+        # Get summary if using smart_semantic method
+        summary = ""
+        if method == "smart_semantic":
+            summary = classifier.get_summary(pages)
+
+        return formatted_results, summary
+
+    except json.JSONDecodeError:
+        return "Invalid JSON format for products.", ""
+    except Exception as e:
+        return str(e), ""
+
+
+# Create Gradio interface
+with gr.Blocks(title="Document Classification", theme=gr.themes.Soft()) as demo:
+    gr.Markdown("# 📄 Document Classification")
+
+    # Details section
+    with gr.Accordion("ℹ️ How it works", open=False):
+        gr.Markdown("""
+        **Document Classification System**
+
+        This AI-powered tool analyzes PDF documents and matches them to predefined product categories based on content similarity.
+
+        **Methods Available:**
+        - **Smart Semantic**: Uses an LLM to summarize the document, then finds semantic matches (recommended)
+        - **Semantic**: Direct semantic similarity between document and product descriptions
+        - **Keyword**: Matches based on keyword presence in the document
+        - **Hybrid**: Combines semantic and keyword approaches (70% semantic, 30% keyword)
+
+        **How to use:**
+        1. Upload a PDF document
+        2. Define your product categories with descriptions and keywords (JSON format) or use the examples at the bottom of the page
+        3. Choose a classification method
+        4. Get the top 3 matches with confidence scores
+        """)
+
+    with gr.Row():
+        with gr.Column(scale=1):
+            gr.Markdown("### Upload PDF")
+            pdf_input = gr.File(
+                label="Upload PDF", file_types=[".pdf"], type="filepath"
+            )
+
+            gr.Markdown("### Classification Method")
+            method_dropdown = gr.Dropdown(
+                choices=["hybrid", "smart_semantic", "semantic", "keyword"],
+                value="smart_semantic",
+                label="Select classification method",
+            )
+
+            classify_btn = gr.Button("Classify Document", variant="primary")
+
+        with gr.Column(scale=2):
+            gr.Markdown("### Product Definitions")
+            products_input = gr.Textbox(
+                label="Product definitions (JSON format)",
+                value="{}",
+                # lines=19,
+                placeholder="Enter product definitions in JSON format or use examples below...",
+            )
+
+    with gr.Row():
+        with gr.Column():
+            gr.Markdown("### Classification Results")
+            results_output = gr.Label(label="Top 3 matches with confidence scores")
+
+        with gr.Column():
+            gr.Markdown("### Document Summary")
+            summary_output = gr.Textbox(
+                label="LLM-generated summary (smart_semantic method only)",
+                lines=3,
+                interactive=False,
+            )
+
+    with gr.Accordion("Examples", open=True):
+        gr.Markdown("""
+        Below are example product definitions, grouped by category:
+        - Invoice-focused products
+        - Travel-focused products
+        - Employment-focused products
+
+        ### How to use the examples:
+        1. Click on any example below
+        2. Upload your PDF document
+        3. Choose your classification method
+        4. Click "Classify Document"
+        """)
+
+        # Example 1: Invoice-focused products
+        invoice_products = {
+            "invoice": {
+                "name": "Invoice",
+                "description": "A commercial document requesting payment for goods or services rendered. Contains billing information, itemized charges, tax amounts, payment terms, due dates, vendor details, and total amounts owed.",
+                "keywords": [
+                    "invoice",
+                    "bill",
+                    "payment",
+                    "amount",
+                    "due",
+                    "tax",
+                    "total",
+                    "vendor",
+                    "customer",
+                    "charges",
+                ],
+            },
+            "receipt": {
+                "name": "Receipt",
+                "description": "A proof of payment document showing completed transaction details, payment confirmation, and purchase information.",
+                "keywords": [
+                    "receipt",
+                    "payment",
+                    "transaction",
+                    "purchase",
+                    "paid",
+                    "confirmation",
+                    "total",
+                ],
+            },
+            "quote": {
+                "name": "Quote/Estimate",
+                "description": "A preliminary pricing document providing cost estimates for goods or services before purchase.",
+                "keywords": [
+                    "quote",
+                    "estimate",
+                    "pricing",
+                    "cost",
+                    "proposal",
+                    "preliminary",
+                    "before purchase",
+                ],
+            },
+        }
+
+        # Example 2: Travel-focused products
+        travel_products = {
+            "flight_ticket": {
+                "name": "Flight Ticket",
+                "description": "Airline ticket or booking confirmation with passenger details, flight information, and travel itinerary.",
+                "keywords": [
+                    "flight",
+                    "airline",
+                    "ticket",
+                    "booking",
+                    "passenger",
+                    "itinerary",
+                    "departure",
+                    "arrival",
+                ],
+            },
+            "hotel_reservation": {
+                "name": "Hotel Reservation",
+                "description": "Hotel booking confirmation with accommodation details, check-in/out dates, and room information.",
+                "keywords": [
+                    "hotel",
+                    "reservation",
+                    "booking",
+                    "accommodation",
+                    "check-in",
+                    "check-out",
+                    "room",
+                ],
+            },
+            "travel_insurance": {
+                "name": "Travel Insurance",
+                "description": "Insurance policy document covering travel-related risks, coverage details, and policy terms.",
+                "keywords": [
+                    "insurance",
+                    "policy",
+                    "coverage",
+                    "travel",
+                    "risk",
+                    "terms",
+                    "protection",
+                ],
+            },
+        }
+
+        # Example 3: Employment-focused products
+        employment_products = {
+            "cv_resume": {
+                "name": "CV/Resume",
+                "description": "Document summarizing education, work experience, skills, and qualifications for employment.",
+                "keywords": [
+                    "resume",
+                    "cv",
+                    "experience",
+                    "education",
+                    "skills",
+                    "employment",
+                    "qualifications",
+                ],
+            },
+            "job_offer": {
+                "name": "Job Offer",
+                "description": "Employment offer letter with position details, salary, benefits, and employment terms.",
+                "keywords": [
+                    "job offer",
+                    "employment",
+                    "position",
+                    "salary",
+                    "benefits",
+                    "terms",
+                    "offer letter",
+                ],
+            },
+            "employment_contract": {
+                "name": "Employment Contract",
+                "description": "Legal employment agreement with terms, conditions, responsibilities, and employment rights.",
+                "keywords": [
+                    "contract",
+                    "employment",
+                    "terms",
+                    "conditions",
+                    "responsibilities",
+                    "rights",
+                    "agreement",
+                ],
+            },
+        }
+
+        gr.Examples(
+            examples=[
+                [
+                    None,  # No PDF file
+                    json.dumps(invoice_products, indent=2),
+                    "smart_semantic",
+                ],
+                [
+                    None,  # No PDF file
+                    json.dumps(travel_products, indent=2),
+                    "smart_semantic",
+                ],
+                [
+                    None,  # No PDF file
+                    json.dumps(employment_products, indent=2),
+                    "smart_semantic",
+                ],
+            ],
+            inputs=[pdf_input, products_input, method_dropdown],
+            label="Product Definition Examples",
+        )
+
+    # Set up event handlers
+    classify_btn.click(
+        fn=classify_document,
+        inputs=[pdf_input, products_input, method_dropdown],
+        outputs=[results_output, summary_output],
+    )
+
+if __name__ == "__main__":
+    demo.launch()
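
The Gradio button is a thin wrapper around `classify_document`, so the same flow can run headless. A minimal sketch, assuming `OPENAI_API_KEY` is set and using `sample.pdf` as a placeholder path:

```python
# Headless sketch of the flow behind the "Classify Document" button.
# Assumptions: OPENAI_API_KEY is set; "sample.pdf" is a placeholder path.
from pdf_qa.pdf_processor import PDFProcessor
from pdf_qa.product_classifier import ProductClassifier

products = {
    "invoice": {
        "name": "Invoice",
        "description": "A commercial document requesting payment for goods or services.",
        "keywords": ["invoice", "payment", "total", "due"],
    }
}

pages = PDFProcessor().process_pdf("sample.pdf")
classifier = ProductClassifier(products)
# Print the top 3 (product, score) pairs, mirroring the UI's Label output
for product_id, score in classifier.classify_document(pages, products, "keyword")[:3]:
    print(f"{products[product_id]['name']}: {score:.2f}")
```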
pdf_qa/__init__.py ADDED
@@ -0,0 +1,8 @@
+"""
+PDF Question Answering Package
+
+A modular package for processing PDFs and answering questions using AI.
+"""
+
+__version__ = "1.0.0"
+__author__ = "PDF QA Team"
pdf_qa/pdf_processor.py ADDED
@@ -0,0 +1,31 @@
+"""
+PDF Processing Module
+
+Simple PDF text extraction and chunking.
+"""
+
+from typing import List
+from langchain_community.document_loaders import PyPDFLoader
+from langchain_text_splitters import RecursiveCharacterTextSplitter
+from langchain_core.documents import Document
+
+
+class PDFProcessor:
+    """Simple PDF text extraction and chunking."""
+
+    def __init__(self, chunk_size: int = 1000, chunk_overlap: int = 200) -> None:
+        self.text_splitter = RecursiveCharacterTextSplitter(
+            chunk_size=chunk_size,
+            chunk_overlap=chunk_overlap,
+            length_function=len,
+        )
+
+    def process_pdf(self, pdf_file: str) -> List[Document]:
+        """Extract text from a PDF, returning one Document per page."""
+        # Load and extract text
+        loader = PyPDFLoader(pdf_file)
+        pages = []
+        for page in loader.lazy_load():
+            pages.append(page)
+
+        return pages
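
Note that `process_pdf` returns whole pages; the splitter configured in `__init__` is left for callers to apply via `split_documents`. A small usage sketch, with `sample.pdf` as a placeholder path:

```python
# Sketch: page extraction plus optional chunking with the configured splitter.
# "sample.pdf" is a placeholder path.
from pdf_qa.pdf_processor import PDFProcessor

processor = PDFProcessor(chunk_size=1000, chunk_overlap=200)
pages = processor.process_pdf("sample.pdf")
print(f"{len(pages)} pages extracted")

# Chunking is opt-in; split_documents() honors chunk_size/chunk_overlap.
chunks = processor.text_splitter.split_documents(pages)
print(f"{len(chunks)} chunks")
```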
pdf_qa/product_classifier.py ADDED
@@ -0,0 +1,178 @@
+"""
+Document Classification Module
+
+Classifies PDF documents into product categories using different methods.
+"""
+
+import json
+from typing import List, Tuple, Dict
+from langchain_openai.embeddings import OpenAIEmbeddings
+from langchain_openai.chat_models import ChatOpenAI
+from langchain_core.vectorstores import InMemoryVectorStore
+
+
+class ProductClassifier:
+    """Classifies documents into product categories."""
+
+    def __init__(self, products, temperature=0):
+        self.products = products
+        self.temperature = temperature
+        self.embeddings = OpenAIEmbeddings()
+        self.llm = ChatOpenAI(temperature=temperature)
+
+    def _extract_text_from_documents(self, documents: List) -> str:
+        """Extract text from a list of Document objects."""
+        if not documents:
+            return ""
+
+        # Combine all document content
+        text_parts = []
+        for doc in documents:
+            if hasattr(doc, "page_content"):
+                text_parts.append(doc.page_content)
+            elif isinstance(doc, str):
+                text_parts.append(doc)
+            else:
+                # Handle other document types
+                text_parts.append(str(doc))
+
+        return "\n".join(text_parts)
+
+    def semantic_similarity_classification(
+        self, documents: List, products: Dict
+    ) -> List[Tuple[str, float]]:
+        """Classify using semantic similarity with embeddings."""
+        # Create vector store from documents
+        vector_store = InMemoryVectorStore.from_documents(documents, self.embeddings)
+
+        # Calculate similarities with each product
+        similarities = []
+        for product_id, product_info in products.items():
+            # Search for similar content using the product definition
+            similar_docs = vector_store.similarity_search_with_score(
+                json.dumps(product_info), k=1
+            )
+
+            # Get similarity score (higher is better)
+            if similar_docs:
+                similarity = similar_docs[0][1]  # Direct similarity score
+            else:
+                similarity = 0
+
+            similarities.append((product_id, similarity))
+
+        # Sort by similarity (highest first)
+        similarities.sort(key=lambda x: x[1], reverse=True)
+        return similarities
+
+    def smart_semantic_classification(
+        self, documents: List, products: Dict
+    ) -> List[Tuple[str, float]]:
+        """Classify using LLM summarization + semantic similarity."""
+        # Generate summary
+        summary = self._generate_summary(self._extract_text_from_documents(documents))
+
+        # Create vector store with summary
+        vector_store = InMemoryVectorStore.from_texts([summary], self.embeddings)
+
+        # Calculate similarities with each product
+        similarities = []
+        for product_id, product_info in products.items():
+            # Search for similar content using the product description
+            similar_docs = vector_store.similarity_search_with_score(
+                product_info["description"], k=1
+            )
+
+            # Get similarity score (higher is better)
+            if similar_docs:
+                similarity = similar_docs[0][1]  # Direct similarity score
+            else:
+                similarity = 0
+
+            similarities.append((product_id, similarity))
+
+        # Sort by similarity (highest first)
+        similarities.sort(key=lambda x: x[1], reverse=True)
+        return similarities
+
+    def _generate_summary(self, text: str) -> str:
+        """Generate a summary of the document text."""
+        prompt = f"""Summarize the following document, focusing on the main type and purpose of the document:
+
+{text}...
+
+Summary:"""
+
+        try:
+            response = self.llm.invoke(prompt)
+            return response.content.strip()
+        except Exception:
+            # Fall back to the first 1000 characters if summarization fails
+            return text[:1000]
+
+    def keyword_matching_classification(
+        self, documents: List, products: Dict
+    ) -> List[Tuple[str, float]]:
+        """Classify using keyword matching."""
+        # Extract text from documents
+        document_text = self._extract_text_from_documents(documents)
+        document_text_lower = document_text.lower()
+
+        scores = []
+        for product_id, product_info in products.items():
+            keywords = product_info["keywords"]
+            matches = sum(
+                1 for keyword in keywords if keyword.lower() in document_text_lower
+            )
+
+            # Calculate score as the fraction of keywords matched
+            score = matches / len(keywords) if keywords else 0
+            scores.append((product_id, score))
+
+        # Sort by score (highest first)
+        scores.sort(key=lambda x: x[1], reverse=True)
+        return scores
+
+    def hybrid_classification(
+        self, documents: List, products: Dict
+    ) -> List[Tuple[str, float]]:
+        """Classify using both semantic similarity and keyword matching."""
+        semantic_results = self.semantic_similarity_classification(documents, products)
+        keyword_results = self.keyword_matching_classification(documents, products)
+
+        # Combine scores (70% semantic, 30% keyword)
+        combined_scores = {}
+        for product_id, semantic_score in semantic_results:
+            keyword_score = next(
+                (score for pid, score in keyword_results if pid == product_id), 0
+            )
+            combined_score = 0.7 * semantic_score + 0.3 * keyword_score
+            combined_scores[product_id] = combined_score
+
+        # Sort by combined score
+        sorted_results = sorted(
+            combined_scores.items(), key=lambda x: x[1], reverse=True
+        )
+        return sorted_results
+
+    def classify_document(
+        self, documents: List, products: Dict, method: str = "hybrid"
+    ) -> List[Tuple[str, float]]:
+        """Classify document using the specified method."""
+        if method == "semantic":
+            return self.semantic_similarity_classification(documents, products)
+        if method == "smart_semantic":
+            return self.smart_semantic_classification(documents, products)
+        if method == "keyword":
+            return self.keyword_matching_classification(documents, products)
+        if method == "hybrid":
+            return self.hybrid_classification(documents, products)
+        raise ValueError(f"Unknown classification method: {method}")
+
+    def get_summary(self, documents: List) -> str:
+        """Get document summary for display."""
+        return self._generate_summary(self._extract_text_from_documents(documents))
+
+    def get_product_info(self, product_id: str) -> Dict:
+        """Get product information by ID."""
+        return self.products.get(product_id, {})
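
The hybrid method's weighting is fixed at 70/30, so scores are easy to check by hand: a product with semantic similarity 0.82 and 6 of 10 keywords matched (keyword score 0.6) combines to 0.7 × 0.82 + 0.3 × 0.6 = 0.754. In code:

```python
# Worked example of the fixed 70/30 blend in hybrid_classification.
semantic_score = 0.82      # cosine similarity from the vector store
keyword_score = 6 / 10     # 6 of 10 keywords found in the document text
combined = 0.7 * semantic_score + 0.3 * keyword_score
print(round(combined, 3))  # 0.754
```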
pdf_qa/qa_engine.py ADDED
@@ -0,0 +1,64 @@
+"""
+Question Answering Engine
+
+Simple Q&A using LangChain and OpenAI.
+"""
+
+import os
+from langchain_openai.chat_models import ChatOpenAI
+from langchain_openai.embeddings import OpenAIEmbeddings
+from langchain_community.vectorstores import FAISS
+from langchain_core.runnables import RunnablePassthrough
+from langchain_core.output_parsers import StrOutputParser
+from langchain_core.prompts import ChatPromptTemplate
+
+
+class QAEngine:
+    """Simple question answering engine."""
+
+    def __init__(self, temperature=0):
+        self.temperature = temperature
+        self.retriever = None
+        self.llm = None
+
+    def setup(self, chunks):
+        """Set up the QA chain with document chunks."""
+        if not os.getenv("OPENAI_API_KEY"):
+            raise ValueError("OPENAI_API_KEY not set")
+
+        # Create vector store
+        embeddings = OpenAIEmbeddings()
+        vector_store = FAISS.from_texts(chunks, embeddings)
+
+        # Set up retriever and LLM
+        self.retriever = vector_store.as_retriever(search_kwargs={"k": 3})
+        self.llm = ChatOpenAI(temperature=self.temperature)
+
+    def ask(self, question):
+        """Ask a question about the document."""
+        if not self.retriever or not self.llm:
+            raise ValueError("Please process a PDF first")
+
+        if not question.strip():
+            raise ValueError("Please enter a question")
+
+        # Create prompt template
+        template = """Answer the question based on the following context:
+
+Context: {context}
+
+Question: {question}
+
+Answer:"""
+
+        prompt = ChatPromptTemplate.from_template(template)
+
+        # Create chain
+        chain = (
+            {"context": self.retriever, "question": RunnablePassthrough()}
+            | prompt
+            | self.llm
+            | StrOutputParser()
+        )
+
+        return chain.invoke(question)
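
`QAEngine` is not wired into the Gradio app in this commit, but it works standalone. A minimal sketch, assuming `OPENAI_API_KEY` is set; `setup` expects plain-text chunks because `FAISS.from_texts` takes strings, and the chunks below are toy examples:

```python
# Standalone QAEngine sketch; assumes OPENAI_API_KEY is set.
from pdf_qa.qa_engine import QAEngine

engine = QAEngine(temperature=0)
engine.setup([
    "Invoice #1042 was issued on 2024-03-01.",          # toy chunk
    "The total amount due is 150 EUR within 30 days.",  # toy chunk
])
print(engine.ask("What is the total amount due?"))
```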
requirements.txt ADDED
@@ -0,0 +1,10 @@
+gradio
+langchain
+langchain-openai
+langchain-community
+langchain-core
+langchain-text-splitters
+openai
+faiss-cpu
+pypdf
+python-dotenv
test_simple.py ADDED
@@ -0,0 +1,63 @@
+#!/usr/bin/env python3
+"""
+Simple test for the Document Classification application
+"""
+
+import os
+import sys
+
+# Load .env file only in development (optional)
+try:
+    from dotenv import load_dotenv
+    load_dotenv()
+except ImportError:
+    pass  # dotenv not available in production
+
+def test_imports():
+    """Test that all modules can be imported."""
+    print("🧪 Testing Document Classification Structure")
+    print("=" * 50)
+
+    try:
+        from pdf_qa.pdf_processor import PDFProcessor
+        from pdf_qa.product_classifier import ProductClassifier
+        print("✅ All modules imported successfully")
+
+        # Test initialization (ProductClassifier requires a products dict)
+        pdf_processor = PDFProcessor()
+        classifier = ProductClassifier({})
+        print("✅ Components initialized successfully")
+
+        # Test classification methods
+        test_products = {
+            "test": {
+                "name": "Test Product",
+                "description": "A test product for classification",
+                "keywords": ["test", "product"]
+            }
+        }
+
+        # Test classifier initialization
+        classifier = ProductClassifier(test_products)
+        print("✅ Classification methods available")
+
+        # Check API key
+        if not os.getenv("OPENAI_API_KEY"):
+            print("⚠️ OPENAI_API_KEY not set (expected for testing)")
+        else:
+            print("✅ OPENAI_API_KEY found")
+
+        print("\n🎉 Document classification structure working correctly!")
+        print("\nTo run the app:")
+        print("1. Set OPENAI_API_KEY environment variable")
+        print("2. Run: python app.py")
+
+        return True
+
+    except Exception as e:
+        print(f"❌ Error: {str(e)}")
+        return False
+
+if __name__ == "__main__":
+    success = test_imports()
+    sys.exit(0 if success else 1)