kevinconka committed
Commit e23e895 · Parent: c6db24f

first commit, working app

.gitignore ADDED
@@ -0,0 +1,59 @@
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# Virtual environments
+venv/
+env/
+ENV/
+env.bak/
+venv.bak/
+
+# Environment variables
+.env
+.env.local
+.env.development.local
+.env.test.local
+.env.production.local
+
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+
+# OS
+.DS_Store
+.DS_Store?
+._*
+.Spotlight-V100
+.Trashes
+ehthumbs.db
+Thumbs.db
+
+# Logs
+*.log
+
+# Temporary files
+*.tmp
+*.temp
README.md CHANGED
@@ -1,14 +1,106 @@
 ---
-title: pdf2product
-emoji: 🏆
+title: Document Classification
+emoji: 🏷️
 colorFrom: green
-colorTo: yellow
+colorTo: blue
 sdk: gradio
 sdk_version: 5.44.1
 app_file: app.py
 pinned: false
 license: mit
-short_description: PoC for “PDF in product out”
+short_description: Classify PDF documents into product categories using AI
+---
+
+# Document Classification App
+
+A Gradio-based web application that classifies PDF documents into product categories using AI-powered analysis.
+
+## Features
+
+- 📄 **PDF Upload**: Upload any PDF document
+- 🔍 **Text Extraction**: Automatically extract text from PDFs
+- 🏷️ **AI-Powered Classification**: Classify documents into product categories
+- 🎯 **Multiple Methods**: Choose among smart semantic, semantic, keyword, and hybrid approaches
+- 📊 **Confidence Scores**: Get confidence scores for each classification
+- ⚙️ **Customizable Products**: Define your own product categories and descriptions
+
+## How to Use
+
+1. **Upload PDF**: Click the upload button and select your PDF file
+2. **Choose Method**: Select your preferred classification method (smart semantic, hybrid, semantic, or keyword)
+3. **Define Products**: Use the default product definitions or customize your own in JSON format
+4. **Classify**: Click "Classify Document" to analyze the PDF
+5. **View Results**: See the top 3 product matches with confidence scores
+
+## Setup
+
+### For Hugging Face Spaces (Production)
+
+1. Set your `OPENAI_API_KEY` in the Space settings:
+   - Go to your Space settings
+   - Add `OPENAI_API_KEY` as a secret
+   - Enter your OpenAI API key
+
+### For Local Development
+
+1. Clone this repository
+2. Install dependencies: `pip install -r requirements.txt`
+3. Optionally create a `.env` file in the project root with your API key:
+   ```
+   OPENAI_API_KEY=your-api-key-here
+   ```
+   Or set it as an environment variable: `export OPENAI_API_KEY="your-api-key"`
+4. Run the app: `python app.py`
+
+## Technical Details
+
+- **Framework**: Gradio for the web interface
+- **Embeddings**: OpenAI embeddings for semantic similarity
+- **Vector Store**: LangChain InMemoryVectorStore for efficient similarity search
+- **Classification Methods**: Semantic similarity, keyword matching, and a hybrid approach
+- **Text Processing**: PyPDF for PDF text extraction
+- **Architecture**: Simple modular design with clean separation of concerns
+
+## Project Structure
+
+```
+pdf2product/
+├── app.py                    # Main Gradio application
+├── requirements.txt          # Python dependencies
+├── .env                      # Environment variables (create this)
+└── pdf_qa/                   # Core package
+    ├── __init__.py           # Package initialization
+    ├── pdf_processor.py      # PDF text extraction and chunking
+    ├── product_classifier.py # Document classification engine
+    └── qa_engine.py          # Question answering engine
+```
+
+## Architecture
+
+Simple and clean architecture:
+
+- **Separation of Concerns**: UI logic (Gradio) is separate from business logic
+- **Modularity**: Focused components for PDF processing, classification, and Q&A
+- **Simplicity**: Minimal, focused modules that do one thing well
+
+## Example Product Categories
+
+The app includes several example product configurations:
+
+- **Invoice-Focused**: Invoice, Receipt, Quote/Estimate
+- **Travel-Focused**: Flight Ticket, Hotel Reservation, Travel Insurance
+- **Employment-Focused**: CV/Resume, Job Offer, Employment Contract
+
+Users can also define their own custom product categories in JSON format.
+
+## Limitations
+
+- Currently supports one PDF at a time
+- Requires an OpenAI API key
+- Best results with text-based PDFs (not scanned images)
+- Processing time depends on document size
+- Classification accuracy depends on document content quality
+
 ---
 
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
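
The README's "define your own" categories use the same three-field schema as the examples bundled in `app.py` (`name`, `description`, `keywords`). A minimal sketch with a hypothetical `bank_statement` category:

```python
# Hypothetical custom category; the schema mirrors the examples in app.py.
import json

custom_products = {
    "bank_statement": {
        "name": "Bank Statement",
        "description": "A periodic summary of account activity with balances, deposits, withdrawals, and transaction dates.",
        "keywords": ["statement", "balance", "deposit", "withdrawal", "account"],
    }
}

# The app's "Product definitions" textbox expects this as JSON:
print(json.dumps(custom_products, indent=2))
```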
app.py ADDED
@@ -0,0 +1,299 @@
+import gradio as gr
+import json
+from pdf_qa.pdf_processor import PDFProcessor
+from pdf_qa.product_classifier import ProductClassifier
+
+# Load .env file only in development (optional)
+try:
+    from dotenv import load_dotenv
+
+    load_dotenv()
+except ImportError:
+    pass  # dotenv not available in production
+
+# Global instances
+pdf_processor = PDFProcessor()
+# Placeholder; a classifier is created per request in classify_document
+classifier = None
+
+
+def classify_document(pdf_file, products_json, method):
+    """Classify document into product categories"""
+    if not pdf_file:
+        return "Please upload a PDF file first.", ""
+
+    if not products_json.strip():
+        return "Please provide product definitions.", ""
+
+    try:
+        # Parse products JSON
+        products = json.loads(products_json)
+
+        # Process PDF
+        pages = pdf_processor.process_pdf(pdf_file)
+
+        # Create classifier with products
+        classifier = ProductClassifier(products)
+
+        # Classify document
+        results = classifier.classify_document(pages, products, method)
+
+        # Format results for Gradio Label component
+        formatted_results = {}
+        for product_id, score in results[:3]:  # Top 3 results
+            product_name = products[product_id].get("name", product_id)
+            formatted_results[product_name] = score
+
+        # Get summary if using smart_semantic method
+        summary = ""
+        if method == "smart_semantic":
+            summary = classifier.get_summary(pages)
+
+        return formatted_results, summary
+
+    except json.JSONDecodeError:
+        return "Invalid JSON format for products.", ""
+    except Exception as e:
+        return str(e), ""
+
+
+# Create Gradio interface
+with gr.Blocks(title="Document Classification", theme=gr.themes.Soft()) as demo:
+    gr.Markdown("# 📄 Document Classification")
+
+    # Details section
+    with gr.Accordion("ℹ️ How it works", open=False):
+        gr.Markdown("""
+        **Document Classification System**
+
+        This AI-powered tool analyzes PDF documents and matches them to predefined product categories based on content similarity.
+
+        **Methods Available:**
+        - **Smart Semantic**: Uses an LLM to summarize the document, then finds semantic matches (recommended)
+        - **Semantic**: Direct semantic similarity between document and product descriptions
+        - **Keyword**: Matches based on keyword presence in the document
+        - **Hybrid**: Combines semantic and keyword approaches (70% semantic, 30% keyword)
+
+        **How to use:**
+        1. Upload a PDF document
+        2. Define your product categories with descriptions and keywords (JSON format) or use the examples at the bottom of the page
+        3. Choose a classification method
+        4. Get the top 3 matches with confidence scores
+        """)
+
+    with gr.Row():
+        with gr.Column(scale=1):
+            gr.Markdown("### Upload PDF")
+            pdf_input = gr.File(
+                label="Upload PDF", file_types=[".pdf"], type="filepath"
+            )
+
+            gr.Markdown("### Classification Method")
+            method_dropdown = gr.Dropdown(
+                choices=["hybrid", "smart_semantic", "semantic", "keyword"],
+                value="smart_semantic",
+                label="Select classification method",
+            )
+
+            classify_btn = gr.Button("Classify Document", variant="primary")
+
+        with gr.Column(scale=2):
+            gr.Markdown("### Product Definitions")
+            products_input = gr.Textbox(
+                label="Product definitions (JSON format)",
+                value="{}",
+                # lines=19,
+                placeholder="Enter product definitions in JSON format or use examples below...",
+            )
+
+    with gr.Row():
+        with gr.Column():
+            gr.Markdown("### Classification Results")
+            results_output = gr.Label(label="Top 3 matches with confidence scores")
+
+        with gr.Column():
+            gr.Markdown("### Document Summary")
+            summary_output = gr.Textbox(
+                label="LLM-generated summary (smart_semantic method only)",
+                lines=3,
+                interactive=False,
+            )
+
+    with gr.Accordion("Examples", open=True):
+        gr.Markdown("""
+        Below are example product definitions, grouped by category:
+        - Invoice-focused products
+        - Travel-focused products
+        - Employment-focused products
+
+        ### How to use the examples:
+        1. Click on any example below
+        2. Upload your PDF document
+        3. Choose your classification method
+        4. Click "Classify Document"
+        """)
+
+        # Example 1: Invoice-focused products
+        invoice_products = {
+            "invoice": {
+                "name": "Invoice",
+                "description": "A commercial document requesting payment for goods or services rendered. Contains billing information, itemized charges, tax amounts, payment terms, due dates, vendor details, and total amounts owed.",
+                "keywords": [
+                    "invoice",
+                    "bill",
+                    "payment",
+                    "amount",
+                    "due",
+                    "tax",
+                    "total",
+                    "vendor",
+                    "customer",
+                    "charges",
+                ],
+            },
+            "receipt": {
+                "name": "Receipt",
+                "description": "A proof of payment document showing completed transaction details, payment confirmation, and purchase information.",
+                "keywords": [
+                    "receipt",
+                    "payment",
+                    "transaction",
+                    "purchase",
+                    "paid",
+                    "confirmation",
+                    "total",
+                ],
+            },
+            "quote": {
+                "name": "Quote/Estimate",
+                "description": "A preliminary pricing document providing cost estimates for goods or services before purchase.",
+                "keywords": [
+                    "quote",
+                    "estimate",
+                    "pricing",
+                    "cost",
+                    "proposal",
+                    "preliminary",
+                    "before purchase",
+                ],
+            },
+        }
+
+        # Example 2: Travel-focused products
+        travel_products = {
+            "flight_ticket": {
+                "name": "Flight Ticket",
+                "description": "Airline ticket or booking confirmation with passenger details, flight information, and travel itinerary.",
+                "keywords": [
+                    "flight",
+                    "airline",
+                    "ticket",
+                    "booking",
+                    "passenger",
+                    "itinerary",
+                    "departure",
+                    "arrival",
+                ],
+            },
+            "hotel_reservation": {
+                "name": "Hotel Reservation",
+                "description": "Hotel booking confirmation with accommodation details, check-in/out dates, and room information.",
+                "keywords": [
+                    "hotel",
+                    "reservation",
+                    "booking",
+                    "accommodation",
+                    "check-in",
+                    "check-out",
+                    "room",
+                ],
+            },
+            "travel_insurance": {
+                "name": "Travel Insurance",
+                "description": "Insurance policy document covering travel-related risks, coverage details, and policy terms.",
+                "keywords": [
+                    "insurance",
+                    "policy",
+                    "coverage",
+                    "travel",
+                    "risk",
+                    "terms",
+                    "protection",
+                ],
+            },
+        }
+
+        # Example 3: Employment-focused products
+        employment_products = {
+            "cv_resume": {
+                "name": "CV/Resume",
+                "description": "Document summarizing education, work experience, skills, and qualifications for employment.",
+                "keywords": [
+                    "resume",
+                    "cv",
+                    "experience",
+                    "education",
+                    "skills",
+                    "employment",
+                    "qualifications",
+                ],
+            },
+            "job_offer": {
+                "name": "Job Offer",
+                "description": "Employment offer letter with position details, salary, benefits, and employment terms.",
+                "keywords": [
+                    "job offer",
+                    "employment",
+                    "position",
+                    "salary",
+                    "benefits",
+                    "terms",
+                    "offer letter",
+                ],
+            },
+            "employment_contract": {
+                "name": "Employment Contract",
+                "description": "Legal employment agreement with terms, conditions, responsibilities, and employment rights.",
+                "keywords": [
+                    "contract",
+                    "employment",
+                    "terms",
+                    "conditions",
+                    "responsibilities",
+                    "rights",
+                    "agreement",
+                ],
+            },
+        }
+
+        gr.Examples(
+            examples=[
+                [
+                    None,  # No PDF file
+                    json.dumps(invoice_products, indent=2),
+                    "smart_semantic",
+                ],
+                [
+                    None,  # No PDF file
+                    json.dumps(travel_products, indent=2),
+                    "smart_semantic",
+                ],
+                [
+                    None,  # No PDF file
+                    json.dumps(employment_products, indent=2),
+                    "smart_semantic",
+                ],
+            ],
+            inputs=[pdf_input, products_input, method_dropdown],
+            label="Product Definition Examples",
+        )
+
+    # Set up event handlers
+    classify_btn.click(
+        fn=classify_document,
+        inputs=[pdf_input, products_input, method_dropdown],
+        outputs=[results_output, summary_output],
+    )
+
+if __name__ == "__main__":
+    demo.launch()
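
The Gradio button is a thin wrapper around `classify_document`, so the same flow can run headless. A minimal sketch, assuming `OPENAI_API_KEY` is set and using `sample.pdf` as a placeholder path:

```python
# Headless sketch of the flow behind the "Classify Document" button.
# Assumptions: OPENAI_API_KEY is set; "sample.pdf" is a placeholder path.
from pdf_qa.pdf_processor import PDFProcessor
from pdf_qa.product_classifier import ProductClassifier

products = {
    "invoice": {
        "name": "Invoice",
        "description": "A commercial document requesting payment for goods or services.",
        "keywords": ["invoice", "payment", "total", "due"],
    }
}

pages = PDFProcessor().process_pdf("sample.pdf")
classifier = ProductClassifier(products)
# Print the top 3 (product, score) pairs, mirroring the UI's Label output
for product_id, score in classifier.classify_document(pages, products, "keyword")[:3]:
    print(f"{products[product_id]['name']}: {score:.2f}")
```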
pdf_qa/__init__.py ADDED
@@ -0,0 +1,8 @@
+"""
+PDF Question Answering Package
+
+A modular package for processing PDFs and answering questions using AI.
+"""
+
+__version__ = "1.0.0"
+__author__ = "PDF QA Team"
pdf_qa/pdf_processor.py ADDED
@@ -0,0 +1,31 @@
+"""
+PDF Processing Module
+
+Simple PDF text extraction and chunking.
+"""
+
+from typing import List
+from langchain_community.document_loaders import PyPDFLoader
+from langchain_text_splitters import RecursiveCharacterTextSplitter
+from langchain_core.documents import Document
+
+
+class PDFProcessor:
+    """Simple PDF text extraction and chunking."""
+
+    def __init__(self, chunk_size: int = 1000, chunk_overlap: int = 200) -> None:
+        self.text_splitter = RecursiveCharacterTextSplitter(
+            chunk_size=chunk_size,
+            chunk_overlap=chunk_overlap,
+            length_function=len,
+        )
+
+    def process_pdf(self, pdf_file: str) -> List[Document]:
+        """Extract text from a PDF, returning one Document per page."""
+        # Load and extract text
+        loader = PyPDFLoader(pdf_file)
+        pages = []
+        for page in loader.lazy_load():
+            pages.append(page)
+
+        return pages
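
Note that `process_pdf` returns whole pages; the splitter configured in `__init__` is left for callers to apply via `split_documents`. A small usage sketch, with `sample.pdf` as a placeholder path:

```python
# Sketch: page extraction plus optional chunking with the configured splitter.
# "sample.pdf" is a placeholder path.
from pdf_qa.pdf_processor import PDFProcessor

processor = PDFProcessor(chunk_size=1000, chunk_overlap=200)
pages = processor.process_pdf("sample.pdf")
print(f"{len(pages)} pages extracted")

# Chunking is opt-in; split_documents() honors chunk_size/chunk_overlap.
chunks = processor.text_splitter.split_documents(pages)
print(f"{len(chunks)} chunks")
```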
pdf_qa/product_classifier.py ADDED
@@ -0,0 +1,178 @@
+"""
+Document Classification Module
+
+Classifies PDF documents into product categories using different methods.
+"""
+
+import json
+from typing import List, Tuple, Dict
+from langchain_openai.embeddings import OpenAIEmbeddings
+from langchain_openai.chat_models import ChatOpenAI
+from langchain_core.vectorstores import InMemoryVectorStore
+
+
+class ProductClassifier:
+    """Classifies documents into product categories."""
+
+    def __init__(self, products, temperature=0):
+        self.products = products
+        self.temperature = temperature
+        self.embeddings = OpenAIEmbeddings()
+        self.llm = ChatOpenAI(temperature=temperature)
+
+    def _extract_text_from_documents(self, documents: List) -> str:
+        """Extract text from a list of Document objects."""
+        if not documents:
+            return ""
+
+        # Combine all document content
+        text_parts = []
+        for doc in documents:
+            if hasattr(doc, "page_content"):
+                text_parts.append(doc.page_content)
+            elif isinstance(doc, str):
+                text_parts.append(doc)
+            else:
+                # Handle other document types
+                text_parts.append(str(doc))
+
+        return "\n".join(text_parts)
+
+    def semantic_similarity_classification(
+        self, documents: List, products: Dict
+    ) -> List[Tuple[str, float]]:
+        """Classify using semantic similarity with embeddings."""
+        # Create vector store from documents
+        vector_store = InMemoryVectorStore.from_documents(documents, self.embeddings)
+
+        # Calculate similarities with each product
+        similarities = []
+        for product_id, product_info in products.items():
+            # Search for similar content using the product definition
+            similar_docs = vector_store.similarity_search_with_score(
+                json.dumps(product_info), k=1
+            )
+
+            # Get similarity score (higher is better)
+            if similar_docs:
+                similarity = similar_docs[0][1]  # Direct similarity score
+            else:
+                similarity = 0
+
+            similarities.append((product_id, similarity))
+
+        # Sort by similarity (highest first)
+        similarities.sort(key=lambda x: x[1], reverse=True)
+        return similarities
+
+    def smart_semantic_classification(
+        self, documents: List, products: Dict
+    ) -> List[Tuple[str, float]]:
+        """Classify using LLM summarization + semantic similarity."""
+        # Generate summary
+        summary = self._generate_summary(self._extract_text_from_documents(documents))
+
+        # Create vector store with summary
+        vector_store = InMemoryVectorStore.from_texts([summary], self.embeddings)
+
+        # Calculate similarities with each product
+        similarities = []
+        for product_id, product_info in products.items():
+            # Search for similar content using the product description
+            similar_docs = vector_store.similarity_search_with_score(
+                product_info["description"], k=1
+            )
+
+            # Get similarity score (higher is better)
+            if similar_docs:
+                similarity = similar_docs[0][1]  # Direct similarity score
+            else:
+                similarity = 0
+
+            similarities.append((product_id, similarity))
+
+        # Sort by similarity (highest first)
+        similarities.sort(key=lambda x: x[1], reverse=True)
+        return similarities
+
+    def _generate_summary(self, text: str) -> str:
+        """Generate a summary of the document text."""
+        prompt = f"""Summarize the following document, focusing on the main type and purpose of the document:
+
+{text}...
+
+Summary:"""
+
+        try:
+            response = self.llm.invoke(prompt)
+            return response.content.strip()
+        except Exception:
+            # Fall back to the first 1000 characters if summarization fails
+            return text[:1000]
+
+    def keyword_matching_classification(
+        self, documents: List, products: Dict
+    ) -> List[Tuple[str, float]]:
+        """Classify using keyword matching."""
+        # Extract text from documents
+        document_text = self._extract_text_from_documents(documents)
+        document_text_lower = document_text.lower()
+
+        scores = []
+        for product_id, product_info in products.items():
+            keywords = product_info["keywords"]
+            matches = sum(
+                1 for keyword in keywords if keyword.lower() in document_text_lower
+            )
+
+            # Calculate score as the fraction of keywords matched
+            score = matches / len(keywords) if keywords else 0
+            scores.append((product_id, score))
+
+        # Sort by score (highest first)
+        scores.sort(key=lambda x: x[1], reverse=True)
+        return scores
+
+    def hybrid_classification(
+        self, documents: List, products: Dict
+    ) -> List[Tuple[str, float]]:
+        """Classify using both semantic similarity and keyword matching."""
+        semantic_results = self.semantic_similarity_classification(documents, products)
+        keyword_results = self.keyword_matching_classification(documents, products)
+
+        # Combine scores (70% semantic, 30% keyword)
+        combined_scores = {}
+        for product_id, semantic_score in semantic_results:
+            keyword_score = next(
+                (score for pid, score in keyword_results if pid == product_id), 0
+            )
+            combined_score = 0.7 * semantic_score + 0.3 * keyword_score
+            combined_scores[product_id] = combined_score
+
+        # Sort by combined score
+        sorted_results = sorted(
+            combined_scores.items(), key=lambda x: x[1], reverse=True
+        )
+        return sorted_results
+
+    def classify_document(
+        self, documents: List, products: Dict, method: str = "hybrid"
+    ) -> List[Tuple[str, float]]:
+        """Classify document using the specified method."""
+        if method == "semantic":
+            return self.semantic_similarity_classification(documents, products)
+        if method == "smart_semantic":
+            return self.smart_semantic_classification(documents, products)
+        if method == "keyword":
+            return self.keyword_matching_classification(documents, products)
+        if method == "hybrid":
+            return self.hybrid_classification(documents, products)
+        raise ValueError(f"Unknown classification method: {method}")
+
+    def get_summary(self, documents: List) -> str:
+        """Get document summary for display."""
+        return self._generate_summary(self._extract_text_from_documents(documents))
+
+    def get_product_info(self, product_id: str) -> Dict:
+        """Get product information by ID."""
+        return self.products.get(product_id, {})
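
The hybrid method's weighting is fixed at 70/30, so scores are easy to check by hand: a product with semantic similarity 0.82 and 6 of 10 keywords matched (keyword score 0.6) combines to 0.7 × 0.82 + 0.3 × 0.6 = 0.754. In code:

```python
# Worked example of the fixed 70/30 blend in hybrid_classification.
semantic_score = 0.82      # cosine similarity from the vector store
keyword_score = 6 / 10     # 6 of 10 keywords found in the document text
combined = 0.7 * semantic_score + 0.3 * keyword_score
print(round(combined, 3))  # 0.754
```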
pdf_qa/qa_engine.py ADDED
@@ -0,0 +1,64 @@
+"""
+Question Answering Engine
+
+Simple Q&A using LangChain and OpenAI.
+"""
+
+import os
+from langchain_openai.chat_models import ChatOpenAI
+from langchain_openai.embeddings import OpenAIEmbeddings
+from langchain_community.vectorstores import FAISS
+from langchain_core.runnables import RunnablePassthrough
+from langchain_core.output_parsers import StrOutputParser
+from langchain_core.prompts import ChatPromptTemplate
+
+
+class QAEngine:
+    """Simple question answering engine."""
+
+    def __init__(self, temperature=0):
+        self.temperature = temperature
+        self.retriever = None
+        self.llm = None
+
+    def setup(self, chunks):
+        """Set up the QA chain with document chunks."""
+        if not os.getenv("OPENAI_API_KEY"):
+            raise ValueError("OPENAI_API_KEY not set")
+
+        # Create vector store
+        embeddings = OpenAIEmbeddings()
+        vector_store = FAISS.from_texts(chunks, embeddings)
+
+        # Set up retriever and LLM
+        self.retriever = vector_store.as_retriever(search_kwargs={"k": 3})
+        self.llm = ChatOpenAI(temperature=self.temperature)
+
+    def ask(self, question):
+        """Ask a question about the document."""
+        if not self.retriever or not self.llm:
+            raise ValueError("Please process a PDF first")
+
+        if not question.strip():
+            raise ValueError("Please enter a question")
+
+        # Create prompt template
+        template = """Answer the question based on the following context:
+
+Context: {context}
+
+Question: {question}
+
+Answer:"""
+
+        prompt = ChatPromptTemplate.from_template(template)
+
+        # Create chain
+        chain = (
+            {"context": self.retriever, "question": RunnablePassthrough()}
+            | prompt
+            | self.llm
+            | StrOutputParser()
+        )
+
+        return chain.invoke(question)
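
`QAEngine` is not wired into the Gradio app in this commit, but it works standalone. A minimal sketch, assuming `OPENAI_API_KEY` is set; `setup` expects plain-text chunks because `FAISS.from_texts` takes strings, and the chunks below are toy examples:

```python
# Standalone QAEngine sketch; assumes OPENAI_API_KEY is set.
from pdf_qa.qa_engine import QAEngine

engine = QAEngine(temperature=0)
engine.setup([
    "Invoice #1042 was issued on 2024-03-01.",          # toy chunk
    "The total amount due is 150 EUR within 30 days.",  # toy chunk
])
print(engine.ask("What is the total amount due?"))
```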
requirements.txt ADDED
@@ -0,0 +1,10 @@
+gradio
+langchain
+langchain-openai
+langchain-community
+langchain-core
+langchain-text-splitters
+openai
+faiss-cpu
+pypdf
+python-dotenv
test_simple.py ADDED
@@ -0,0 +1,63 @@
+#!/usr/bin/env python3
+"""
+Simple test for the Document Classification application
+"""
+
+import os
+import sys
+
+# Load .env file only in development (optional)
+try:
+    from dotenv import load_dotenv
+    load_dotenv()
+except ImportError:
+    pass  # dotenv not available in production
+
+def test_imports():
+    """Test that all modules can be imported."""
+    print("🧪 Testing Document Classification Structure")
+    print("=" * 50)
+
+    try:
+        from pdf_qa.pdf_processor import PDFProcessor
+        from pdf_qa.product_classifier import ProductClassifier
+        print("✅ All modules imported successfully")
+
+        # Test initialization (ProductClassifier requires a products dict)
+        pdf_processor = PDFProcessor()
+        classifier = ProductClassifier({})
+        print("✅ Components initialized successfully")
+
+        # Test classification methods
+        test_products = {
+            "test": {
+                "name": "Test Product",
+                "description": "A test product for classification",
+                "keywords": ["test", "product"]
+            }
+        }
+
+        # Test classifier initialization
+        classifier = ProductClassifier(test_products)
+        print("✅ Classification methods available")
+
+        # Check API key
+        if not os.getenv("OPENAI_API_KEY"):
+            print("⚠️ OPENAI_API_KEY not set (expected for testing)")
+        else:
+            print("✅ OPENAI_API_KEY found")
+
+        print("\n🎉 Document classification structure working correctly!")
+        print("\nTo run the app:")
+        print("1. Set OPENAI_API_KEY environment variable")
+        print("2. Run: python app.py")
+
+        return True
+
+    except Exception as e:
+        print(f"❌ Error: {str(e)}")
+        return False
+
+if __name__ == "__main__":
+    success = test_imports()
+    sys.exit(0 if success else 1)