# Legal Dashboard - Scraping & Rating System Documentation

## Overview

The Legal Dashboard Scraping & Rating System is a web scraping and data quality evaluation platform built for legal document processing. It combines multi-strategy scraping, intelligent data rating, and a modern web dashboard for monitoring and control.

## Features
### 🕷️ Advanced Web Scraping

- **Multiple Scraping Strategies**: General, Legal Documents, News Articles, Academic Papers, Government Sites, Custom
- **Async Processing**: High-performance asynchronous scraping with configurable delays
- **Content Extraction**: Intelligent content extraction based on strategy and page structure
- **Error Handling**: Comprehensive error handling and logging
- **Rate Limiting**: Built-in rate limiting to respect website policies (see the sketch below)
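A minimal sketch of the fetch-with-delay pattern behind the async scraping and rate-limiting features. The helper below is illustrative only; the actual ScrapingService layers strategies, retries, and persistence on top of this pattern:

```python
import asyncio
import aiohttp

async def fetch_all(urls: list[str], delay: float = 1.0) -> list[str]:
    """Fetch each URL in turn, pausing `delay` seconds between requests."""
    pages = []
    timeout = aiohttp.ClientTimeout(total=30)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        for url in urls:
            async with session.get(url) as resp:
                pages.append(await resp.text())
            await asyncio.sleep(delay)  # rate limiting between requests
    return pages

# asyncio.run(fetch_all(["https://example.com/page1"], delay=1.0))
```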
### ⭐ Intelligent Data Rating

- **Multi-Criteria Evaluation**: Source credibility, content completeness, OCR accuracy, data freshness, content relevance, technical quality
- **Dynamic Scoring**: Real-time rating updates as data is processed
- **Quality Indicators**: Automatic detection of legal document patterns and quality markers
- **Confidence Scoring**: Statistical confidence levels for rating accuracy

### 📊 Real-Time Dashboard

- **Live Monitoring**: Real-time job progress and system statistics
- **Interactive Charts**: Rating distribution and language analysis
- **Job Management**: Start, monitor, and control scraping jobs
- **Data Visualization**: Comprehensive statistics and analytics

### 🔧 API-First Design

- **RESTful API**: Complete REST API for all operations
- **WebSocket Support**: Real-time updates and notifications
- **Comprehensive Endpoints**: Full CRUD operations for scraping and rating
- **Health Monitoring**: System health checks and status monitoring

## Architecture
```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Frontend      │    │   FastAPI       │    │   Database      │
│   Dashboard     │◄──►│   Backend       │◄──►│   SQLite        │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                               │
                               ▼
                      ┌─────────────────┐
                      │    Services     │
                      │                 │
                      │  • Scraping     │
                      │  • Rating       │
                      │  • OCR          │
                      └─────────────────┘
```
## Installation & Setup

### Prerequisites

- Python 3.8+
- FastAPI
- SQLite3
- Required Python packages (see requirements.txt)

### Quick Start

1. **Clone the repository**:
   ```bash
   git clone <repository-url>
   cd legal_dashboard_ocr
   ```
2. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```
3. **Start the application** (from the repository root):
   ```bash
   uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
   ```
4. **Access the dashboard**:
   ```
   http://localhost:8000/scraping_dashboard.html
   ```
### Docker Deployment

```bash
# Build the Docker image
docker build -t legal-dashboard-scraping .

# Run the container
docker run -p 8000:8000 legal-dashboard-scraping
```

## API Reference

### Scraping Endpoints

#### POST /api/scrape

Start a new scraping job.

**Request Body**:
```json
{
  "urls": ["https://example.com/page1", "https://example.com/page2"],
  "strategy": "legal_documents",
  "keywords": ["contract", "agreement"],
  "content_types": ["html", "pdf"],
  "max_depth": 1,
  "delay_between_requests": 1.0
}
```

**Response**:
```json
{
  "job_id": "scrape_job_20240101_120000_abc123",
  "status": "started",
  "message": "Scraping job started successfully with 2 URLs"
}
```
#### GET /api/scrape/status

Get the status of all scraping jobs.

**Response**:
```json
[
  {
    "job_id": "scrape_job_20240101_120000_abc123",
    "status": "processing",
    "total_items": 2,
    "completed_items": 1,
    "failed_items": 0,
    "progress": 0.5,
    "created_at": "2024-01-01T12:00:00Z",
    "strategy": "legal_documents"
  }
]
```

#### GET /api/scrape/items

Get scraped items with optional filtering.

**Query Parameters**:
- `job_id` (optional): Filter by job ID
- `limit` (default: 100): Maximum items to return
- `offset` (default: 0): Number of items to skip

**Response**:
```json
[
  {
    "id": "item_20240101_120000_def456",
    "url": "https://example.com/page1",
    "title": "Legal Document Title",
    "content": "Extracted content...",
    "metadata": {...},
    "timestamp": "2024-01-01T12:00:00Z",
    "rating_score": 0.85,
    "processing_status": "completed",
    "word_count": 1500,
    "language": "english",
    "domain": "example.com"
  }
]
```
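For example, the `limit` and `offset` parameters support simple pagination when exporting a job's results:

```python
import requests

# Page through a job's scraped items, 100 at a time.
params = {"job_id": "scrape_job_20240101_120000_abc123", "limit": 100, "offset": 0}
while True:
    page = requests.get("http://localhost:8000/api/scrape/items", params=params).json()
    if not page:
        break
    for item in page:
        print(item["url"], item["rating_score"])
    params["offset"] += params["limit"]
```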
### Rating Endpoints

#### POST /api/rating/rate-all

Rate all unrated scraped items.

**Response**:
```json
{
  "total_items": 50,
  "rated_count": 45,
  "failed_count": 5,
  "message": "Rated 45 items, 5 failed"
}
```

#### GET /api/rating/summary

Get a comprehensive rating summary.

**Response**:
```json
{
  "total_rated": 100,
  "average_score": 0.75,
  "score_range": {
    "min": 0.2,
    "max": 0.95
  },
  "average_confidence": 0.82,
  "rating_level_distribution": {
    "excellent": 25,
    "good": 40,
    "average": 25,
    "poor": 10
  },
  "criteria_averages": {
    "source_credibility": 0.8,
    "content_completeness": 0.7,
    "ocr_accuracy": 0.85
  },
  "recent_ratings_24h": 15
}
```

#### GET /api/rating/low-quality

Get items with low quality ratings.

**Query Parameters**:
- `threshold` (default: 0.4): Quality threshold
- `limit` (default: 50): Maximum items to return

**Response**:
```json
{
  "threshold": 0.4,
  "total_items": 10,
  "items": [...]
}
```
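A short example that pulls items below a stricter threshold for manual review:

```python
import requests

# Fetch the 20 weakest items below a 0.3 quality score.
resp = requests.get(
    "http://localhost:8000/api/rating/low-quality",
    params={"threshold": 0.3, "limit": 20},
)
report = resp.json()
print(f"{report['total_items']} items scored below {report['threshold']}")
```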
## Scraping Strategies

### 1. General Strategy
- Extracts all text content from web pages
- Suitable for general web scraping tasks
- Minimal content filtering

### 2. Legal Documents Strategy
- Focuses on legal document content
- Extracts structured legal text
- Identifies legal patterns and terminology
- Optimized for Persian and English legal content

### 3. News Articles Strategy
- Extracts news article content
- Removes navigation and advertising
- Focuses on article body and headlines

### 4. Academic Papers Strategy
- Extracts academic content
- Preserves citations and references
- Maintains document structure

### 5. Government Sites Strategy
- Optimized for government websites
- Extracts official documents and announcements
- Handles government-specific content structures
### 6. Custom Strategy
- User-defined content extraction rules (see the sketch below)
- Configurable selectors and patterns
- Flexible content processing
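As a sketch of what user-defined rules might look like, assuming CSS selectors mapped to output fields (this rule format is illustrative, not the service's actual configuration schema):

```python
from bs4 import BeautifulSoup

# Hypothetical custom-strategy rules: output field -> CSS selector.
CUSTOM_RULES = {
    "title": "h1.document-title",
    "body": "div.legal-text p",
}

def extract_with_rules(html: str, rules: dict[str, str]) -> dict[str, str]:
    """Apply each selector and join the text of the matching elements."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        field: " ".join(el.get_text(strip=True) for el in soup.select(selector))
        for field, selector in rules.items()
    }
```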
## Rating Criteria

Each criterion contributes a fixed weight to the overall score; a sketch of the arithmetic follows these lists.

### Source Credibility (25%)
- Domain authority and reputation
- Government/educational institution status
- HTTPS security
- Official indicators in metadata

### Content Completeness (25%)
- Word count and content length
- Structured content (chapters, sections)
- Legal document patterns
- Quality indicators

### OCR Accuracy (20%)
- Text quality and readability
- Character recognition accuracy
- Sentence structure quality
- Formatting consistency

### Data Freshness (15%)
- Content age and timeliness
- Update frequency
- Historical relevance

### Content Relevance (10%)
- Legal terminology density
- Domain-specific language
- Official language indicators

### Technical Quality (5%)
- Document structure
- Formatting consistency
- Metadata quality
- Content organization
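A minimal sketch of the weighted sum, with weights mirroring the percentages above (the real RatingService also computes a confidence value and quality indicators):

```python
WEIGHTS = {
    "source_credibility": 0.25,
    "content_completeness": 0.25,
    "ocr_accuracy": 0.20,
    "data_freshness": 0.15,
    "content_relevance": 0.10,
    "technical_quality": 0.05,
}

def overall_score(criteria_scores: dict[str, float]) -> float:
    """Weighted sum of per-criterion scores, each in [0, 1]."""
    return sum(WEIGHTS[name] * criteria_scores.get(name, 0.0) for name in WEIGHTS)

score = overall_score({
    "source_credibility": 0.8, "content_completeness": 0.7, "ocr_accuracy": 0.85,
    "data_freshness": 0.6, "content_relevance": 0.9, "technical_quality": 0.75,
})
print(score)  # 0.7625 -> "good" under the default thresholds
```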
## Database Schema

### scraped_items Table

```sql
CREATE TABLE scraped_items (
    id TEXT PRIMARY KEY,
    url TEXT NOT NULL,
    title TEXT,
    content TEXT,
    metadata TEXT,
    timestamp TEXT,
    source_url TEXT,
    rating_score REAL DEFAULT 0.0,
    processing_status TEXT DEFAULT 'pending',
    error_message TEXT,
    strategy_used TEXT,
    content_hash TEXT,
    word_count INTEGER DEFAULT 0,
    language TEXT DEFAULT 'unknown',
    domain TEXT
);
```

### rating_results Table

```sql
CREATE TABLE rating_results (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    item_id TEXT NOT NULL,
    overall_score REAL,
    criteria_scores TEXT,
    rating_level TEXT,
    confidence REAL,
    timestamp TEXT,
    evaluator TEXT,
    notes TEXT,
    FOREIGN KEY (item_id) REFERENCES scraped_items (id)
);
```

### scraping_jobs Table

```sql
CREATE TABLE scraping_jobs (
    job_id TEXT PRIMARY KEY,
    urls TEXT,
    strategy TEXT,
    keywords TEXT,
    content_types TEXT,
    max_depth INTEGER DEFAULT 1,
    delay_between_requests REAL DEFAULT 1.0,
    timeout INTEGER DEFAULT 30,
    created_at TEXT,
    status TEXT DEFAULT 'pending',
    total_items INTEGER DEFAULT 0,
    completed_items INTEGER DEFAULT 0,
    failed_items INTEGER DEFAULT 0
);
```
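A quick sanity check against this schema, joining rated items to their rating results with the standard-library sqlite3 module. The database path matches the scraping configuration shown below; the index is an assumption for faster joins, not part of the shipped schema:

```python
import sqlite3

conn = sqlite3.connect("legal_documents.db")
# Assumed helper index on the foreign key; harmless if it already exists.
conn.execute("CREATE INDEX IF NOT EXISTS idx_rating_item ON rating_results(item_id)")

rows = conn.execute(
    """
    SELECT s.url, r.overall_score, r.rating_level
    FROM scraped_items AS s
    JOIN rating_results AS r ON r.item_id = s.id
    ORDER BY r.overall_score DESC
    LIMIT 10
    """
).fetchall()
for url, score, level in rows:
    print(f"{score:.2f}  {level:<10}  {url}")
conn.close()
```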
## Configuration

### Rating Configuration

```python
from app.services.rating_service import RatingConfig

config = RatingConfig(
    source_credibility_weight=0.25,
    content_completeness_weight=0.25,
    ocr_accuracy_weight=0.20,
    data_freshness_weight=0.15,
    content_relevance_weight=0.10,
    technical_quality_weight=0.05,
    excellent_threshold=0.8,
    good_threshold=0.6,
    average_threshold=0.4,
    poor_threshold=0.2
)
```

### Scraping Configuration

```python
from app.services.scraping_service import ScrapingService

scraping_service = ScrapingService(
    db_path="legal_documents.db",
    max_workers=10,
    timeout=30,
    user_agent="Legal-Dashboard-Scraper/1.0"
)
```
## Usage Examples

### Starting a Scraping Job

```python
import asyncio
from app.services.scraping_service import ScrapingService, ScrapingStrategy

async def scrape_legal_documents():
    service = ScrapingService()
    urls = [
        "https://court.gov.ir/document1",
        "https://justice.gov.ir/document2"
    ]
    job_id = await service.start_scraping_job(
        urls=urls,
        strategy=ScrapingStrategy.LEGAL_DOCUMENTS,
        keywords=["قرارداد", "contract", "agreement"],
        max_depth=1,
        delay=2.0  # seconds between requests
    )
    print(f"Started scraping job: {job_id}")

# Run the scraping job
asyncio.run(scrape_legal_documents())
```
### Rating Scraped Items

```python
import asyncio
from app.services.rating_service import RatingService
from app.services.scraping_service import ScrapingService

async def rate_items():
    rating_service = RatingService()
    scraping_service = ScrapingService()

    # Get scraped items
    items = await scraping_service.get_scraped_items()

    # Rate each item that has not been scored yet
    for item in items:
        if item['rating_score'] == 0.0:  # unrated items
            result = await rating_service.rate_item(item)
            print(f"Rated item {item['id']}: {result.rating_level.value} ({result.overall_score})")

# Run the rating process
asyncio.run(rate_items())
```
### API Integration

```python
import time
import requests

# Start a scraping job
response = requests.post("http://localhost:8000/api/scrape", json={
    "urls": ["https://example.com/legal-doc"],
    "strategy": "legal_documents",
    "max_depth": 1
})
job_id = response.json()["job_id"]

# Monitor job progress via the job-status listing
while True:
    jobs = requests.get("http://localhost:8000/api/scrape/status").json()
    job = next(j for j in jobs if j["job_id"] == job_id)
    if job["status"] == "completed":
        break
    time.sleep(5)

# Get rated items
items_response = requests.get("http://localhost:8000/api/scrape/items")
items = items_response.json()

# Get rating summary
summary_response = requests.get("http://localhost:8000/api/rating/summary")
summary = summary_response.json()
```
## Testing

### Running Tests

```bash
# Run all tests
pytest tests/test_scraping_system.py -v

# Run specific test categories
pytest tests/test_scraping_system.py::TestScrapingService -v
pytest tests/test_scraping_system.py::TestRatingService -v
pytest tests/test_scraping_system.py::TestScrapingAPI -v

# Run with coverage
pytest tests/test_scraping_system.py --cov=app.services --cov-report=html
```

### Test Categories

- **Unit Tests**: Individual component testing
- **Integration Tests**: End-to-end workflow testing
- **API Tests**: REST API endpoint testing
- **Performance Tests**: Load and stress testing
- **Error Handling Tests**: Exception and error scenario testing

## Monitoring & Logging

### Log Levels

- **INFO**: General operational information
- **WARNING**: Non-critical issues and warnings
- **ERROR**: Error conditions and failures
- **DEBUG**: Detailed debugging information

### Key Metrics

- **Scraping Jobs**: Active jobs, completion rates, failure rates
- **Data Quality**: Average ratings, rating distributions, quality trends
- **System Performance**: Response times, throughput, resource usage
- **Error Rates**: Failed requests, parsing errors, rating failures

### Health Checks

```bash
# Check system health
curl http://localhost:8000/api/health

# Check scraping service health
curl http://localhost:8000/api/scrape/statistics

# Check rating service health
curl http://localhost:8000/api/rating/summary
```
## Troubleshooting

### Common Issues

#### 1. Scraping Jobs Not Starting

**Symptoms**: Jobs remain in "pending" status

**Solutions**:
- Check network connectivity
- Verify URL accessibility
- Review rate limiting settings
- Check server logs for errors

#### 2. Low Rating Scores

**Symptoms**: Items consistently getting low ratings

**Solutions**:
- Review content quality and completeness
- Check source credibility settings
- Adjust rating criteria weights
- Verify OCR accuracy for text extraction

#### 3. Database Errors

**Symptoms**: Database connection failures or data corruption

**Solutions**:
- Check database file permissions
- Verify SQLite installation
- Review database schema
- Check for disk space issues

#### 4. Performance Issues

**Symptoms**: Slow response times or high resource usage

**Solutions**:
- Reduce concurrent scraping jobs
- Increase the delay between requests
- Optimize database queries
- Review memory usage patterns

### Debug Mode

Enable debug logging for detailed troubleshooting:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```
### Error Recovery

The system includes automatic error recovery mechanisms:

- **Job Retry**: Failed scraping jobs can be retried (see the sketch below)
- **Data Validation**: Automatic validation of scraped content
- **Graceful Degradation**: System continues operating with partial failures
- **Error Logging**: Comprehensive error logging for analysis
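A hedged sketch of the retry-with-backoff pattern (the helper is illustrative; the actual recovery hooks live inside the services):

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def with_retries(make_attempt, attempts: int = 3, base_delay: float = 2.0):
    """Re-run a failing coroutine factory with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return await make_attempt()
        except Exception as exc:
            logger.warning("Attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
```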
## Security Considerations

### Data Protection

- **Encryption**: Sensitive data encrypted at rest
- **Access Control**: API authentication and authorization
- **Input Validation**: Comprehensive input sanitization
- **Rate Limiting**: Protection against abuse

### Privacy Compliance

- **Data Retention**: Configurable data retention policies
- **User Consent**: Respect for website terms of service
- **Data Minimization**: Only necessary data is collected
- **Right to Deletion**: User data can be deleted on request

### Network Security

- **HTTPS**: All communications encrypted
- **Certificate Validation**: Proper SSL certificate validation
- **Firewall Rules**: Network access controls
- **DDoS Protection**: Rate limiting and traffic filtering
## Performance Optimization

### Scraping Performance

- **Async Processing**: Non-blocking I/O operations
- **Connection Pooling**: Reuse HTTP connections (see the sketch below)
- **Caching**: Cache frequently accessed content
- **Parallel Processing**: Multiple concurrent scraping jobs
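For instance, one pooled aiohttp session shared across a bounded number of concurrent fetches, a sketch of the pattern rather than the service's exact tuning:

```python
import asyncio
import aiohttp

async def fetch_many(urls: list[str], max_concurrent: int = 10) -> list[str]:
    """Share one pooled session; cap concurrency with a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrent)
    connector = aiohttp.TCPConnector(limit=max_concurrent)  # connection pool size
    async with aiohttp.ClientSession(connector=connector) as session:

        async def fetch(url: str) -> str:
            async with semaphore:
                async with session.get(url) as resp:
                    return await resp.text()

        return list(await asyncio.gather(*(fetch(u) for u in urls)))
```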
### Database Performance

- **Indexing**: Optimized database indexes
- **Query Optimization**: Efficient SQL queries
- **Connection Pooling**: Database connection management
- **Data Archiving**: Automatic archiving of old data

### Memory Management

- **Streaming**: Process large datasets in chunks
- **Garbage Collection**: Proper memory cleanup
- **Resource Limits**: Configurable memory limits
- **Monitoring**: Real-time memory usage tracking

## Future Enhancements

### Planned Features

- **Machine Learning**: Advanced content classification
- **Natural Language Processing**: Enhanced text analysis
- **Multi-language Support**: Additional language support
- **Cloud Integration**: Cloud storage and processing
- **Advanced Analytics**: Detailed analytics and reporting

### Scalability Improvements

- **Microservices Architecture**: Service decomposition
- **Load Balancing**: Distributed processing
- **Caching Layer**: Redis integration
- **Message Queues**: Asynchronous processing

## Support & Contributing

### Getting Help

- **Documentation**: Comprehensive documentation and examples
- **Community**: Active community support
- **Issues**: GitHub issue tracking
- **Discussions**: Community discussions and Q&A

### Contributing

- **Code Standards**: Follow PEP 8 and project guidelines
- **Testing**: Include comprehensive tests
- **Documentation**: Update documentation for changes
- **Review Process**: Code review and approval process

### License

This project is licensed under the MIT License. See the LICENSE file for details.

---

**Note**: This documentation is continuously updated. For the latest version, please check the project repository.