# Legal Dashboard - Scraping & Rating System Documentation
## Overview
The Legal Dashboard Scraping & Rating System is a comprehensive web scraping and data quality evaluation platform designed specifically for legal document processing. The system provides advanced scraping capabilities with multiple strategies, intelligent data rating, and a modern web dashboard for monitoring and control.
## Features
### 🕷️ Advanced Web Scraping
- **Multiple Scraping Strategies**: General, Legal Documents, News Articles, Academic Papers, Government Sites, Custom
- **Async Processing**: High-performance asynchronous scraping with configurable delays
- **Content Extraction**: Intelligent content extraction based on strategy and page structure
- **Error Handling**: Comprehensive error handling and logging
- **Rate Limiting**: Built-in rate limiting to respect website policies (see the sketch after this list)
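
The service's internal implementation is not shown in this document, but the async-with-delay pattern it describes looks roughly like this minimal sketch, assuming an `aiohttp`-style client (an assumption, not confirmed here):

```python
import asyncio
import aiohttp

async def fetch_with_delay(urls, delay=1.0):
    """Fetch pages sequentially, sleeping `delay` seconds between requests."""
    pages = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
                pages.append(await resp.text())
            await asyncio.sleep(delay)  # built-in rate limiting between requests
    return pages

# asyncio.run(fetch_with_delay(["https://example.com/a", "https://example.com/b"]))
```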
### ⭐ Intelligent Data Rating
- **Multi-Criteria Evaluation**: Source credibility, content completeness, OCR accuracy, data freshness, content relevance, technical quality
- **Dynamic Scoring**: Real-time rating updates as data is processed
- **Quality Indicators**: Automatic detection of legal document patterns and quality markers
- **Confidence Scoring**: Statistical confidence levels for rating accuracy
### 📊 Real-Time Dashboard
- **Live Monitoring**: Real-time job progress and system statistics
- **Interactive Charts**: Rating distribution and language analysis
- **Job Management**: Start, monitor, and control scraping jobs
- **Data Visualization**: Comprehensive statistics and analytics
### 🔧 API-First Design
- **RESTful API**: Complete REST API for all operations
- **WebSocket Support**: Real-time updates and notifications (client sketch after this list)
- **Comprehensive Endpoints**: Full CRUD operations for scraping and rating
- **Health Monitoring**: System health checks and status monitoring
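
A minimal client sketch for the WebSocket channel; the `/ws` path and message shape are hypothetical examples for illustration, not confirmed by this document:

```python
import asyncio
import json
import websockets  # third-party: pip install websockets

async def listen_for_updates():
    # NOTE: "ws://localhost:8000/ws" is an assumed route; substitute the
    # WebSocket route actually exposed by the FastAPI backend.
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        while True:
            message = json.loads(await ws.recv())
            print(f"Update received: {message}")

# asyncio.run(listen_for_updates())
```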
## Architecture
```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Frontend     │    │     FastAPI     │    │    Database     │
│    Dashboard    │◄──►│     Backend     │◄──►│     SQLite      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐
                       │    Services     │
                       │                 │
                       │  • Scraping     │
                       │  • Rating       │
                       │  • OCR          │
                       └─────────────────┘
```
## Installation & Setup
### Prerequisites
- Python 3.8+
- FastAPI
- SQLite3
- Required Python packages (see requirements.txt)
### Quick Start
1. **Clone the repository**:
```bash
git clone <repository-url>
cd legal_dashboard_ocr
```
2. **Install dependencies**:
```bash
pip install -r requirements.txt
```
3. **Start the application**:
```bash
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```
4. **Access the dashboard**:
```
http://localhost:8000/scraping_dashboard.html
```
### Docker Deployment
```bash
# Build the Docker image
docker build -t legal-dashboard-scraping .
# Run the container
docker run -p 8000:8000 legal-dashboard-scraping
```
## API Reference
### Scraping Endpoints
#### POST /api/scrape
Start a new scraping job.
**Request Body**:
```json
{
  "urls": ["https://example.com/page1", "https://example.com/page2"],
  "strategy": "legal_documents",
  "keywords": ["contract", "agreement"],
  "content_types": ["html", "pdf"],
  "max_depth": 1,
  "delay_between_requests": 1.0
}
```
**Response**:
```json
{
  "job_id": "scrape_job_20240101_120000_abc123",
  "status": "started",
  "message": "Scraping job started successfully with 2 URLs"
}
```
#### GET /api/scrape/status
Get status of all scraping jobs.
**Response**:
```json
[
  {
    "job_id": "scrape_job_20240101_120000_abc123",
    "status": "processing",
    "total_items": 2,
    "completed_items": 1,
    "failed_items": 0,
    "progress": 0.5,
    "created_at": "2024-01-01T12:00:00Z",
    "strategy": "legal_documents"
  }
]
```
#### GET /api/scrape/items
Get scraped items with optional filtering.
**Query Parameters**:
- `job_id` (optional): Filter by job ID
- `limit` (default: 100): Maximum items to return
- `offset` (default: 0): Number of items to skip
**Response**:
```json
[
  {
    "id": "item_20240101_120000_def456",
    "url": "https://example.com/page1",
    "title": "Legal Document Title",
    "content": "Extracted content...",
    "metadata": {...},
    "timestamp": "2024-01-01T12:00:00Z",
    "rating_score": 0.85,
    "processing_status": "completed",
    "word_count": 1500,
    "language": "english",
    "domain": "example.com"
  }
]
```
### Rating Endpoints
#### POST /api/rating/rate-all
Rate all unrated scraped items.
**Response**:
```json
{
  "total_items": 50,
  "rated_count": 45,
  "failed_count": 5,
  "message": "Rated 45 items, 5 failed"
}
```
#### GET /api/rating/summary
Get comprehensive rating summary.
**Response**:
```json
{
  "total_rated": 100,
  "average_score": 0.75,
  "score_range": {
    "min": 0.2,
    "max": 0.95
  },
  "average_confidence": 0.82,
  "rating_level_distribution": {
    "excellent": 25,
    "good": 40,
    "average": 25,
    "poor": 10
  },
  "criteria_averages": {
    "source_credibility": 0.8,
    "content_completeness": 0.7,
    "ocr_accuracy": 0.85
  },
  "recent_ratings_24h": 15
}
```
#### GET /api/rating/low-quality
Get items with low quality ratings.
**Query Parameters**:
- `threshold` (default: 0.4): Quality threshold
- `limit` (default: 50): Maximum items to return
**Response**:
```json
{
  "threshold": 0.4,
  "total_items": 10,
  "items": [...]
}
```
## Scraping Strategies
### 1. General Strategy
- Extracts all text content from web pages
- Suitable for general web scraping tasks
- Minimal content filtering
### 2. Legal Documents Strategy
- Focuses on legal document content
- Extracts structured legal text
- Identifies legal patterns and terminology
- Optimized for Persian and English legal content
### 3. News Articles Strategy
- Extracts news article content
- Removes navigation and advertising
- Focuses on article body and headlines
### 4. Academic Papers Strategy
- Extracts academic content
- Preserves citations and references
- Maintains document structure
### 5. Government Sites Strategy
- Optimized for government websites
- Extracts official documents and announcements
- Handles government-specific content structures
### 6. Custom Strategy
- User-defined content extraction rules
- Configurable selectors and patterns
- Flexible content processing
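
In code, these strategies are exposed through the `ScrapingStrategy` enum used in the usage examples below. As a sketch, the member names other than `LEGAL_DOCUMENTS` (which appears later in this document) are assumptions inferred from the list above:

```python
from enum import Enum

class ScrapingStrategy(str, Enum):
    # Values match the "strategy" strings accepted by POST /api/scrape,
    # e.g. "legal_documents". Members other than LEGAL_DOCUMENTS are assumed.
    GENERAL = "general"
    LEGAL_DOCUMENTS = "legal_documents"
    NEWS_ARTICLES = "news_articles"
    ACADEMIC_PAPERS = "academic_papers"
    GOVERNMENT_SITES = "government_sites"
    CUSTOM = "custom"
```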
## Rating Criteria
### Source Credibility (25%)
- Domain authority and reputation
- Government/educational institution status
- HTTPS security
- Official indicators in metadata
### Content Completeness (25%)
- Word count and content length
- Structured content (chapters, sections)
- Legal document patterns
- Quality indicators
### OCR Accuracy (20%)
- Text quality and readability
- Character recognition accuracy
- Sentence structure quality
- Formatting consistency
### Data Freshness (15%)
- Content age and timeliness
- Update frequency
- Historical relevance
### Content Relevance (10%)
- Legal terminology density
- Domain-specific language
- Official language indicators
### Technical Quality (5%)
- Document structure
- Formatting consistency
- Metadata quality
- Content organization
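
To make the weighting concrete: the overall score is the weighted sum of the six criteria scores. A worked example with hypothetical per-criterion values (key names mirror the `criteria_averages` field in the rating summary):

```python
weights = {
    "source_credibility": 0.25, "content_completeness": 0.25,
    "ocr_accuracy": 0.20, "data_freshness": 0.15,
    "content_relevance": 0.10, "technical_quality": 0.05,
}  # weights sum to 1.0

# Hypothetical criteria scores for a single item, each in [0, 1]
scores = {
    "source_credibility": 0.80, "content_completeness": 0.70,
    "ocr_accuracy": 0.85, "data_freshness": 0.60,
    "content_relevance": 0.90, "technical_quality": 0.75,
}

overall = sum(scores[k] * weights[k] for k in weights)
print(overall)  # ≈ 0.7625 -> "good" (between the 0.6 and 0.8 thresholds)
```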
## Database Schema
### scraped_items Table
```sql
CREATE TABLE scraped_items (
    id TEXT PRIMARY KEY,
    url TEXT NOT NULL,
    title TEXT,
    content TEXT,
    metadata TEXT,
    timestamp TEXT,
    source_url TEXT,
    rating_score REAL DEFAULT 0.0,
    processing_status TEXT DEFAULT 'pending',
    error_message TEXT,
    strategy_used TEXT,
    content_hash TEXT,
    word_count INTEGER DEFAULT 0,
    language TEXT DEFAULT 'unknown',
    domain TEXT
);
```
### rating_results Table
```sql
CREATE TABLE rating_results (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    item_id TEXT NOT NULL,
    overall_score REAL,
    criteria_scores TEXT,
    rating_level TEXT,
    confidence REAL,
    timestamp TEXT,
    evaluator TEXT,
    notes TEXT,
    FOREIGN KEY (item_id) REFERENCES scraped_items (id)
);
```
### scraping_jobs Table
```sql
CREATE TABLE scraping_jobs (
    job_id TEXT PRIMARY KEY,
    urls TEXT,
    strategy TEXT,
    keywords TEXT,
    content_types TEXT,
    max_depth INTEGER DEFAULT 1,
    delay_between_requests REAL DEFAULT 1.0,
    timeout INTEGER DEFAULT 30,
    created_at TEXT,
    status TEXT DEFAULT 'pending',
    total_items INTEGER DEFAULT 0,
    completed_items INTEGER DEFAULT 0,
    failed_items INTEGER DEFAULT 0
);
```
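
The `item_id` foreign key ties ratings back to scraped items, so low-quality items can be pulled straight from the database as well as via the API. A sketch using the default database path from the configuration section below:

```python
import sqlite3

conn = sqlite3.connect("legal_documents.db")
rows = conn.execute(
    """
    SELECT s.id, s.url, s.rating_score, r.rating_level, r.confidence
    FROM scraped_items AS s
    JOIN rating_results AS r ON r.item_id = s.id
    WHERE s.rating_score < 0.4  -- same default threshold as /api/rating/low-quality
    ORDER BY s.rating_score ASC
    LIMIT 50
    """
).fetchall()
conn.close()
```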
## Configuration
### Rating Configuration
```python
from app.services.rating_service import RatingConfig

config = RatingConfig(
    source_credibility_weight=0.25,
    content_completeness_weight=0.25,
    ocr_accuracy_weight=0.20,
    data_freshness_weight=0.15,
    content_relevance_weight=0.10,
    technical_quality_weight=0.05,
    excellent_threshold=0.8,
    good_threshold=0.6,
    average_threshold=0.4,
    poor_threshold=0.2
)
```
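The six weights mirror the percentages in the Rating Criteria section and sum to 1.0, while the thresholds map an item's overall score to its rating level (for example, scores at or above the 0.8 threshold rate as `excellent`).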
### Scraping Configuration
```python
from app.services.scraping_service import ScrapingService

scraping_service = ScrapingService(
    db_path="legal_documents.db",
    max_workers=10,
    timeout=30,
    user_agent="Legal-Dashboard-Scraper/1.0"
)
```
## Usage Examples
### Starting a Scraping Job
```python
import asyncio
from app.services.scraping_service import ScrapingService, ScrapingStrategy

async def scrape_legal_documents():
    service = ScrapingService()
    urls = [
        "https://court.gov.ir/document1",
        "https://justice.gov.ir/document2"
    ]
    job_id = await service.start_scraping_job(
        urls=urls,
        strategy=ScrapingStrategy.LEGAL_DOCUMENTS,
        keywords=["قرارداد", "contract", "agreement"],
        max_depth=1,
        delay=2.0
    )
    print(f"Started scraping job: {job_id}")

# Run the scraping job
asyncio.run(scrape_legal_documents())
```
### Rating Scraped Items
```python
import asyncio
from app.services.rating_service import RatingService
from app.services.scraping_service import ScrapingService

async def rate_items():
    rating_service = RatingService()
    scraping_service = ScrapingService()
    # Get scraped items
    items = await scraping_service.get_scraped_items()
    # Rate each unrated item
    for item in items:
        if item['rating_score'] == 0.0:  # unrated items
            result = await rating_service.rate_item(item)
            print(f"Rated item {item['id']}: {result.rating_level.value} ({result.overall_score})")

# Run the rating process
asyncio.run(rate_items())
```
### API Integration
```python
import time

import requests

# Start a scraping job
response = requests.post("http://localhost:8000/api/scrape", json={
    "urls": ["https://example.com/legal-doc"],
    "strategy": "legal_documents",
    "max_depth": 1
})
job_id = response.json()["job_id"]

# Monitor job progress
while True:
    status_response = requests.get(f"http://localhost:8000/api/scrape/status/{job_id}")
    status = status_response.json()
    if status["status"] == "completed":
        break
    time.sleep(5)

# Get rated items
items_response = requests.get("http://localhost:8000/api/scrape/items")
items = items_response.json()

# Get rating summary
summary_response = requests.get("http://localhost:8000/api/rating/summary")
summary = summary_response.json()
```
## Testing
### Running Tests
```bash
# Run all tests
pytest tests/test_scraping_system.py -v
# Run specific test categories
pytest tests/test_scraping_system.py::TestScrapingService -v
pytest tests/test_scraping_system.py::TestRatingService -v
pytest tests/test_scraping_system.py::TestScrapingAPI -v
# Run with coverage
pytest tests/test_scraping_system.py --cov=app.services --cov-report=html
```
### Test Categories
- **Unit Tests**: Individual component testing
- **Integration Tests**: End-to-end workflow testing
- **API Tests**: REST API endpoint testing
- **Performance Tests**: Load and stress testing
- **Error Handling Tests**: Exception and error scenario testing
## Monitoring & Logging
### Log Levels
- **INFO**: General operational information
- **WARNING**: Non-critical issues and warnings
- **ERROR**: Error conditions and failures
- **DEBUG**: Detailed debugging information
### Key Metrics
- **Scraping Jobs**: Active jobs, completion rates, failure rates
- **Data Quality**: Average ratings, rating distributions, quality trends
- **System Performance**: Response times, throughput, resource usage
- **Error Rates**: Failed requests, parsing errors, rating failures
### Health Checks
```bash
# Check system health
curl http://localhost:8000/api/health
# Check scraping service health
curl http://localhost:8000/api/scrape/statistics
# Check rating service health
curl http://localhost:8000/api/rating/summary
```
## Troubleshooting
### Common Issues
#### 1. Scraping Jobs Not Starting
**Symptoms**: Jobs remain in "pending" status
**Solutions**:
- Check network connectivity
- Verify URL accessibility
- Review rate limiting settings
- Check server logs for errors
#### 2. Low Rating Scores
**Symptoms**: Items consistently getting low ratings
**Solutions**:
- Review content quality and completeness
- Check source credibility settings
- Adjust rating criteria weights
- Verify OCR accuracy for text extraction
#### 3. Database Errors
**Symptoms**: Database connection failures or data corruption
**Solutions**:
- Check database file permissions
- Verify SQLite installation
- Review database schema
- Check for disk space issues
#### 4. Performance Issues
**Symptoms**: Slow response times or high resource usage
**Solutions**:
- Reduce concurrent scraping jobs
- Increase delay between requests
- Optimize database queries
- Review memory usage patterns
### Debug Mode
Enable debug logging for detailed troubleshooting:
```python
import logging
logging.basicConfig(level=logging.DEBUG)
```
### Error Recovery
The system includes automatic error recovery mechanisms:
- **Job Retry**: Failed scraping jobs can be retried (see the sketch after this list)
- **Data Validation**: Automatic validation of scraped content
- **Graceful Degradation**: System continues operating with partial failures
- **Error Logging**: Comprehensive error logging for analysis
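
The retry logic lives inside the service layer and is not shown in this document; purely as an illustration of the pattern, a generic retry-with-backoff loop for a single URL might look like this:

```python
import time
from typing import Optional

import requests

def fetch_with_retry(url: str, max_retries: int = 3, backoff: float = 2.0) -> Optional[str]:
    """Illustrative only: retry a request, waiting backoff**attempt seconds between tries."""
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException as exc:
            print(f"Attempt {attempt + 1} failed for {url}: {exc}")  # error-logging hook
            time.sleep(backoff ** attempt)
    return None  # graceful degradation: the caller records the item as failed
```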
## Security Considerations
### Data Protection
- **Encryption**: Sensitive data encrypted at rest
- **Access Control**: API authentication and authorization
- **Input Validation**: Comprehensive input sanitization
- **Rate Limiting**: Protection against abuse
### Privacy Compliance
- **Data Retention**: Configurable data retention policies
- **User Consent**: Respect for website terms of service
- **Data Minimization**: Only necessary data is collected
- **Right to Deletion**: User data can be deleted on request
### Network Security
- **HTTPS**: All communications encrypted
- **Certificate Validation**: Proper SSL certificate validation
- **Firewall Rules**: Network access controls
- **DDoS Protection**: Rate limiting and traffic filtering
## Performance Optimization
### Scraping Performance
- **Async Processing**: Non-blocking I/O operations
- **Connection Pooling**: Reuse HTTP connections (see the sketch after this list)
- **Caching**: Cache frequently accessed content
- **Parallel Processing**: Multiple concurrent scraping jobs
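
A sketch of the pooling-plus-parallelism pattern, assuming an `aiohttp`-style client (the document does not name the HTTP library):

```python
import asyncio
import aiohttp

async def fetch_all(urls):
    # One shared ClientSession means one connection pool: connections to the
    # same host stay open and are reused instead of being re-opened per request.
    async with aiohttp.ClientSession() as session:
        async def fetch(url):
            async with session.get(url) as resp:
                return await resp.text()
        # Parallel processing: all requests run concurrently. In practice the
        # rate-limiting delay described earlier would cap this concurrency.
        return await asyncio.gather(*(fetch(u) for u in urls))

# pages = asyncio.run(fetch_all(["https://example.com/a", "https://example.com/b"]))
```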
### Database Performance
- **Indexing**: Optimized database indexes
- **Query Optimization**: Efficient SQL queries
- **Connection Pooling**: Database connection management
- **Data Archiving**: Automatic archiving of old data
### Memory Management
- **Streaming**: Process large datasets in chunks (see the sketch after this list)
- **Garbage Collection**: Proper memory cleanup
- **Resource Limits**: Configurable memory limits
- **Monitoring**: Real-time memory usage tracking
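
As a sketch of chunked processing against the schema above (the helper name and batch size are illustrative), iterating scraped items in fixed-size batches keeps memory usage flat regardless of table size:

```python
import sqlite3

def iter_item_batches(db_path="legal_documents.db", chunk_size=500):
    """Yield scraped items in fixed-size batches instead of loading them all at once."""
    conn = sqlite3.connect(db_path)
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT id, url, content FROM scraped_items LIMIT ? OFFSET ?",
            (chunk_size, offset),
        ).fetchall()
        if not rows:
            break
        yield rows
        offset += chunk_size
    conn.close()
```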
## Future Enhancements
### Planned Features
- **Machine Learning**: Advanced content classification
- **Natural Language Processing**: Enhanced text analysis
- **Multi-language Support**: Additional language support
- **Cloud Integration**: Cloud storage and processing
- **Advanced Analytics**: Detailed analytics and reporting
### Scalability Improvements
- **Microservices Architecture**: Service decomposition
- **Load Balancing**: Distributed processing
- **Caching Layer**: Redis integration
- **Message Queues**: Asynchronous processing
## Support & Contributing
### Getting Help
- **Documentation**: Comprehensive documentation and examples
- **Community**: Active community support
- **Issues**: GitHub issue tracking
- **Discussions**: Community discussions and Q&A
### Contributing
- **Code Standards**: Follow PEP 8 and project guidelines
- **Testing**: Include comprehensive tests
- **Documentation**: Update documentation for changes
- **Review Process**: Code review and approval process
### License
This project is licensed under the MIT License. See LICENSE file for details.
---
**Note**: This documentation is continuously updated. For the latest version, please check the project repository.