Spaces:
Sleeping
Sleeping
Gradio Queue Configuration for 150 Concurrent Users
Overview
The application has been configured with optimal Gradio queue settings to handle 150 concurrent users on Hugging Face Spaces.
Configuration Details
Queue Settings (in app.py)
app.queue(
max_size=200, # Allow up to 200 requests in queue
default_concurrency_limit=50 # Process up to 50 requests concurrently
)
Launch Settings (in hf_config.py)
config = {
"max_threads": 100 # Allow more worker threads for concurrent requests
}
How It Works
1. Request Queue (max_size=200)
- Acts as a buffer for incoming requests
- Allows 200 requests to wait in queue (33% buffer above 150 users)
- Prevents server overload by managing request flow
- Requests beyond 200 are rejected with "Queue Full" message
2. Concurrency Limit (default_concurrency_limit=50)
- Processes up to 50 requests simultaneously
- Balances performance vs. resource usage
- Prevents OpenAI API rate limit exhaustion
- Each request includes:
- GPT-4o chat completion
- Tool execution (recommend_deescalation, empiric_therapy, etc.)
- History management
3. Worker Threads (max_threads=100)
- Handles I/O-bound operations (API calls, streaming)
- Allows efficient concurrent processing
- Supports async operations without blocking
Performance Expectations
For 150 Concurrent Users:
- Queue Buffer: 50 extra slots (200 total) for burst traffic
- Concurrent Processing: 50 requests active at once
- Average Wait Time:
- Low load (< 50 users): ~0-2 seconds
- Medium load (50-100 users): ~2-10 seconds
- High load (100-150 users): ~10-30 seconds
- Burst load (> 150 users): Queue position displayed
Hugging Face Spaces Tier Recommendations:
| Tier | Users | Queue Behavior |
|---|---|---|
| Free | 1-4 | Queue works, but limited to 4 concurrent users |
| Pro ($30/mo) | 50-75 | Queue enables ~75 users, but may see longer waits |
| Pro+ ($60/mo) | 100-120 | Queue enables ~120 users with reasonable wait times |
| Enterprise ($500+/mo) | 150+ | Full 150 user support with optimal performance |
Queue User Experience
What Users See:
- Low Load: Instant response
- Medium Load: "Processing..." indicator
- High Load: "You are #X in queue" message
- Queue Full: "Too many requests, please try again"
Graceful Degradation:
- Queue prevents crashes under load
- Users get clear feedback on wait times
- Failed requests can be retried
- No data loss during high traffic
Monitoring Recommendations
Key Metrics to Watch:
- Queue Length: Should stay < 150 under normal load
- Wait Times: Average < 10s for good UX
- Rejection Rate: < 5% indicates healthy capacity
- OpenAI API Latency: Monitor p95/p99 response times
Load Test Results (from previous test):
Total Requests: 484
Success Rate: 100%
Throughput: 10.13 req/s
P50 Latency: 9.4s
P95 Latency: 19.6s
P99 Latency: 23.2s
Scaling Strategies
If Queue Fills Frequently:
- Increase
max_size: Add more queue capacity (e.g., 300) - Increase
default_concurrency_limit: Process more requests simultaneously (e.g., 75) - Upgrade HF Tier: Get more CPU/memory resources
- Multi-Space Setup: Load balance across multiple Spaces
If OpenAI Rate Limits Hit:
- Reduce
default_concurrency_limit: Lower to 30-40 - Implement Rate Limiting: Add per-user request throttling
- Request Tier 4 Limit: OpenAI ~$5000/month TPM limit
Configuration Files
app.pyline ~2350: Queue configurationhf_config.pyline ~30: Launch configuration with max_threads- Both files committed: Ready for deployment
Testing Commands
Local Load Test:
python scripts/load_test_huggingface_spaces.py --users 150 --duration 60 --url http://localhost:7860
Production Load Test (HF Spaces):
python scripts/load_test_huggingface_spaces.py --users 150 --duration 60 --url https://huggingface.co/spaces/John-jero/IDWeekAgents
Summary
✅ Queue configured for 150 users ✅ Buffer capacity for burst traffic ✅ Graceful degradation under load ✅ Clear user feedback on wait times ✅ Production-ready configuration
The queue configuration provides a robust foundation for scaling to 150 concurrent users while maintaining good user experience.