# Gradio Queue Configuration for 150 Concurrent Users

## Overview

The application has been configured with optimal Gradio queue settings to handle 150 concurrent users on Hugging Face Spaces.

## Configuration Details

### Queue Settings (in `app.py`)

```python
app.queue(
    max_size=200,                  # Allow up to 200 requests in queue
    default_concurrency_limit=50   # Process up to 50 requests concurrently
)
```

### Launch Settings (in `hf_config.py`)

```python
config = {
    "max_threads": 100  # Allow more worker threads for concurrent requests
}
```
## How It Works

### 1. **Request Queue (`max_size=200`)**

- Acts as a buffer for incoming requests
- Allows up to 200 requests to wait in the queue (a 33% buffer above 150 users)
- Prevents server overload by managing request flow
- Requests beyond 200 are rejected with a "Queue Full" message

### 2. **Concurrency Limit (`default_concurrency_limit=50`)**

- Processes up to 50 requests simultaneously
- Balances performance against resource usage
- Prevents OpenAI API rate limit exhaustion
- Each request includes:
  - GPT-4o chat completion
  - Tool execution (`recommend_deescalation`, `empiric_therapy`, etc.)
  - History management
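The limit of 50 applies queue-wide, but individual events can be throttled more tightly. A minimal sketch (illustrative, not the app's actual code) of how a Gradio 4.x event listener can override the queue-wide default:

```python
import gradio as gr

def heavy_fn(text):
    # Placeholder for an expensive handler (e.g., a GPT-4o call)
    return text.upper()

with gr.Blocks() as demo:
    box = gr.Textbox(label="Input")
    out = gr.Textbox(label="Output")
    btn = gr.Button("Run")
    # Per-event override: at most 10 of these run at once,
    # regardless of the queue-wide default of 50.
    btn.click(heavy_fn, inputs=box, outputs=out, concurrency_limit=10)

demo.queue(max_size=200, default_concurrency_limit=50)
```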
### 3. **Worker Threads (`max_threads=100`)**

- Handles I/O-bound operations (API calls, streaming)
- Allows efficient concurrent processing
- Supports async operations without blocking
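A minimal sketch of how the `hf_config.py` dictionary is assumed to reach `launch()` (the exact wiring in `app.py` may differ):

```python
import gradio as gr

def echo(text):
    return text  # placeholder handler

with gr.Blocks() as app:
    box = gr.Textbox()
    box.submit(echo, inputs=box, outputs=box)

# max_threads sizes the worker pool Gradio uses for sync handlers and
# other I/O-bound work; 100 threads comfortably covers 50 concurrent
# requests plus streaming overhead.
config = {"max_threads": 100}

app.queue(max_size=200, default_concurrency_limit=50)
app.launch(**config)
```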
## Performance Expectations

### For 150 Concurrent Users:

- **Queue Buffer**: 50 extra slots (200 total) for burst traffic
- **Concurrent Processing**: 50 requests active at once
- **Average Wait Time**:
  - Low load (< 50 users): ~0-2 seconds
  - Medium load (50-100 users): ~2-10 seconds
  - High load (100-150 users): ~10-30 seconds
  - Burst load (> 150 users): queue position displayed
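These estimates follow from a simple batch-drain model: with 50 requests processed at a time, a request at queue position *p* waits roughly `ceil(p / 50)` average request times. A back-of-envelope helper (the 9.4 s default is the P50 latency from the load test below):

```python
import math

def estimated_wait(queue_position: int,
                   concurrency: int = 50,
                   avg_request_s: float = 9.4) -> float:
    """Rough batch-drain estimate: the queue empties in batches of
    `concurrency`, each batch taking about one average request time."""
    batches_ahead = math.ceil(queue_position / concurrency)
    return batches_ahead * avg_request_s

# 150 users arriving at once: 50 start immediately, ~100 queue up.
print(estimated_wait(100))  # 18.8 -> consistent with the 10-30 s band
```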
### Hugging Face Spaces Tier Recommendations:

| Tier | Users | Queue Behavior |
|------|-------|----------------|
| **Free** | 1-4 | Queue works, but limited to 4 concurrent users |
| **Pro ($30/mo)** | 50-75 | Queue enables ~75 users, but may see longer waits |
| **Pro+ ($60/mo)** | 100-120 | Queue enables ~120 users with reasonable wait times |
| **Enterprise ($500+/mo)** | 150+ | Full 150-user support with optimal performance |

## Queue User Experience

### What Users See:

1. **Low Load**: Instant response
2. **Medium Load**: "Processing..." indicator
3. **High Load**: "You are #X in queue" message
4. **Queue Full**: "Too many requests, please try again"

### Graceful Degradation:

- Queue prevents crashes under load
- Users get clear feedback on wait times
- Failed requests can be retried (see the retry sketch below)
- No data loss during high traffic
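A hypothetical client-side retry helper (not part of the app; `call` stands in for whatever function issues the request) showing one way rejected requests can be retried with exponential backoff:

```python
import random
import time

def retry_with_backoff(call, max_attempts: int = 4, base_delay: float = 2.0):
    """Retry a request that failed transiently (e.g., a 'Queue Full'
    rejection), doubling the delay each attempt and adding jitter so
    retries from many clients don't arrive in lockstep."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
```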
## Monitoring Recommendations

### Key Metrics to Watch:

1. **Queue Length**: Should stay < 150 under normal load
2. **Wait Times**: Average < 10 s for good UX
3. **Rejection Rate**: < 5% indicates healthy capacity
4. **OpenAI API Latency**: Monitor P95/P99 response times
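A small helper (illustrative; assumes latencies are collected from the app's own timing logs) for computing the P50/P95/P99 figures referenced above:

```python
import statistics

def latency_percentiles(latencies_s: list[float]) -> dict[str, float]:
    """Compute P50/P95/P99 from recorded request latencies (seconds).
    quantiles(n=100) returns the 99 percentile cut points, so index
    k holds the (k+1)-th percentile."""
    qs = statistics.quantiles(latencies_s, n=100, method="inclusive")
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

print(latency_percentiles([9.4, 8.1, 12.0, 19.6, 23.2, 7.5, 10.2]))
```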
### Load Test Results (from a previous test):

```
Total Requests: 484
Success Rate: 100%
Throughput: 10.13 req/s
P50 Latency: 9.4s
P95 Latency: 19.6s
P99 Latency: 23.2s
```
## Scaling Strategies

### If Queue Fills Frequently:

1. **Increase `max_size`**: Add more queue capacity (e.g., 300)
2. **Increase `default_concurrency_limit`**: Process more requests simultaneously (e.g., 75; see the sketch after this list)
3. **Upgrade HF Tier**: Get more CPU/memory resources
4. **Multi-Space Setup**: Load balance across multiple Spaces
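Steps 1 and 2 as a concrete configuration sketch; the values are the examples from the list above, not tested settings:

```python
import gradio as gr

app = gr.Blocks()  # stand-in for the real app object in app.py

# Scaled-up settings from steps 1-2 above; tune against real traffic.
app.queue(
    max_size=300,                  # larger buffer for burst traffic
    default_concurrency_limit=75   # more requests processed at once
)
```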
### If OpenAI Rate Limits Hit:

1. **Reduce `default_concurrency_limit`**: Lower it to 30-40
2. **Implement Rate Limiting**: Add per-user request throttling (see the sketch after this list)
3. **Request a Tier 4 Limit**: Higher OpenAI usage tiers raise the TPM ceiling
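One way to implement step 2: a hypothetical sliding-window throttle keyed by a per-user/session identifier (the `PerUserThrottle` name and the 5 requests/minute budget are illustrative choices, not the app's code):

```python
import time
from collections import defaultdict, deque

class PerUserThrottle:
    """Sliding-window limiter: allow at most `limit` requests per
    `window` seconds for each user/session id."""

    def __init__(self, limit: int = 5, window: float = 60.0):
        self.limit = limit
        self.window = window
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = time.monotonic()
        hits = self._hits[user_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] > self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the per-user budget; reject or defer
        hits.append(now)
        return True

throttle = PerUserThrottle()
# Inside a handler, keyed by e.g. a Gradio session hash:
assert throttle.allow("session-abc")
```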
## Configuration Files

- **`app.py` line ~2350**: Queue configuration
- **`hf_config.py` line ~30**: Launch configuration with `max_threads`
- **Both files committed**: Ready for deployment

## Testing Commands

### Local Load Test:

```bash
python scripts/load_test_huggingface_spaces.py --users 150 --duration 60 --url http://localhost:7860
```

### Production Load Test (HF Spaces):

```bash
python scripts/load_test_huggingface_spaces.py --users 150 --duration 60 --url https://huggingface.co/spaces/John-jero/IDWeekAgents
```
## Summary

✅ **Queue configured for 150 users**
✅ **Buffer capacity for burst traffic**
✅ **Graceful degradation under load**
✅ **Clear user feedback on wait times**
✅ **Production-ready configuration**

The queue configuration provides a robust foundation for scaling to 150 concurrent users while maintaining a good user experience.
