# Gradio Queue Configuration for 150 Concurrent Users

## Overview

The application's Gradio queue settings are sized to handle 150 concurrent users on Hugging Face Spaces.

## Configuration Details

### Queue Settings (in app.py)

```python
app.queue(
    max_size=200,                  # allow up to 200 requests in the queue
    default_concurrency_limit=50,  # process up to 50 requests concurrently
)
```

### Launch Settings (in hf_config.py)

```python
config = {
    "max_threads": 100  # allow more worker threads for concurrent requests
}
```
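
`max_threads` only takes effect once it reaches `launch()`. A minimal sketch of how such a config dict can be wired up, assuming a `ChatInterface`-style app (the `echo` handler and `demo` name are illustrative, not from `app.py`):

```python
import gradio as gr

config = {
    "max_threads": 100  # worker threads for I/O-bound request handling
}

def echo(message, history):
    return message  # placeholder for the real GPT-4o chat handler

demo = gr.ChatInterface(fn=echo)
demo.queue(max_size=200, default_concurrency_limit=50)
demo.launch(**config)  # max_threads is a supported launch() parameter
```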

## How It Works

### 1. Request Queue (`max_size=200`)

- Acts as a buffer for incoming requests
- Allows 200 requests to wait in the queue (a 33% buffer above 150 users)
- Prevents server overload by managing request flow
- Requests beyond 200 are rejected with a "Queue Full" message

### 2. Concurrency Limit (`default_concurrency_limit=50`)

- Processes up to 50 requests simultaneously (a per-event override is sketched below)
- Balances performance vs. resource usage
- Prevents OpenAI API rate limit exhaustion
- Each request includes:
  - GPT-4o chat completion
  - Tool execution (recommend_deescalation, empiric_therapy, etc.)
  - History management
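
The global default can be overridden for individual events when some handlers are much heavier than others. A minimal sketch using Gradio's per-event `concurrency_limit` parameter (the handlers and components are illustrative, not taken from `app.py`):

```python
import gradio as gr

def run_chat(message):
    # heavy path: GPT-4o completion plus tool execution
    return f"(model reply to: {message})"

def clear_history():
    # light path: no external API calls
    return ""

with gr.Blocks() as demo:
    box = gr.Textbox(label="Message")
    out = gr.Textbox(label="Response")
    send = gr.Button("Send")
    clear = gr.Button("Clear")

    # Heavy handler: cap below the global default of 50
    send.click(run_chat, inputs=box, outputs=out, concurrency_limit=25)
    # Light handler: None removes the limit entirely
    clear.click(clear_history, outputs=out, concurrency_limit=None)

demo.queue(max_size=200, default_concurrency_limit=50)
```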

### 3. Worker Threads (`max_threads=100`)

- Handles I/O-bound operations (API calls, streaming)
- Allows efficient concurrent processing
- Supports async operations without blocking (see the async handler sketch below)
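
Writing handlers as `async def` keeps a worker thread from being pinned for the full duration of an OpenAI call; the event loop services other requests while the completion is awaited. A minimal sketch (the client setup and single-turn prompt are illustrative):

```python
import gradio as gr
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def respond(message, history):
    # awaiting here yields control, so other requests keep flowing
    completion = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": message}],
    )
    return completion.choices[0].message.content

demo = gr.ChatInterface(fn=respond)
demo.queue(max_size=200, default_concurrency_limit=50)
```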

## Performance Expectations

**For 150 Concurrent Users:**

- Queue Buffer: 50 extra slots (200 total) for burst traffic
- Concurrent Processing: 50 requests active at once
- Average Wait Time (estimated after this list):
  - Low load (< 50 users): ~0-2 seconds
  - Medium load (50-100 users): ~2-10 seconds
  - High load (100-150 users): ~10-30 seconds
  - Burst load (> 150 users): queue position displayed
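
These ranges follow from a back-of-envelope calculation: with 50 concurrent slots and per-request service time near the measured P50 of ~9.4 s (see the load test results below), a request at position k in the queue waits roughly ⌈k / 50⌉ "waves" of service. A sketch of that arithmetic:

```python
import math

CONCURRENCY = 50      # default_concurrency_limit
SERVICE_TIME_S = 9.4  # ~P50 latency from the load test below

def estimated_wait(queue_position: int) -> float:
    """Rough wait before processing starts: full waves of 50 ahead of you."""
    waves_ahead = math.ceil(queue_position / CONCURRENCY)
    return waves_ahead * SERVICE_TIME_S

for position in (10, 60, 120, 180):
    print(f"queue position {position:>3}: ~{estimated_wait(position):.0f}s wait")
# queue position 10: ~9s, 60: ~19s, 120: ~28s, 180: ~38s
```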

**Hugging Face Spaces Tier Recommendations:**

| Tier | Users | Queue Behavior |
|------|-------|----------------|
| Free | 1-4 | Queue works, but limited to 4 concurrent users |
| Pro ($30/mo) | 50-75 | Queue enables ~75 users, but may see longer waits |
| Pro+ ($60/mo) | 100-120 | Queue enables ~120 users with reasonable wait times |
| Enterprise ($500+/mo) | 150+ | Full 150-user support with optimal performance |

## Queue User Experience

**What Users See:**

1. Low Load: instant response
2. Medium Load: "Processing..." indicator
3. High Load: "You are #X in queue" message
4. Queue Full: "Too many requests, please try again"

**Graceful Degradation:**

- Queue prevents crashes under load
- Users get clear feedback on wait times
- Failed requests can be retried (a client-side retry sketch follows)
- No data loss during high traffic
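
For programmatic callers, retries with exponential backoff cover transient "queue full" rejections. A minimal sketch using `gradio_client`; the `api_name` here is an assumption — check the Space's "Use via API" page for the real endpoint name:

```python
import time
from gradio_client import Client

client = Client("John-jero/IDWeekAgents")

def predict_with_retry(message: str, retries: int = 3) -> str:
    for attempt in range(retries):
        try:
            # api_name="/chat" is hypothetical; use the Space's actual endpoint
            return client.predict(message, api_name="/chat")
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # backoff: 1s, then 2s, before giving up
```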

## Monitoring Recommendations

**Key Metrics to Watch:**

1. Queue Length: should stay < 150 under normal load
2. Wait Times: average < 10s for good UX
3. Rejection Rate: < 5% indicates healthy capacity
4. OpenAI API Latency: monitor p95/p99 response times (a percentile helper is sketched below)
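
Percentiles can be computed from per-request durations with the standard library alone. A small helper, assuming latencies are recorded in seconds around each request:

```python
import statistics

def latency_report(latencies_s: list[float]) -> dict:
    """Summarize request latencies. quantiles(n=100) returns 99 cut points."""
    q = statistics.quantiles(latencies_s, n=100)
    return {
        "p50": q[49],  # median
        "p95": q[94],
        "p99": q[98],
        "mean": statistics.fmean(latencies_s),
    }

# e.g. feed durations measured around each chat completion
print(latency_report([8.7, 9.1, 9.8, 10.2, 11.3, 12.5, 19.0, 23.4]))
```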

**Load Test Results (from a previous test):**

```
Total Requests: 484
Success Rate:   100%
Throughput:     10.13 req/s
P50 Latency:    9.4s
P95 Latency:    19.6s
P99 Latency:    23.2s
```

## Scaling Strategies

**If Queue Fills Frequently:**

1. Increase `max_size`: add more queue capacity (e.g., 300)
2. Increase `default_concurrency_limit`: process more requests simultaneously (e.g., 75)
3. Upgrade HF Tier: get more CPU/memory resources
4. Multi-Space Setup: load balance across multiple Spaces

**If OpenAI Rate Limits Hit:**

1. Reduce `default_concurrency_limit`: lower it to 30-40
2. Implement Rate Limiting: add per-user request throttling (see the sketch below)
3. Request a higher OpenAI tier: Tier 4 (~$5,000/month spend) raises the TPM limit
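
For item 2, a per-user throttle can run inside the handler before any OpenAI call is made. A minimal sliding-window sketch; the injected `gr.Request` parameter is standard Gradio behavior, while the window size, request cap, and `call_model` helper are illustrative:

```python
import time
from collections import defaultdict, deque

import gradio as gr

WINDOW_S = 60       # sliding-window length in seconds (illustrative)
MAX_REQUESTS = 10   # allowed requests per user per window (illustrative)
_recent: dict[str, deque] = defaultdict(deque)

def call_model(message: str) -> str:
    return f"(model reply to: {message})"  # stand-in for the real handler

def throttled_chat(message, history, request: gr.Request):
    # Gradio injects a Request when a parameter is annotated with gr.Request
    window = _recent[request.session_hash]
    now = time.monotonic()
    while window and now - window[0] > WINDOW_S:
        window.popleft()  # drop timestamps outside the window
    if len(window) >= MAX_REQUESTS:
        return "Rate limit reached; please wait a minute and try again."
    window.append(now)
    return call_model(message)
```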

## Configuration Files

- `app.py` (line ~2350): queue configuration
- `hf_config.py` (line ~30): launch configuration with `max_threads`
- Both files committed: ready for deployment

## Testing Commands

**Local Load Test:**

```bash
python scripts/load_test_huggingface_spaces.py --users 150 --duration 60 --url http://localhost:7860
```

**Production Load Test (HF Spaces):**

```bash
python scripts/load_test_huggingface_spaces.py --users 150 --duration 60 --url https://huggingface.co/spaces/John-jero/IDWeekAgents
```

## Summary

- Queue configured for 150 users
- Buffer capacity for burst traffic
- Graceful degradation under load
- Clear user feedback on wait times
- Production-ready configuration

The queue configuration provides a robust foundation for scaling to 150 concurrent users while maintaining good user experience.