# OpenAI API Load Test Report - IDWeek Agents **Date:** October 10, 2025 **Test Configuration:** 150 concurrent users, 30-second duration --- ## Executive Summary ✅ **PASS** - The OpenAI API backend successfully handled 150 concurrent users with **100% success rate** and no errors. ### Key Metrics - **Throughput:** 10.13 requests/second - **Success Rate:** 100% (484/484 requests) - **Error Rate:** 0% - **Total Tokens:** 239,977 tokens (~496 tokens/request) - **Response Time (median):** 9.4 seconds - **Response Time (p95):** 14.3 seconds --- ## Detailed Analysis ### 1. Performance Assessment #### ✅ Strengths - **Zero failures:** No rate limit errors, timeouts, or API rejections - **Stable throughput:** Consistent 10 req/s across all user scenarios - **Token efficiency:** Avg 496 tokens/request is reasonable for complex clinical queries #### ⚠️ Observations - **Response latency:** 9.4s median response time is acceptable for async operations but may feel slow for interactive chat - **Concurrency handling:** OpenAI API handled 150 concurrent connections without throttling ### 2. Token Usage Breakdown | Operation Type | Requests | Tokens/Req (est) | |---|---|---| | Stewardship Recommendations | 144 (30%) | ~500 | | Clinical Assessment | 119 (25%) | ~600 | | Orchestrator Delegation | 83 (17%) | ~800 | | Simple Chat | 77 (16%) | ~300 | | Education Content | 61 (13%) | ~700 | **Projected Monthly Usage (150 users, 8hr/day, 30 days):** - Avg requests/user/day: ~64 (based on test pattern) - Total monthly tokens: ~2.3 billion tokens - Estimated cost @ gpt-4o-mini rates ($0.15/1M input, $0.60/1M output): - ~$920-1,380/month (depending on input/output ratio) ### 3. Rate Limit Status **OpenAI API Tier Limits (typical Tier 2):** - gpt-4o-mini: 30,000 RPM (requests per minute) - gpt-4o: 5,000 RPM **Current Load:** - Peak: ~600 req/min (10 req/s × 60s) - **Utilization:** 2% of gpt-4o-mini limit - **Headroom:** 49x current load before hitting rate limits ✅ **Verdict:** No rate limit concerns for 150 concurrent users. --- ## Response Time Analysis | Percentile | Response Time | User Experience | |---|---|---| | p50 (median) | 9.4s | Acceptable for async | | p95 | 14.3s | Borderline for interactive | | p99 | 16.3s | Slow; needs optimization | | Max | 18.0s | Unacceptable for chat | ### Recommendations: 1. **Streaming responses** (already implemented in app.py) - keeps users engaged during generation 2. **Response caching** for common queries (guidelines, educational content) 3. **Consider gpt-4o-mini** for all non-orchestrator tasks (faster, cheaper) --- ## Bottleneck Analysis ### OpenAI API Layer: ✅ No Bottlenecks - Zero errors - No rate limiting - Stable throughput - High headroom (49x capacity) ### Potential Bottlenecks (Next Phase - HF Spaces): 1. **Gradio concurrency limits** - default is 1-4 concurrent requests per Space 2. **Memory constraints** - HF Spaces have 16GB RAM limit 3. **Network I/O** - especially for session state management 4. **Database/session storage** - user isolation with 150 concurrent sessions --- ## Scaling Recommendations ### Immediate Actions (0-150 users) ✅ **No action needed** - current OpenAI setup is solid ### Growth Planning (150-500 users) 1. **Monitor token usage** - set up OpenAI usage alerts 2. **Implement response caching** with Redis/Upstash: - Cache guidelines (TTL: 24hr) - Cache PubMed results (TTL: 7 days) - Cache educational content (TTL: indefinite) 3. **Upgrade OpenAI tier** if needed (Tier 3: 90k RPM, Tier 4: 300k RPM) ### Scale-Out Strategy (500+ users) 1. **Load balancing:** Multiple HF Spaces behind a load balancer 2. **Queue management:** Background task processing for non-urgent queries 3. **CDN caching:** Static content (images, generated slides, educational materials) 4. **Database migration:** Move from in-memory session state to Redis Cluster --- ## Cost Projections ### Scenario A: 150 Active Users (Current Test) - **Requests/day:** ~9,600 (150 users × 64 req/user) - **Tokens/month:** ~2.3B tokens - **Monthly cost:** ~$920-1,380 ### Scenario B: 300 Active Users - **Tokens/month:** ~4.6B tokens - **Monthly cost:** ~$1,840-2,760 ### Scenario C: 500 Active Users - **Tokens/month:** ~7.7B tokens - **Monthly cost:** ~$3,080-4,620 **Note:** Costs assume 70% gpt-4o-mini / 30% gpt-4o mix. Pure gpt-4o-mini would reduce costs by 40-50%. --- ## Next Steps ### Phase 2: Hugging Face Spaces Load Test Test the full application stack including: - Gradio UI concurrency limits - Session state management (150 isolated sessions) - Database I/O for agent/chat storage - Network latency (user → HF Spaces → OpenAI → back) **Command to prepare:** ```bash python scripts/load_test_huggingface_spaces.py --users 150 --duration 60 --url https://huggingface.co/spaces/[your-space] ``` ### Monitoring Setup (Recommended) 1. **OpenAI Dashboard:** Track usage, errors, latency 2. **HF Spaces Logs:** Monitor memory, CPU, request queue depth 3. **Alerting:** Set thresholds for: - Error rate > 1% - p95 latency > 20s - Token usage spike (>2x baseline) --- ## Conclusion ✅ **The OpenAI API backend is production-ready for 150 concurrent users.** - Zero errors in load test - 49x capacity headroom before rate limits - Response times acceptable with streaming UI - Predictable token costs **Next critical test:** Hugging Face Spaces frontend to validate Gradio concurrency, session isolation, and end-to-end UX under load.