# OpenAI API Load Test Report - IDWeek Agents
**Date:** October 10, 2025  
**Test Configuration:** 150 concurrent users, 30-second duration

---

## Executive Summary

✅ **PASS** - The OpenAI API backend successfully handled 150 concurrent users with **100% success rate** and no errors.

### Key Metrics
- **Throughput:** 10.13 requests/second
- **Success Rate:** 100% (484/484 requests)
- **Error Rate:** 0%
- **Total Tokens:** 239,977 tokens (~496 tokens/request)
- **Response Time (median):** 9.4 seconds
- **Response Time (p95):** 14.3 seconds

---

## Detailed Analysis

### 1. Performance Assessment

#### ✅ Strengths
- **Zero failures:** No rate limit errors, timeouts, or API rejections
- **Stable throughput:** Consistent 10 req/s across all user scenarios
- **Token efficiency:** Avg 496 tokens/request is reasonable for complex clinical queries

#### ⚠️ Observations
- **Response latency:** 9.4s median response time is acceptable for async operations but may feel slow for interactive chat
- **Concurrency handling:** OpenAI API handled 150 concurrent connections without throttling

### 2. Token Usage Breakdown

| Operation Type | Requests | Tokens/Req (est) |
|---|---|---|
| Stewardship Recommendations | 144 (30%) | ~500 |
| Clinical Assessment | 119 (25%) | ~600 |
| Orchestrator Delegation | 83 (17%) | ~800 |
| Simple Chat | 77 (16%) | ~300 |
| Education Content | 61 (13%) | ~700 |

**Projected Monthly Usage (150 users, 8hr/day, 30 days):**
- Avg requests/user/day: ~64 (based on test pattern)
- Total monthly tokens: ~2.3 billion tokens
- Estimated cost @ gpt-4o-mini rates ($0.15/1M input, $0.60/1M output):
  - ~$920-1,380/month (depending on input/output ratio)

### 3. Rate Limit Status

**OpenAI API Tier Limits (typical Tier 2):**
- gpt-4o-mini: 30,000 RPM (requests per minute)
- gpt-4o: 5,000 RPM

**Current Load:**
- Peak: ~600 req/min (10 req/s × 60s)
- **Utilization:** 2% of gpt-4o-mini limit
- **Headroom:** 49x current load before hitting rate limits

✅ **Verdict:** No rate limit concerns for 150 concurrent users.

---

## Response Time Analysis

| Percentile | Response Time | User Experience |
|---|---|---|
| p50 (median) | 9.4s | Acceptable for async |
| p95 | 14.3s | Borderline for interactive |
| p99 | 16.3s | Slow; needs optimization |
| Max | 18.0s | Unacceptable for chat |

### Recommendations:
1. **Streaming responses** (already implemented in app.py) - keeps users engaged during generation
2. **Response caching** for common queries (guidelines, educational content)
3. **Consider gpt-4o-mini** for all non-orchestrator tasks (faster, cheaper)

---

## Bottleneck Analysis

### OpenAI API Layer: ✅ No Bottlenecks
- Zero errors
- No rate limiting
- Stable throughput
- High headroom (49x capacity)

### Potential Bottlenecks (Next Phase - HF Spaces):
1. **Gradio concurrency limits** - default is 1-4 concurrent requests per Space
2. **Memory constraints** - HF Spaces have 16GB RAM limit
3. **Network I/O** - especially for session state management
4. **Database/session storage** - user isolation with 150 concurrent sessions

---

## Scaling Recommendations

### Immediate Actions (0-150 users)
✅ **No action needed** - current OpenAI setup is solid

### Growth Planning (150-500 users)
1. **Monitor token usage** - set up OpenAI usage alerts
2. **Implement response caching** with Redis/Upstash:
   - Cache guidelines (TTL: 24hr)
   - Cache PubMed results (TTL: 7 days)
   - Cache educational content (TTL: indefinite)
3. **Upgrade OpenAI tier** if needed (Tier 3: 90k RPM, Tier 4: 300k RPM)

### Scale-Out Strategy (500+ users)
1. **Load balancing:** Multiple HF Spaces behind a load balancer
2. **Queue management:** Background task processing for non-urgent queries
3. **CDN caching:** Static content (images, generated slides, educational materials)
4. **Database migration:** Move from in-memory session state to Redis Cluster

---

## Cost Projections

### Scenario A: 150 Active Users (Current Test)
- **Requests/day:** ~9,600 (150 users × 64 req/user)
- **Tokens/month:** ~2.3B tokens
- **Monthly cost:** ~$920-1,380

### Scenario B: 300 Active Users
- **Tokens/month:** ~4.6B tokens
- **Monthly cost:** ~$1,840-2,760

### Scenario C: 500 Active Users
- **Tokens/month:** ~7.7B tokens
- **Monthly cost:** ~$3,080-4,620

**Note:** Costs assume 70% gpt-4o-mini / 30% gpt-4o mix. Pure gpt-4o-mini would reduce costs by 40-50%.

---

## Next Steps

### Phase 2: Hugging Face Spaces Load Test
Test the full application stack including:
- Gradio UI concurrency limits
- Session state management (150 isolated sessions)
- Database I/O for agent/chat storage
- Network latency (user → HF Spaces → OpenAI → back)

**Command to prepare:**
```bash
python scripts/load_test_huggingface_spaces.py --users 150 --duration 60 --url https://huggingface.co/spaces/[your-space]
```

### Monitoring Setup (Recommended)
1. **OpenAI Dashboard:** Track usage, errors, latency
2. **HF Spaces Logs:** Monitor memory, CPU, request queue depth
3. **Alerting:** Set thresholds for:
   - Error rate > 1%
   - p95 latency > 20s
   - Token usage spike (>2x baseline)

---

## Conclusion

✅ **The OpenAI API backend is production-ready for 150 concurrent users.**

- Zero errors in load test
- 49x capacity headroom before rate limits
- Response times acceptable with streaming UI
- Predictable token costs

**Next critical test:** Hugging Face Spaces frontend to validate Gradio concurrency, session isolation, and end-to-end UX under load.