# Warbler Pack Caching Strategy

## Overview

The app now implements intelligent pack caching to avoid unnecessary re-ingestion of large datasets. This minimizes GitLab storage requirements and allows fast session startup.

## How It Works

### First Run (Session Start)

1. **PackManager** initializes and checks for cached metadata
2. **Health check** verifies if documents are already in the context store
3. **Ingestion** occurs only if:
   - No cache metadata exists
   - Pack count changed
   - Health check fails (documents missing)
4. **Cache** is saved with timestamp and document count

### Subsequent Runs

- Reuses cached documents without re-ingestion
- Quick health check ensures documents are still valid
- Fallback to sample docs if packs unavailable

## Environment Variables

Control pack ingestion behavior with these variables:

### `WARBLER_INGEST_PACKS` (default: `true`)

Enable/disable automatic pack ingestion.

```bash
export WARBLER_INGEST_PACKS=false
```

### `WARBLER_SAMPLE_ONLY` (default: `false`)

Load only sample documents (for CI/CD verification).

```bash
export WARBLER_SAMPLE_ONLY=true
```

Best for:

- PyPI package CI/CD pipelines
- Quick verification that ingestion works
- Minimal startup time in restricted environments

### `WARBLER_SKIP_PACK_CACHE` (default: `false`)

Force reingest even if cache exists.

```bash
export WARBLER_SKIP_PACK_CACHE=true
```

Best for:

- Testing pack ingestion pipeline
- Updating stale cache
- Debugging

## Cache Location

Default cache stored at:

```path
~/.warbler_cda/cache/pack_metadata.json
```

Metadata includes:

```json
{
  "ingested_at": 1699564800,
  "pack_count": 7,
  "doc_count": 12345,
  "status": "healthy"
}
```

## CI/CD Optimization

### For GitLab CI (Minimal PyPI Package)

```yaml
test:
  script:
    - export WARBLER_SAMPLE_ONLY=true
    - pip install .
    - python -m pytest tests/
```

Benefits:

- ✅ No large pack files in repository
- ✅ Fast CI runs (5 samples vs 2.5M docs)
- ✅ Verifies ingestion code works
- ✅ Full packs load on first user session

### For Local Development

Keep full packs in working directory:

```bash
cd warbler-cda-package
python -m warbler_cda.utils.hf_warbler_ingest ingest -d all
python app.py
```

First run ingests all packs. Subsequent runs use cache.

### For Gradio Space/Cloud Deployment

Set environment at deployment:

```bash
WARBLER_INGEST_PACKS=true
```

Packs ingest once per session, then cached in instance memory.

## Files Affected

- `app.py` - Main Gradio app with PackManager
- `warbler_cda/utils/load_warbler_packs.py` - Pack discovery (already handles caching)
- No changes needed to pack ingestion scripts

## Performance Impact

### Memory

- **With packs**: ~500MB (2.5M arxiv docs + others)
- **With samples**: ~1MB (5 test documents)

### Startup Time

- **First run**: ~30-60 seconds (ingest packs)
- **Cached run**: ~2-5 seconds (health check only)
- **Sample only**: <1 second

## Troubleshooting

### Packs not loading?

1. Check `WARBLER_INGEST_PACKS=true` (default)
2. Verify packs exist: `ls -la packs/`
3. Force reingest: `export WARBLER_SKIP_PACK_CACHE=true`

### Cache corrupted?

```bash
rm -rf ~/.warbler_cda/cache/pack_metadata.json
```

Will reingest on next run.

### Need sample docs only?

```bash
export WARBLER_SAMPLE_ONLY=true
python app.py
```

## Future Improvements

- [ ] Detect pack updates via file hash instead of just count
- [ ] Selective pack loading (choose which datasets to cache)
- [ ] Metrics dashboard showing cache hit/miss rates
- [ ] Automatic cache expiration after N days
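
For reference, the ingestion decision described under "How It Works" can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual `PackManager` code: the helper names `load_metadata`, `should_ingest`, and `save_metadata`, the `health_ok` argument, and the way the environment value is parsed are all hypothetical. Only the metadata fields, the default cache path, and the `WARBLER_SKIP_PACK_CACHE` variable come from this document.

```python
import json
import os
import time
from pathlib import Path

# Default cache location as documented above.
CACHE_FILE = Path.home() / ".warbler_cda" / "cache" / "pack_metadata.json"


def load_metadata(path=CACHE_FILE):
    """Return cached metadata as a dict, or None if it is missing or unreadable."""
    try:
        data = json.loads(Path(path).read_text())
    except (OSError, ValueError):  # missing file, bad permissions, or corrupt JSON
        return None
    return data if isinstance(data, dict) else None


def should_ingest(metadata, pack_count, health_ok, env=None):
    """Hypothetical helper: decide whether packs need to be (re-)ingested."""
    env = os.environ if env is None else env
    # Assumption: the variable is compared case-insensitively against "true".
    if env.get("WARBLER_SKIP_PACK_CACHE", "false").lower() == "true":
        return True  # cache explicitly bypassed
    if metadata is None:
        return True  # no cache metadata exists
    if metadata.get("pack_count") != pack_count:
        return True  # pack count changed
    return not health_ok  # health check failed (documents missing)


def save_metadata(pack_count, doc_count, path=CACHE_FILE):
    """Write metadata with the fields shown under "Cache Location"."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    metadata = {
        "ingested_at": int(time.time()),
        "pack_count": pack_count,
        "doc_count": doc_count,
        "status": "healthy",
    }
    path.write_text(json.dumps(metadata))
    return metadata
```

Treating an unreadable or corrupt metadata file the same as a missing one matches the troubleshooting advice above: deleting the file simply triggers re-ingestion on the next run.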