Spaces:
Runtime error
Runtime error
Bellok
committed on
Commit
·
d90df2b
1
Parent(s):
08609d9
trying to allow ingesting in the background. Current throughput on ingest is 17+ documents per second, that means only 25k documents out of over a million will load within HuggingFace's required time limit. If we want to have access to all 1.5m documents, we need to run the ingest in the background and allow Warbler-CDA to report healthy before the time-limit is up.
Browse files
TODO.md
CHANGED
|
@@ -1,39 +1,29 @@
|
|
| 1 |
-
#
|
| 2 |
|
| 3 |
## Overview
|
| 4 |
-
|
| 5 |
|
| 6 |
## Tasks
|
| 7 |
|
| 8 |
-
### 1.
|
| 9 |
-
- [ ]
|
| 10 |
-
- [ ]
|
| 11 |
-
- [ ]
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
- [ ]
|
| 16 |
-
- [ ]
|
| 17 |
-
|
| 18 |
-
### 3.
|
| 19 |
-
- [ ]
|
| 20 |
-
- [ ] Add
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
- [ ]
|
| 26 |
-
- [ ]
|
| 27 |
-
|
| 28 |
-
### 5. Generate Large Dataset
|
| 29 |
-
- [ ] Run ingestion script with NPC dialogue transformer
|
| 30 |
-
- [ ] Generate 1000+ synthetic dialogue entries
|
| 31 |
-
- [ ] Validate generated jsonl file
|
| 32 |
-
|
| 33 |
-
### 6. Test and Validate
|
| 34 |
-
- [ ] Test pack loading and ingestion
|
| 35 |
-
- [ ] Validate metadata and content structure
|
| 36 |
-
- [ ] Ensure no copyrighted material references remain
|
| 37 |
|
| 38 |
## Status
|
| 39 |
- [x] Plan created and approved
|
|
|
|
| 1 |
+
# Background Pack Ingestion Implementation
|
| 2 |
|
| 3 |
## Overview
|
| 4 |
+
Modify app.py to perform pack ingestion in a background thread, allowing the app to start immediately while documents load asynchronously.
|
| 5 |
|
| 6 |
## Tasks
|
| 7 |
|
| 8 |
+
### 1. Add Background Ingestion Support
|
| 9 |
+
- [ ] Import threading module in app.py
|
| 10 |
+
- [ ] Add global variables to track ingestion status (running, progress, total_docs, processed, etc.)
|
| 11 |
+
- [ ] Create a background_ingest_packs() function that performs the ingestion logic
|
| 12 |
+
- [ ] Start the background thread after API initialization but before app launch
|
| 13 |
+
|
| 14 |
+
### 2. Update System Stats
|
| 15 |
+
- [ ] Modify get_system_stats() to include ingestion progress information
|
| 16 |
+
- [ ] Display current ingestion status in the System Stats tab
|
| 17 |
+
|
| 18 |
+
### 3. Handle Thread Safety
|
| 19 |
+
- [ ] Ensure API.add_document() calls are thread-safe (assuming they are)
|
| 20 |
+
- [ ] Add proper error handling in the background thread
|
| 21 |
+
|
| 22 |
+
### 4. Test Implementation
|
| 23 |
+
- [ ] Test that app launches immediately
|
| 24 |
+
- [ ] Verify ingestion happens in background
|
| 25 |
+
- [ ] Check that queries work during ingestion
|
| 26 |
+
- [ ] Confirm progress is shown in System Stats
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 27 |
|
| 28 |
## Status
|
| 29 |
- [x] Plan created and approved
|
app.py
CHANGED
|
@@ -6,6 +6,7 @@ Interactive demo of the Cognitive Development Architecture RAG system
|
|
| 6 |
import json
|
| 7 |
import time
|
| 8 |
import os
|
|
|
|
| 9 |
import gradio as gr
|
| 10 |
import hashlib
|
| 11 |
import spaces
|
|
@@ -122,6 +123,17 @@ class PackManager:
|
|
| 122 |
|
| 123 |
pack_manager = PackManager()
|
| 124 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 125 |
try:
|
| 126 |
from warbler_cda import (
|
| 127 |
RetrievalAPI,
|
|
@@ -173,50 +185,15 @@ if WARBLER_AVAILABLE:
|
|
| 173 |
pack_docs = pack_loader.discover_documents()
|
| 174 |
|
| 175 |
if pack_docs and pack_manager.should_ingest_packs(api, len(pack_docs)):
|
| 176 |
-
|
| 177 |
-
|
| 178 |
-
|
| 179 |
-
|
| 180 |
-
|
| 181 |
-
|
| 182 |
-
|
| 183 |
-
|
| 184 |
-
|
| 185 |
-
batch_end = min(batch_start + batch_size, total_docs)
|
| 186 |
-
batch = pack_docs[batch_start:batch_end]
|
| 187 |
-
|
| 188 |
-
batch_processed = 0
|
| 189 |
-
batch_failed = 0
|
| 190 |
-
|
| 191 |
-
for doc in batch:
|
| 192 |
-
success = api.add_document(doc["id"], doc["content"], doc["metadata"])
|
| 193 |
-
if not success:
|
| 194 |
-
batch_failed += 1
|
| 195 |
-
failed += 1
|
| 196 |
-
if failed <= 5: # Log first few failures
|
| 197 |
-
print(f"[WARN] Failed to add document {doc['id']}")
|
| 198 |
-
|
| 199 |
-
batch_processed += 1
|
| 200 |
-
processed += 1
|
| 201 |
-
|
| 202 |
-
# Progress update after each batch
|
| 203 |
-
elapsed = time.time() - start_time
|
| 204 |
-
rate = processed / elapsed if elapsed > 0 else 0
|
| 205 |
-
eta = (total_docs - processed) / rate if rate > 0 else 0
|
| 206 |
-
print(f"[PROGRESS] {processed}/{total_docs} documents ingested "
|
| 207 |
-
f"({processed/total_docs*100:.1f}%) - "
|
| 208 |
-
f"{rate:.1f} docs/sec - ETA: {eta/60:.1f} min")
|
| 209 |
-
|
| 210 |
-
# Force garbage collection after large batches to free memory
|
| 211 |
-
if processed % 10000 == 0:
|
| 212 |
-
import gc
|
| 213 |
-
gc.collect()
|
| 214 |
-
|
| 215 |
-
packs_loaded = processed
|
| 216 |
-
pack_manager.mark_packs_ingested(1, packs_loaded)
|
| 217 |
-
total_time = time.time() - start_time
|
| 218 |
-
print(f"[OK] Loaded {packs_loaded} documents from Warbler packs "
|
| 219 |
-
f"({failed} failed) in {total_time:.1f} seconds")
|
| 220 |
|
| 221 |
elif pack_docs:
|
| 222 |
packs_loaded = len(pack_docs)
|
|
@@ -240,6 +217,72 @@ if WARBLER_AVAILABLE:
|
|
| 240 |
traceback.print_exc()
|
| 241 |
|
| 242 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 243 |
@spaces.GPU
|
| 244 |
def query_warbler(
|
| 245 |
query_text: str,
|
|
@@ -497,6 +540,14 @@ with gr.Blocks(title="Warbler CDA - RAG System Demo", theme=gr.themes.Soft()) as
|
|
| 497 |
# Auto-load stats on tab open
|
| 498 |
demo.load(fn=get_system_stats, outputs=stats_output)
|
| 499 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 500 |
with gr.Tab("About"):
|
| 501 |
gr.Markdown(
|
| 502 |
"""
|
|
|
|
| 6 |
import json
|
| 7 |
import time
|
| 8 |
import os
|
| 9 |
+
import threading
|
| 10 |
import gradio as gr
|
| 11 |
import hashlib
|
| 12 |
import spaces
|
|
|
|
| 123 |
|
| 124 |
pack_manager = PackManager()
|
| 125 |
|
| 126 |
+
# Global variables for background ingestion tracking
|
| 127 |
+
ingestion_status = {
|
| 128 |
+
"running": False,
|
| 129 |
+
"total_docs": 0,
|
| 130 |
+
"processed": 0,
|
| 131 |
+
"failed": 0,
|
| 132 |
+
"start_time": None,
|
| 133 |
+
"eta": 0,
|
| 134 |
+
"rate": 0,
|
| 135 |
+
}
|
| 136 |
+
|
| 137 |
try:
|
| 138 |
from warbler_cda import (
|
| 139 |
RetrievalAPI,
|
|
|
|
| 185 |
pack_docs = pack_loader.discover_documents()
|
| 186 |
|
| 187 |
if pack_docs and pack_manager.should_ingest_packs(api, len(pack_docs)):
|
| 188 |
+
# Start background ingestion
|
| 189 |
+
ingestion_thread = threading.Thread(
|
| 190 |
+
target=background_ingest_packs,
|
| 191 |
+
args=(api, pack_docs, pack_manager),
|
| 192 |
+
daemon=True
|
| 193 |
+
)
|
| 194 |
+
ingestion_thread.start()
|
| 195 |
+
packs_loaded = 0 # Will be updated asynchronously
|
| 196 |
+
print(f"[INFO] Started background ingestion of {len(pack_docs)} documents")
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 197 |
|
| 198 |
elif pack_docs:
|
| 199 |
packs_loaded = len(pack_docs)
|
|
|
|
| 217 |
traceback.print_exc()
|
| 218 |
|
| 219 |
|
| 220 |
+
def background_ingest_packs(api, pack_docs, pack_manager):
|
| 221 |
+
"""Background function to ingest packs without blocking app startup"""
|
| 222 |
+
global ingestion_status
|
| 223 |
+
|
| 224 |
+
ingestion_status["running"] = True
|
| 225 |
+
ingestion_status["total_docs"] = len(pack_docs)
|
| 226 |
+
ingestion_status["processed"] = 0
|
| 227 |
+
ingestion_status["failed"] = 0
|
| 228 |
+
ingestion_status["start_time"] = time.time()
|
| 229 |
+
|
| 230 |
+
print(f"[INFO] Ingesting {len(pack_docs)} documents from Warbler packs...")
|
| 231 |
+
total_docs = len(pack_docs)
|
| 232 |
+
processed = 0
|
| 233 |
+
failed = 0
|
| 234 |
+
start_time = time.time()
|
| 235 |
+
batch_size = 1000
|
| 236 |
+
|
| 237 |
+
# Process in batches to avoid memory issues and provide progress
|
| 238 |
+
for batch_start in range(0, total_docs, batch_size):
|
| 239 |
+
batch_end = min(batch_start + batch_size, total_docs)
|
| 240 |
+
batch = pack_docs[batch_start:batch_end]
|
| 241 |
+
|
| 242 |
+
batch_processed = 0
|
| 243 |
+
batch_failed = 0
|
| 244 |
+
|
| 245 |
+
for doc in batch:
|
| 246 |
+
success = api.add_document(doc["id"], doc["content"], doc["metadata"])
|
| 247 |
+
if not success:
|
| 248 |
+
batch_failed += 1
|
| 249 |
+
failed += 1
|
| 250 |
+
if failed <= 5: # Log first few failures
|
| 251 |
+
print(f"[WARN] Failed to add document {doc['id']}")
|
| 252 |
+
|
| 253 |
+
batch_processed += 1
|
| 254 |
+
processed += 1
|
| 255 |
+
|
| 256 |
+
# Update global status
|
| 257 |
+
ingestion_status["processed"] = processed
|
| 258 |
+
ingestion_status["failed"] = failed
|
| 259 |
+
|
| 260 |
+
# Progress update after each batch
|
| 261 |
+
elapsed = time.time() - start_time
|
| 262 |
+
rate = processed / elapsed if elapsed > 0 else 0
|
| 263 |
+
eta = (total_docs - processed) / rate if rate > 0 else 0
|
| 264 |
+
ingestion_status["rate"] = rate
|
| 265 |
+
ingestion_status["eta"] = eta
|
| 266 |
+
|
| 267 |
+
print(f"[PROGRESS] {processed}/{total_docs} documents ingested "
|
| 268 |
+
f"({processed/total_docs*100:.1f}%) - "
|
| 269 |
+
f"{rate:.1f} docs/sec - ETA: {eta/60:.1f} min")
|
| 270 |
+
|
| 271 |
+
# Force garbage collection after large batches to free memory
|
| 272 |
+
if processed % 10000 == 0:
|
| 273 |
+
import gc
|
| 274 |
+
gc.collect()
|
| 275 |
+
|
| 276 |
+
packs_loaded = processed
|
| 277 |
+
pack_manager.mark_packs_ingested(1, packs_loaded)
|
| 278 |
+
total_time = time.time() - start_time
|
| 279 |
+
print(f"[OK] Loaded {packs_loaded} documents from Warbler packs "
|
| 280 |
+
f"({failed} failed) in {total_time:.1f} seconds")
|
| 281 |
+
|
| 282 |
+
# Mark ingestion complete
|
| 283 |
+
ingestion_status["running"] = False
|
| 284 |
+
|
| 285 |
+
|
| 286 |
@spaces.GPU
|
| 287 |
def query_warbler(
|
| 288 |
query_text: str,
|
|
|
|
| 540 |
# Auto-load stats on tab open
|
| 541 |
demo.load(fn=get_system_stats, outputs=stats_output)
|
| 542 |
|
| 543 |
+
# Refresh stats every 10 seconds if ingestion is running
|
| 544 |
+
def auto_refresh_stats():
|
| 545 |
+
while ingestion_status["running"]:
|
| 546 |
+
time.sleep(10)
|
| 547 |
+
# Note: In Gradio, we can't directly update from background thread
|
| 548 |
+
# This would need a more complex setup with queues or websockets
|
| 549 |
+
# For now, users can manually refresh
|
| 550 |
+
|
| 551 |
with gr.Tab("About"):
|
| 552 |
gr.Markdown(
|
| 553 |
"""
|