---
title: ML Research Paper RAG Chatbot
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8501
tags:
  - streamlit
  - machine-learning
  - research
  - rag
  - chatbot
pinned: false
short_description: AI-powered chatbot for ML research papers
---

# 📄 ML Research Paper RAG Chatbot

An intelligent research assistant that helps you discover, understand, and explore Machine Learning research papers from ArXiv using Retrieval-Augmented Generation (RAG).

## 🎯 What is this?

This chatbot helps you:

- 🔍 **Find relevant research papers** on any ML topic
- 📚 **Get detailed explanations** from published research
- 💡 **Understand complex concepts** with cited sources
- 🎓 **Stay up to date** with ML research trends

## ✨ Features

- **Multi-LLM Support**: Choose between Anthropic Claude, Google Gemini, or Groq
- **Smart Retrieval**: FAISS vector store with semantic search and reranking
- **Research-Focused**: Answers are grounded in retrieved papers to limit hallucinations
- **Citation-Backed**: All responses cite source papers with metadata
- **Interactive UI**: Clean Streamlit interface with helpful guides

## 🚀 Quick Start Guide

### For First-Time Users

1. **Start a conversation** by typing a question in the chat box
2. **Try example queries** using the quick action buttons
3. **Explore results** by expanding the "View Retrieved Documents" section
4. **Adjust settings** in the sidebar to fine-tune results

### Example Queries

```
✅ Find papers on handling imbalanced datasets
✅ What methods are used for fraud detection in ML?
✅ Explain the attention mechanism in transformers
✅ List recent papers about reinforcement learning
✅ How does batch normalization improve training?
```

## 💡 Tips for Best Results

### Ask Better Questions

- ✅ **Be specific**: "fraud detection in credit cards" > "fraud"
- ✅ **Use ML terminology**: "convolutional neural networks" > "image AI"
- ✅ **Ask for comparisons**: "Compare CNN vs RNN for sequences"

### Understand the Responses

- 📚 All answers are based on research papers in the database
- 🔍 Check "View Retrieved Documents" to see sources
- ⚠️ If the retrieved documents seem irrelevant, try rephrasing your question

### Advanced Usage

- ⚙️ Adjust retrieval settings (`base_k`, `rerank_k`) to retrieve more or fewer papers
- 🎨 Switch LLM providers for different response styles
- 📅 Filter by year or category for focused results

## 🗂️ Dataset

Uses **CShorten/ML-ArXiv-Papers** from Hugging Face:

- Curated Machine Learning research papers from ArXiv
- Includes titles, abstracts, metadata, and citations
- Regularly updated with new publications

## ⚙️ Configuration

### LLM Providers

1. **Anthropic Claude** (Recommended for quality)
   - `claude-3-5-sonnet-20241022` (Best balance)
   - `claude-3-5-haiku-20241022` (Fast)
2. **Google Gemini** (Good for free tier)
   - `gemini-2.5-flash` (Fast and efficient)
3. **Groq** (Fastest inference)
   - `llama-4-maverick-17b` (Open source)

### Retrieval Settings

- **base_k**: Number of papers fetched initially (4-30, default: 20)
- **rerank_k**: Number of papers kept after reranking (1-12, default: 8)
- **Dynamic k**: Automatically adjusts k based on the query
- **Reranking**: Improves relevance with a cross-encoder

## 🔧 Setup (For Developers)

### Prerequisites

```bash
pip install -r requirements.txt
```

### API Keys

Create a `.env` file:

```env
ANTHROPIC_API_KEY=your-key-here
GEMINI_API_KEY=your-key-here
GROQ_API_KEY=your-key-here
```

### Run Locally

```bash
streamlit run streamlit_app.py
```

## 📊 How It Works

1. **User Query** → A semantic embedding of the query is created
2. **Vector Search** → FAISS retrieves similar papers
3. **Reranking** → A cross-encoder scores relevance
4. **LLM Generation** → The LLM generates an answer from the retrieved papers
5. **Response** → Cited answer with source papers

## 🔒 Important Notes

- ✅ Answers are **based only on research papers** in the database
- ✅ The system is designed not to fill gaps with general knowledge
- ✅ If no relevant papers are found, it will tell you
- ❌ Not a replacement for reading the full papers
- ⚠️ Always verify critical information with original sources

## 🤝 Contributing

Feel free to submit issues, fork the repository, and create pull requests for any improvements.

## 📄 License

This project is open source and available under the MIT License.

---

**Ready to explore ML research?** Start by asking a question! 🚀
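
## 🧪 Retrieval Sketch (Illustrative)

The retrieve-then-rerank flow described under "How It Works" can be illustrated with a small, standard-library-only sketch. This is not the app's actual code: plain cosine similarity stands in for the FAISS vector search, a toy keyword-overlap score stands in for the cross-encoder, and the function names (`retrieve`, `rerank`, `keyword_overlap`), the three-dimensional vectors, and the `toy-*` papers are all hypothetical. Only the `base_k` and `rerank_k` parameter names come from the settings above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, papers, base_k=20):
    """Stage 1: keep the base_k papers closest to the query embedding.

    Assumption: brute-force cosine similarity stands in for FAISS here.
    """
    ranked = sorted(papers, key=lambda p: cosine(query_vec, p["vector"]), reverse=True)
    return ranked[:base_k]


def keyword_overlap(query, text):
    """Toy relevance score: fraction of query words that appear in the text.

    Assumption: a hypothetical stand-in for a cross-encoder score.
    """
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def rerank(query, candidates, score_fn=keyword_overlap, rerank_k=8):
    """Stage 2: rescore the candidates against the query text and keep rerank_k."""
    ranked = sorted(candidates, key=lambda p: score_fn(query, p["abstract"]), reverse=True)
    return ranked[:rerank_k]


# Made-up papers with made-up 3-dimensional embeddings, for illustration only.
PAPERS = [
    {"title": "toy-A", "abstract": "oversampling methods for imbalanced datasets", "vector": [0.9, 0.1, 0.0]},
    {"title": "toy-B", "abstract": "attention mechanism in transformers", "vector": [0.1, 0.9, 0.0]},
    {"title": "toy-C", "abstract": "cost-sensitive learning for imbalanced fraud data", "vector": [0.8, 0.2, 0.1]},
    {"title": "toy-D", "abstract": "batch normalization and training stability", "vector": [0.0, 0.2, 0.9]},
]

query = "imbalanced datasets"
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of the query

candidates = retrieve(query_vec, PAPERS, base_k=3)
sources = rerank(query, candidates, rerank_k=2)
print([paper["title"] for paper in sources])
```

In this sketch, a larger `base_k` widens the candidate pool the reranker can choose from, while `rerank_k` caps how many papers would be passed on to the LLM as sources.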