# 📘 User Guide - ZeroGPU LLM Inference
## Quick Start (5 Minutes)

### 1. Choose Your Model
The model dropdown shows 30+ options organized by size:
- Compact (<2B): Fast, lightweight - great for quick responses
- Mid-size (2-8B): Best balance of speed and quality
- Large (14B+): Highest quality, slower but more capable
Recommendation for beginners: Start with Qwen3-4B-Instruct-2507
### 2. Try an Example Prompt
Click on any example below the chat box to get started:
- "Explain quantum computing in simple terms"
- "Write a Python function..."
- "What are the latest developments..." (requires web search)
### 3. Start Chatting!

Type your message and press Enter or click "📤 Send"
## Core Features

### 💬 Chat Interface
The main chat area shows:
- Your messages on one side
- AI responses with a 🤖 avatar
- Copy button on each message
- Smooth streaming as tokens generate
Tips:
- Press Enter to send (Shift+Enter for new line)
- Click Copy button to save responses
- Scroll up to review history
- Use Clear Chat to start fresh
### 🤖 Model Selection
When to use each size:
| Model Size | Best For | Speed | Quality |
|---|---|---|---|
| <2B | Quick questions, testing | ⚡⚡⚡ | ⭐⭐ |
| 2-8B | General chat, coding help | ⚡⚡ | ⭐⭐⭐ |
| 14B+ | Complex reasoning, long-form | ⚡ | ⭐⭐⭐⭐ |
Specialized Models:
- Phi-4-mini-Reasoning: Math, logic problems
- Qwen2.5-Coder: Programming tasks
- DeepSeek-R1-Distill: Step-by-step reasoning
- Apriel-1.5-15b-Thinker: Multimodal understanding
### 🌐 Web Search
Enable this when you need:
- Current events and news
- Recent information (after model training cutoff)
- Facts that change frequently
- Real-time data
How it works:
- Toggle "🌐 Enable Web Search"
- Web search settings accordion appears
- System prompt updates automatically
- Search runs in background (won't block chat)
- Results injected into context
Settings explained:
- Max Results: How many search results to fetch (4 is good default)
- Max Chars/Result: Limit length per result (50 prevents overwhelming context)
- Search Timeout: Maximum wait time (5s recommended)
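The injection step described above can be sketched as follows. This is an illustrative sketch only, not the app's actual code: the function names are hypothetical, and the defaults simply mirror the "Max Results" and "Max Chars/Result" sliders.

```python
from datetime import date

def build_search_context(results, max_results=4, max_chars=50):
    """Trim and format search results for prompt injection.

    `results` is assumed to be a list of (title, snippet) pairs.
    Each snippet is truncated to `max_chars` so the results do not
    overwhelm the model's context window.
    """
    lines = []
    for title, snippet in results[:max_results]:
        lines.append(f"- {title}: {snippet[:max_chars]}")
    return "\n".join(lines)

def build_system_prompt(base_prompt, results):
    """Append the current date and search context to the base prompt,
    roughly as the app does when web search is enabled."""
    context = build_search_context(results)
    return (f"{base_prompt}\n"
            f"Today is {date.today():%Y-%m-%d}.\n"
            f"Search results:\n{context}")
```

Because results are truncated per item rather than as a whole, one very long snippet cannot crowd out the others.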
### 📝 System Prompt
This defines the AI's personality and behavior.
Default prompts:
- Without search: Helpful, creative assistant
- With search: Includes search results and current date
Customization ideas:
- "You are a professional code reviewer..."
- "You are a creative writing coach..."
- "You are a patient tutor explaining concepts simply..."
- "You are a technical documentation writer..."
## Advanced Features

### 🎛️ Advanced Generation Parameters
Click the accordion to reveal these controls:
Max Tokens (64-16384)
- What it does: Sets maximum response length
- Lower (256-512): Quick, concise answers
- Medium (1024): Balanced (default)
- Higher (2048+): Long-form content, detailed explanations
Temperature (0.1-2.0)
- What it does: Controls randomness/creativity
- Low (0.1-0.3): Focused, deterministic (good for facts, code)
- Medium (0.7): Balanced creativity (default)
- High (1.2-2.0): Very creative, unpredictable (stories, brainstorming)
Top-K (1-100)
- What it does: Limits token choices to top K most likely
- Lower (10-20): More focused
- Medium (40): Balanced (default)
- Higher (80-100): More varied vocabulary
Top-P (0.1-1.0)
- What it does: Nucleus sampling threshold
- Lower (0.5-0.7): Conservative choices
- Medium (0.9): Balanced (default)
- Higher (0.95-1.0): Full vocabulary range
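Temperature, Top-K, and Top-P interact as successive filters on the model's next-token distribution. A minimal standalone sketch of that filtering step (the token-id dictionary and function name are hypothetical, not the app's code):

```python
import math

def top_k_top_p_filter(logits, top_k=40, top_p=0.9, temperature=0.7):
    """Return the token ids that survive Top-K then Top-P filtering.

    `logits` maps token_id -> raw score. Temperature rescales scores
    before the softmax: lower values sharpen the distribution.
    """
    # Temperature scaling, then a numerically stable softmax.
    scaled = {t: s / temperature for t, s in logits.items()}
    m = max(scaled.values())
    exp = {t: math.exp(s - m) for t, s in scaled.items()}
    z = sum(exp.values())
    probs = {t: e / z for t, e in exp.items()}

    # Top-K: keep only the K most probable tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # Top-P (nucleus): keep the smallest prefix of tokens whose
    # cumulative probability reaches top_p.
    kept, cum = [], 0.0
    for token, p in ranked:
        kept.append(token)
        cum += p
        if cum >= top_p:
            break
    return kept
```

With a low temperature and a dominant token, the nucleus often collapses to a single candidate, which is why low-temperature output feels deterministic.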
Repetition Penalty (1.0-2.0)
- What it does: Reduces repeated words/phrases
- Low (1.0-1.1): Allows some repetition
- Medium (1.2): Balanced (default)
- High (1.5+): Strongly avoids repetition (may hurt coherence)
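A common way to implement this penalty (CTRL-style, also used by popular inference libraries) divides positive logits and multiplies negative ones, so already-generated tokens become less likely either way. A minimal sketch, not the app's actual implementation:

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Down-weight tokens that already appear in the output.

    `logits` maps token_id -> raw score; `generated_ids` lists tokens
    emitted so far. Dividing a positive score, or multiplying a
    negative one, by `penalty` > 1 pushes it toward "less likely".
    """
    out = dict(logits)
    for t in set(generated_ids):
        if t in out:
            out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out
```

Note the asymmetry: a flat division would make strongly negative scores *more* likely, which is why the sign check is needed.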
### Preset Configurations
For Creative Writing:
- Temperature: 1.2
- Top-P: 0.95
- Top-K: 80
- Max Tokens: 2048

For Code Generation:
- Temperature: 0.3
- Top-P: 0.9
- Top-K: 40
- Max Tokens: 1024
- Repetition Penalty: 1.1

For Factual Q&A:
- Temperature: 0.5
- Top-P: 0.85
- Top-K: 30
- Max Tokens: 512
- Enable Web Search: Yes

For Reasoning Tasks:
- Model: Phi-4-mini-Reasoning or DeepSeek-R1
- Temperature: 0.7
- Max Tokens: 2048
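If you script against settings like these, the presets can live in a small configuration table that overlays the defaults. The names and structure below are illustrative only:

```python
# Hypothetical preset table mirroring the configurations listed above.
PRESETS = {
    "creative_writing": {"temperature": 1.2, "top_p": 0.95, "top_k": 80,
                         "max_tokens": 2048},
    "code_generation": {"temperature": 0.3, "top_p": 0.9, "top_k": 40,
                        "max_tokens": 1024, "repetition_penalty": 1.1},
    "factual_qa": {"temperature": 0.5, "top_p": 0.85, "top_k": 30,
                   "max_tokens": 512, "web_search": True},
    "reasoning": {"temperature": 0.7, "max_tokens": 2048},
}

def apply_preset(defaults, name):
    """Overlay a named preset on top of the default settings;
    keys not set by the preset keep their default values."""
    return {**defaults, **PRESETS[name]}
```

Keeping presets as data rather than code makes it easy to add a new one without touching the generation logic.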
## Tips & Tricks

### 🎯 Getting Better Results
Be Specific: "Write a Python function to sort a list" → "Write a Python function that sorts a list of dictionaries by a specific key"
Provide Context: "Explain recursion" → "Explain recursion to someone learning programming for the first time, with a simple example"
Use System Prompts: Define role/expertise in system prompt instead of every message
Iterate: Use follow-up questions to refine responses
Experiment with Models: Try different models for the same task
### ⚡ Performance Tips
- Start Small: Test with smaller models first
- Adjust Max Tokens: Don't request more than you need
- Use Cancel: Stop bad generations early
- Clear Cache: Clear chat if experiencing slowdowns
- One Task at a Time: Don't send multiple requests simultaneously
### 🌐 When to Use Web Search

✅ Good use cases:
- "What happened in the latest SpaceX launch?"
- "Current cryptocurrency prices"
- "Recent AI research papers"
- "Today's weather in Paris"
❌ Don't need search for:
- General knowledge questions
- Code writing/debugging
- Math problems
- Creative writing
- Theoretical explanations
### 💭 Understanding Thinking Mode
Some models output `<think>...</think>` blocks:

    <think>
    Let me break this down step by step...
    First, I need to consider...
    </think>
    Here's the answer: ...
In the UI:
- Thinking shows as "💭 Thought"
- Answer shows separately
- Helps you see the reasoning process
Best for:
- Complex math problems
- Multi-step reasoning
- Debugging logic
- Learning how AI thinks
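Separating the reasoning from the visible answer amounts to stripping the `<think>` blocks. A regex-based sketch of how a UI might do it (not necessarily this app's code):

```python
import re

# DOTALL lets the reasoning span multiple lines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(text):
    """Split model output into (thoughts, answer).

    `thoughts` is a list of the <think>...</think> bodies, shown under
    a collapsible "Thought" section; `answer` is what remains and is
    displayed as the normal reply.
    """
    thoughts = [m.strip() for m in THINK_RE.findall(text)]
    answer = THINK_RE.sub("", text).strip()
    return thoughts, answer
```

The non-greedy `.*?` matters: with a greedy match, two think blocks in one response would be merged, swallowing the answer text between them.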
## Troubleshooting

### Generation is Slow
- Try a smaller model
- Reduce Max Tokens
- Disable web search if not needed
- Clear chat history
### Responses are Repetitive
- Increase Repetition Penalty
- Reduce Temperature slightly
- Try a different model
### Responses are Random/Nonsensical
- Decrease Temperature
- Reduce Top-P
- Reduce Top-K
- Try a more stable model
### Web Search Not Working
- Check timeout isn't too short
- Verify internet connection
- Try increasing Max Results
- Check search query in debug panel
### Cancel Button Doesn't Work
- Wait a moment (it might still be processing)
- Refresh the page if the problem persists
- Check the browser console for errors
## Keyboard Shortcuts
- Enter: Send message
- Shift+Enter: New line in input
- Ctrl+C: Copy (when text selected)
- Ctrl+A: Select all in input
## Best Practices

### For Beginners
- Start with example prompts
- Use default settings initially
- Try 2-4 different models
- Gradually explore advanced settings
- Read responses fully before replying
### For Power Users
- Create custom system prompts
- Fine-tune parameters per task
- Use debug panel for prompt engineering
- Experiment with model combinations
- Utilize web search strategically
### For Developers
- Study the debug output
- Test code generation thoroughly
- Use lower temperature for determinism
- Compare multiple models
- Save working configurations
## Privacy & Safety
- No data collection: Conversations not stored permanently
- Model limitations: May produce incorrect information
- Verify important info: Don't rely solely on AI for critical decisions
- Web search: Uses DuckDuckGo (privacy-focused)
- Open source: Code is transparent and auditable
## Support & Feedback
Found a bug? Have a suggestion?
- Check GitHub issues
- Submit feature requests
- Contribute improvements
- Share your use cases
Happy chatting! 🎉