ZeroGPU-LLM-Inference / USER_GUIDE.md

Luigi

Add comprehensive documentation and user guide

c1bc514 26 days ago

📖 User Guide - ZeroGPU LLM Inference

Quick Start (5 Minutes)

1. Choose Your Model

The model dropdown shows 30+ options organized by size:

Compact (<2B parameters): Fast and lightweight, great for quick responses
Mid-size (2-8B): Best balance of speed and quality
Large (14B+): Highest quality, slower but more capable

Recommendation for beginners: Start with Qwen3-4B-Instruct-2507

2. Try an Example Prompt

Click on any example below the chat box to get started:

"Explain quantum computing in simple terms"
"Write a Python function..."
"What are the latest developments..." (requires web search)

3. Start Chatting!

Type your message, then press Enter or click "📤 Send".

Core Features

💬 Chat Interface

The main chat area shows:

Your messages on one side
AI responses with a 🤖 avatar
Copy button on each message
Smooth streaming as tokens generate

Tips:

Press Enter to send (Shift+Enter for new line)
Click the Copy button on a message to copy that response
Scroll up to review history
Use Clear Chat to start fresh

🤖 Model Selection

When to use each size:

Model Size	Best For	Speed	Quality
<2B	Quick questions, testing	⚡⚡⚡	⭐⭐
2-8B	General chat, coding help	⚡⚡	⭐⭐⭐
14B+	Complex reasoning, long-form	⚡	⭐⭐⭐⭐

Specialized Models:

Phi-4-mini-Reasoning: Math, logic problems
Qwen2.5-Coder: Programming tasks
DeepSeek-R1-Distill: Step-by-step reasoning
Apriel-1.5-15b-Thinker: Multimodal understanding

🔍 Web Search

Enable this when you need:

Current events and news
Recent information (after model training cutoff)
Facts that change frequently
Real-time data

How it works:

Toggle "🔍 Enable Web Search".
The web search settings accordion appears.
The system prompt updates automatically.
The search runs in the background, so it won't block the chat.
The results are injected into the model's context.
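The guide does not show how the background search is implemented. One plausible arrangement, sketched below, starts the search on a worker thread and waits for it only up to the configured timeout. The names `search_with_timeout` and `slow_search` are hypothetical and are not taken from the app's code.

```python
import threading
import time


def search_with_timeout(search_fn, query, timeout_s=5.0):
    """Run search_fn(query) on a worker thread; wait at most timeout_s seconds."""
    results = []

    def worker():
        try:
            results.extend(search_fn(query))
        except Exception:
            pass  # a failed search contributes no results

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout=timeout_s)  # stop waiting once the timeout elapses
    return list(results)  # empty if the search did not finish in time


def slow_search(query):
    """Hypothetical stand-in for a real search call."""
    time.sleep(0.5)
    return [f"result for {query}"]


# With a short timeout the caller moves on without results;
# with a longer one the results arrive in time.
print(search_with_timeout(slow_search, "spacex launch", timeout_s=0.05))
print(search_with_timeout(slow_search, "spacex launch", timeout_s=2.0))
```

In this sketch a search that misses the timeout is simply dropped, so the model answers from its own knowledge.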

Settings explained:

Max Results: How many search results to fetch (4 is a good default)
Max Chars/Result: Maximum number of characters kept from each result (50 keeps search text from crowding the context)
Search Timeout: Maximum time to wait for the search (5s recommended)
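To make the two size limits concrete, here is a minimal sketch of how Max Results and Max Chars/Result could be applied to raw search snippets. The function name `trim_results` and the whitespace cleanup are assumptions for illustration; the app's own handling may differ.

```python
def trim_results(snippets, max_results=4, max_chars=50):
    """Keep the first max_results snippets and cap each at max_chars characters."""
    trimmed = []
    for snippet in snippets[:max_results]:
        cleaned = " ".join(snippet.split())  # collapse runs of whitespace
        trimmed.append(cleaned[:max_chars])
    return trimmed


raw = ["First   result with a fairly long body of text " * 3, "Second", "Third", "Fourth", "Fifth"]
for line in trim_results(raw):
    print(repr(line))
```

With the values suggested above, at most 4 x 50 = 200 characters of search text reach the prompt.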

📝 System Prompt

This defines the AI's personality and behavior.

Default prompts:

Without search: Helpful, creative assistant
With search: Includes search results and current date
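As an illustration of the "with search" variant, the sketch below appends the current date and the search snippets to a base prompt. The function `build_system_prompt` and the exact prompt layout are hypothetical; the app's actual default prompt text is not shown in this guide.

```python
from datetime import date


def build_system_prompt(base_prompt, search_results=None, today=None):
    """Return the base prompt, extended with the date and search results if any."""
    if not search_results:
        return base_prompt  # no search: the prompt is used unchanged
    today = today or date.today()
    lines = [base_prompt, "", f"Current date: {today.isoformat()}", "Search results:"]
    lines.extend(f"- {snippet}" for snippet in search_results)
    return "\n".join(lines)


print(build_system_prompt(
    "You are a helpful, creative assistant.",
    search_results=["Snippet one", "Snippet two"],
))
```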

Customization ideas:

You are a professional code reviewer...
You are a creative writing coach...
You are a patient tutor explaining concepts simply...
You are a technical documentation writer...

Advanced Features

🎛️ Advanced Generation Parameters

Click the accordion to reveal these controls:

Max Tokens (64-16384)

What it does: Sets maximum response length
Lower (256-512): Quick, concise answers
Medium (1024): Balanced (default)
Higher (2048+): Long-form content, detailed explanations

Temperature (0.1-2.0)

What it does: Controls randomness/creativity
Low (0.1-0.3): Focused, near-deterministic (good for facts, code)
Medium (0.7): Balanced creativity (default)
High (1.2-2.0): Very creative, unpredictable (stories, brainstorming)
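The effect of temperature can be seen in a small standalone sketch: the model's raw scores (logits) are divided by the temperature before being turned into probabilities. The logit values below are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=0.7):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, while higher ones flatten the distribution so that less likely tokens are sampled more often.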

Top-K (1-100)

What it does: Limits sampling to the K most likely tokens
Lower (10-20): More focused
Medium (40): Balanced (default)
Higher (80-100): More varied vocabulary
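A minimal sketch of Top-K filtering, assuming the token probabilities are already available as a plain list (the values are made up): only the K most likely tokens are kept, and their probabilities are rescaled to sum to 1.

```python
def top_k_filter(probs, k=40):
    """Keep the k most likely tokens and renormalize their probabilities."""
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = ranked[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}  # token index -> new probability


probs = [0.5, 0.3, 0.15, 0.05]  # made-up distribution over four tokens
print(top_k_filter(probs, k=2))
```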

Top-P (0.1-1.0)

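What it does: Nucleus sampling threshold; sampling is limited to the smallest set of top tokens whose combined probability reaches P
Lower (0.5-0.7): Conservative choices
Medium (0.9): Balanced (default)
Higher (0.95-1.0): Nearly the full vocabulary stays eligible (1.0 applies no filtering)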
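Nucleus (Top-P) filtering can be sketched the same way, again with made-up probabilities: tokens are added in order of likelihood until their cumulative probability reaches P, and everything else is dropped.

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in ranked:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break  # the nucleus is complete
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}  # token index -> new probability


probs = [0.5, 0.3, 0.15, 0.05]  # made-up distribution over four tokens
print(sorted(top_p_filter(probs, p=0.7)))
print(sorted(top_p_filter(probs, p=0.9)))
```

Unlike Top-K, the number of surviving tokens varies from step to step: a confident distribution keeps few tokens, a flat one keeps many.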

Repetition Penalty (1.0-2.0)

What it does: Reduces repeated words/phrases
Low (1.0-1.1): Allows some repetition (1.0 applies no penalty)
Medium (1.2): Balanced (default)
High (1.5+): Strongly avoids repetition (may hurt coherence)
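One widely used formulation of the repetition penalty lowers the score of every token that has already appeared: positive logits are divided by the penalty and negative logits are multiplied by it. The sketch below assumes that formulation; this guide does not state which variant the app uses, and the logits are made up.

```python
def apply_repetition_penalty(logits, previous_token_ids, penalty=1.2):
    """Make tokens that already appeared less likely to be chosen again."""
    if penalty < 1.0:
        raise ValueError("penalty should be 1.0 or greater")
    adjusted = list(logits)
    for token_id in set(previous_token_ids):
        score = adjusted[token_id]
        # Dividing a positive score or multiplying a negative one both lower it.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


logits = [2.0, -1.0, 0.5]  # made-up scores for tokens 0, 1, 2
print(apply_repetition_penalty(logits, previous_token_ids=[0, 1], penalty=2.0))
```

A penalty of 1.0 leaves the logits unchanged, which matches the "allows repetition" end of the slider.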

Preset Configurations

For Creative Writing:

Temperature: 1.2
Top-P: 0.95
Top-K: 80
Max Tokens: 2048

For Code Generation:

Temperature: 0.3
Top-P: 0.9
Top-K: 40
Max Tokens: 1024
Repetition Penalty: 1.1

For Factual Q&A:

Temperature: 0.5
Top-P: 0.85
Top-K: 30
Max Tokens: 512
Enable Web Search: Yes

For Reasoning Tasks:

Model: Phi-4-mini-Reasoning or DeepSeek-R1-Distill
Temperature: 0.7
Max Tokens: 2048
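If you keep your own notes on working configurations, the presets above can be written down as data. The sketch below is only a way of recording them: the key names and the `settings_for` helper are hypothetical, the default values come from the parameter descriptions earlier in this guide, and the `web_search` default of off is an assumption.

```python
# Defaults as listed in the parameter descriptions; web_search=False is assumed.
DEFAULTS = {
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "max_tokens": 1024,
    "repetition_penalty": 1.2,
    "web_search": False,
}

# Each preset lists only the values the guide specifies for that task.
PRESETS = {
    "creative_writing": {"temperature": 1.2, "top_p": 0.95, "top_k": 80, "max_tokens": 2048},
    "code_generation": {"temperature": 0.3, "top_p": 0.9, "top_k": 40,
                        "max_tokens": 1024, "repetition_penalty": 1.1},
    "factual_qa": {"temperature": 0.5, "top_p": 0.85, "top_k": 30,
                   "max_tokens": 512, "web_search": True},
    "reasoning": {"temperature": 0.7, "max_tokens": 2048},
}


def settings_for(task):
    """Return the defaults with the named preset's values layered on top."""
    return {**DEFAULTS, **PRESETS[task]}


print(settings_for("factual_qa"))
```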

Tips & Tricks

🎯 Getting Better Results

Be Specific: "Write a Python function to sort a list" → "Write a Python function that sorts a list of dictionaries by a specific key"
Provide Context: "Explain recursion" → "Explain recursion to someone learning programming for the first time, with a simple example"
Use System Prompts: Define the role or expertise once in the system prompt instead of repeating it in every message
Iterate: Use follow-up questions to refine responses
Experiment with Models: Try different models for the same task

⚡ Performance Tips

Start Small: Test with smaller models first
Adjust Max Tokens: Don't request more than you need
Use Cancel: Stop bad generations early
Clear Cache: Clear chat if experiencing slowdowns
One Task at a Time: Don't send multiple requests simultaneously

🔍 When to Use Web Search

✅ Good use cases:

"What happened in the latest SpaceX launch?"
"Current cryptocurrency prices"
"Recent AI research papers"
"Today's weather in Paris"

❌ Search is not needed for:

General knowledge questions
Code writing/debugging
Math problems
Creative writing
Theoretical explanations

💭 Understanding Thinking Mode

Some models output <think>...</think> blocks:

<think>
Let me break this down step by step...
First, I need to consider...
</think>

Here's the answer: ...

In the UI:

Thinking shows as "💭 Thought"
Answer shows separately
Helps you see the reasoning process
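Separating the thought from the answer can be approximated with a regular expression. This is a minimal sketch that assumes the tags are well formed and closed; the function name `split_thinking` is hypothetical, and the app's own parser (which also has to cope with partial output while streaming) is not shown in this guide.

```python
import re

# Non-greedy match so several <think> blocks are captured separately.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text):
    """Return (thoughts, answer) from a completed model response."""
    thoughts = [block.strip() for block in THINK_PATTERN.findall(text)]
    answer = THINK_PATTERN.sub("", text).strip()
    return "\n".join(thoughts), answer


response = "<think>\nLet me break this down step by step...\n</think>\n\nHere's the answer: 4"
thought, answer = split_thinking(response)
print("Thought:", thought)
print("Answer:", answer)
```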

Best for:

Complex math problems
Multi-step reasoning
Debugging logic
Seeing how the model arrives at an answer

Troubleshooting

Generation is Slow

Try a smaller model
Reduce Max Tokens
Disable web search if not needed
Clear chat history

Responses are Repetitive

Increase Repetition Penalty
Reduce Temperature slightly
Try a different model

Responses are Random/Nonsensical

Decrease Temperature
Reduce Top-P
Reduce Top-K
Try a more stable model

Web Search Not Working

Check that the timeout isn't too short
Verify internet connection
Try increasing Max Results
Check search query in debug panel

Cancel Button Doesn't Work

Wait a moment (the request might still be processing)
Refresh the page if the problem persists
Check browser console for errors

Keyboard Shortcuts

Enter: Send message
Shift+Enter: New line in input
Ctrl+C: Copy (when text selected)
Ctrl+A: Select all in input

Best Practices

For Beginners

Start with example prompts
Use default settings initially
Try 2-4 different models
Gradually explore advanced settings
Read responses fully before replying

For Power Users

Create custom system prompts
Fine-tune parameters per task
Use debug panel for prompt engineering
Experiment with model combinations
Utilize web search strategically

For Developers

Study the debug output
Test code generation thoroughly
Use a lower temperature for more consistent, repeatable output
Compare multiple models
Save working configurations

Privacy & Safety

No data collection: Conversations are not stored permanently
Model limitations: May produce incorrect information
Verify important info: Don't rely solely on AI for critical decisions
Web search: Uses DuckDuckGo (privacy-focused)
Open source: Code is transparent and auditable

Support & Feedback

Found a bug? Have a suggestion?

Check GitHub issues
Submit feature requests
Contribute improvements
Share your use cases

Happy chatting! 🎉