---
title: ZeroGPU-LLM-Inference
emoji: 🧠
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Streaming LLM chat with web search and controls
---
# 🧠 ZeroGPU LLM Inference
A modern, user-friendly Gradio interface for token-streaming, chat-style inference across a wide variety of Transformer models, powered by ZeroGPU for free GPU acceleration on Hugging Face Spaces.
## ✨ Key Features
### 🎨 Modern UI/UX
- Clean, intuitive interface with organized layout and visual hierarchy
- Collapsible advanced settings for both simple and power users
- Smooth animations and transitions for better user experience
- Responsive design that works on all screen sizes
- Copy-to-clipboard functionality for easy sharing of responses
### 🔍 Web Search Integration
- Real-time DuckDuckGo search with background threading (sketched below)
- Configurable timeout and result limits
- Automatic context injection into system prompts
- Smart toggle - search settings auto-hide when disabled
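A rough idea of how a timed, non-blocking search can work (a minimal sketch, not the app's actual code; it assumes the `duckduckgo_search` package, and `fetch_snippets` is an illustrative name):

```python
import threading

from duckduckgo_search import DDGS  # assumed search backend


def fetch_snippets(query: str, max_results: int = 4, max_chars: int = 50, timeout: float = 5.0) -> list[str]:
    """Run a DuckDuckGo text search in a background thread; give up after `timeout` seconds."""
    snippets: list[str] = []

    def _worker() -> None:
        try:
            for hit in DDGS().text(query, max_results=max_results):
                snippets.append(hit.get("body", "")[:max_chars])  # trim to the per-result budget
        except Exception:
            pass  # on any search failure, the chat simply proceeds without web context

    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()
    thread.join(timeout)  # never block the UI for longer than the configured timeout
    return snippets       # whatever arrived in time (possibly empty)
```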
### 💡 Smart Features
- Thought vs. Answer streaming: `<think>…</think>` blocks are shown separately as "💭 Thought" (see the parsing sketch after this list)
- Working cancel button - immediately stops generation without errors
- Debug panel for prompt engineering insights
- Duration estimates based on model size and settings
- Example prompts to help users get started
- Dynamic system prompts with automatic date insertion
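How the thought/answer split can be done on the accumulated stream (a minimal sketch, assuming the model emits a single `<think>…</think>` block as reasoning models such as DeepSeek-R1 do; `split_thought` is an illustrative name):

```python
def split_thought(text: str) -> tuple[str, str]:
    """Split accumulated streamed text into (thought, visible answer)."""
    if "<think>" not in text:
        return "", text
    before, _, rest = text.partition("<think>")
    thought, closed, after = rest.partition("</think>")
    if not closed:
        # The closing tag has not streamed in yet: keep showing the partial thought
        return thought, before
    return thought, before + after


# Example with a fully streamed response:
print(split_thought("<think>Check the units first.</think>The answer is 42."))
# -> ('Check the units first.', 'The answer is 42.')
```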
### 🎯 Model Variety
- 30+ LLM options from leading providers (Qwen, Microsoft, Meta, Mistral, etc.)
- Models ranging from 135M to 32B+ parameters
- Specialized models for reasoning, coding, and general chat
- Efficient model loading - one at a time with automatic cache clearing
### ⚙️ Advanced Controls
- Generation parameters: max tokens, temperature, top-k, top-p, repetition penalty
- Web search settings: max results, chars per result, timeout
- Custom system prompts with dynamic date insertion
- Organized in collapsible sections to keep interface clean
## 📋 Supported Models
### Compact Models (< 2B)
- SmolLM2-135M-Instruct - Tiny but capable
- SmolLM2-360M-Instruct - Lightweight conversation
- Taiwan-ELM-270M/1.1B - Multilingual support
- Qwen3-0.6B/1.7B - Fast inference
### Mid-Size Models (2B-8B)
- Qwen3-4B/8B - Balanced performance
- Phi-4-mini (4.3B) - Reasoning & Instruct variants
- MiniCPM3-4B - Efficient mid-size
- Gemma-3-4B-IT - Instruction-tuned
- Llama-3.2-Taiwan-3B - Regional optimization
- Mistral-7B-Instruct - Classic performer
- DeepSeek-R1-Distill-Llama-8B - Reasoning specialist
### Large Models (14B+)
- Qwen3-14B - Strong general purpose
- Apriel-1.5-15b-Thinker - Multimodal reasoning
- gpt-oss-20b - Open GPT-style
- Qwen3-32B - Top-tier performance
## 🚀 How It Works
1. Select Model - Choose from 30+ pre-configured models
2. Configure Settings - Adjust generation parameters or use defaults
3. Enable Web Search (optional) - Get real-time information
4. Start Chatting - Type your message or use example prompts
5. Stream Response - Watch as tokens are generated in real-time
6. Cancel Anytime - Stop generation mid-stream if needed
### Technical Flow
1. User message enters chat history
2. If search is enabled, a background thread fetches DuckDuckGo results
3. Search snippets merge into the system prompt (within the timeout limit)
4. Selected model pipeline loads on ZeroGPU (bf16 → f16 → f32 fallback; sketched below)
5. Prompt is formatted with thinking-mode detection
6. Tokens stream to the UI with thought/answer separation
7. Cancel button is available for immediate interruption
8. Memory is cleared after generation for the next request
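The dtype fallback in step 4 might look roughly like this (a minimal sketch, assuming the `spaces` package that Hugging Face provides for ZeroGPU; `load_pipeline` and `generate` are illustrative names, not the app's actual functions):

```python
import spaces  # ZeroGPU helper available on Hugging Face Spaces
import torch
from transformers import pipeline


def load_pipeline(repo_id: str):
    """Try progressively wider dtypes until the checkpoint loads."""
    for dtype in (torch.bfloat16, torch.float16, torch.float32):
        try:
            return pipeline("text-generation", model=repo_id, torch_dtype=dtype, device_map="auto")
        except Exception:
            continue  # fall back to the next dtype
    raise RuntimeError(f"Could not load {repo_id} with any supported dtype")


@spaces.GPU  # a GPU is attached only for the duration of this call
def generate(repo_id: str, prompt: str, **gen_kwargs) -> str:
    pipe = load_pipeline(repo_id)
    return pipe(prompt, **gen_kwargs)[0]["generated_text"]
```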
## ⚙️ Generation Parameters
| Parameter | Range | Default | Description |
|---|---|---|---|
| Max Tokens | 64-16384 | 1024 | Maximum response length |
| Temperature | 0.1-2.0 | 0.7 | Creativity vs focus |
| Top-K | 1-100 | 40 | Token sampling pool size |
| Top-P | 0.1-1.0 | 0.9 | Nucleus sampling threshold |
| Repetition Penalty | 1.0-2.0 | 1.2 | Reduce repetition |
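These sliders correspond to standard Transformers sampling arguments. A self-contained sketch with the defaults above (the repo id is assumed to be the upstream SmolLM2 checkpoint, one of the compact options listed earlier):

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-135M-Instruct")
output = pipe(
    "Explain nucleus sampling in one sentence.",
    max_new_tokens=1024,     # Max Tokens
    do_sample=True,          # sampling must be enabled for the knobs below to matter
    temperature=0.7,         # Temperature
    top_k=40,                # Top-K
    top_p=0.9,               # Top-P
    repetition_penalty=1.2,  # Repetition Penalty
)
print(output[0]["generated_text"])
```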
## 🔍 Web Search Settings
| Setting | Range | Default | Description |
|---|---|---|---|
| Max Results | Integer | 4 | Number of search results |
| Max Chars/Result | Integer | 50 | Character limit per result |
| Search Timeout | 0-30s | 5s | Maximum wait time |
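Putting the search settings together with the dynamic system prompt, the context-injection step can be sketched like this (illustrative only: the `{date}` placeholder and `build_system_prompt` name are assumptions, and `fetch_snippets` refers to the search sketch above):

```python
from datetime import date


def build_system_prompt(base_prompt: str, snippets: list[str]) -> str:
    """Insert today's date and any trimmed search snippets into the system prompt."""
    prompt = base_prompt.replace("{date}", date.today().isoformat())
    if snippets:
        bullet_list = "\n".join(f"- {s}" for s in snippets)
        prompt += f"\n\nWeb search results:\n{bullet_list}"
    return prompt


# Example usage with the earlier search helper:
# system = build_system_prompt(
#     "You are a helpful assistant. Today is {date}.",
#     fetch_snippets("latest Gradio release", max_results=4, max_chars=50, timeout=5),
# )
```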
## 💻 Local Development
```bash
# Clone the repository
git clone https://huggingface.co/spaces/Luigi/ZeroGPU-LLM-Inference
cd ZeroGPU-LLM-Inference

# Install dependencies
pip install -r requirements.txt

# Run the app
python app.py
```
## 🎨 UI Design Philosophy
The interface follows these principles:
- Simplicity First - Core features immediately visible
- Progressive Disclosure - Advanced options hidden but accessible
- Visual Hierarchy - Clear organization with groups and sections
- Feedback - Status indicators and helpful messages
- Accessibility - Responsive, keyboard-friendly, with tooltips
## 🔧 Customization
### Adding New Models
Edit the `MODELS` dictionary in `app.py`:
"Your-Model-Name": {
"repo_id": "org/model-name",
"description": "Model description",
"params_b": 7.0 # Size in billions
}
### Modifying UI Theme
Adjust the theme parameters passed to `gr.Blocks()`:
```python
theme=gr.themes.Soft(
    primary_hue="indigo",
    secondary_hue="purple",
    # ... more options
)
```
## 📈 Performance
- Token streaming for responsive feel
- Background search doesn't block UI
- Efficient memory management with cache clearing (see the cleanup sketch below)
- ZeroGPU acceleration for fast inference
- Optimized loading with dtype fallbacks
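The cache clearing between requests could look roughly like this (a minimal sketch; `clear_memory` is an illustrative name):

```python
import gc

import torch


def clear_memory() -> None:
    """Reclaim RAM/VRAM after the previous pipeline reference has been dropped."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


# Typical use between requests: drop the old pipeline, then clear caches.
# pipe = None
# clear_memory()
```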
## 🤝 Contributing
Contributions welcome! Areas for improvement:
- Additional model integrations
- UI/UX enhancements
- Performance optimizations
- Bug fixes and testing
- Documentation improvements
## 📄 License
Apache 2.0 - See LICENSE file for details
## 🙏 Acknowledgments
- Built with Gradio
- Powered by Hugging Face Transformers
- Uses ZeroGPU for acceleration
- Search via DuckDuckGo
Made with ❤️ for the open source community