Orbi-1B GGUF
Quantized GGUF version of Orbi-1B, a fine-tuned TinyLlama-1.1B-Chat specialized for function calling and robotic assistant interactions. This model generates structured tool calls in response to natural language commands and is optimized for CPU inference with llama.cpp.
Model Description
- Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
- Model Size: 1.1B parameters
- Format: GGUF (llama.cpp compatible)
- Quantization: Q4_K_M (4-bit quantization)
- Optimized for: CPU inference, low memory usage
- License: Apache 2.0
Why GGUF?
GGUF (GPT-Generated Unified Format) offers several advantages:
- Faster CPU Inference: Optimized for running on CPU without GPU
- Lower Memory Usage: 4-bit quantization reduces model size by ~75%
- Cross-Platform: Works on Windows, Linux, macOS (including Apple Silicon)
- No GPU Required: Perfect for edge devices and embedded systems
- Efficient: Powered by llama.cpp's optimized C++ inference engine
File Information
| File | Quant | Size | Use Case |
|---|---|---|---|
| orbi-1b-q4.gguf | Q4_K_M | ~650MB | Recommended - Best balance of speed and quality |
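If you prefer to fetch the file programmatically instead of downloading it by hand, the snippet below is a minimal sketch using the huggingface_hub package (installed separately with pip install huggingface_hub); the repo id and filename are the ones from this card.

# Sketch: download the quantized file from the Hub and get its local cache path.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="Arojit/orbi-1b-gguf",
    filename="orbi-1b-q4.gguf",
)
print(model_path)  # pass this path to Llama(model_path=...)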
Installation
Requirements
pip install llama-cpp-python
For GPU acceleration (optional):
# CUDA
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
# Metal (macOS)
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
Usage
Basic Inference
import json
import re
from llama_cpp import Llama
# Load model
llm = Llama(
    model_path="orbi-1b-q4.gguf",
    n_ctx=4096,
    n_threads=8,
    use_mlock=True,
    verbose=False,
)
# System prompt
system_prompt = """You are Orbi's brain.
Respond with one or more <tool_call> JSON blocks, in the exact order the user requests actions.
Each block calls the best tool for the user's request. Do not write stories yourself.
Do not summarize news yourself. Map synonyms to the tool argument enums.
If parameters are missing, pick sensible defaults. Keep outputs terse.
Available tools and enums:
- smile() -> {}
- cry() -> {}
- move_hands(direction ∈ {left,right,up,down,wave}, speed ∈ {slow,normal,fast})
- dance(style ∈ {hiphop,ballet,robot,random}, duration_sec ∈ [10..120])
- tell_news(topic: string)
- tell_story(topic: string, tone ∈ {wholesome,funny,dramatic,spooky,random}, length ∈ {short,medium,long})
"""
# Build prompt
user_input = "Wave your hands quickly and smile"
prompt = f"<|system|>\n{system_prompt}\n<|user|>\n{user_input}\n<|assistant|>\n"
# Generate
output = llm(
    prompt,
    max_tokens=256,
    temperature=0.0,
)
response = output["choices"][0]["text"]
print(response)
Interactive Controller
Save this as orbi_controller.py:
import json
import re
from llama_cpp import Llama
# Load model
llm = Llama(
    model_path="orbi-1b-q4.gguf",
    n_ctx=4096,
    n_threads=8,
)

# Same system prompt as in the Basic Inference example above
system_prompt = """You are Orbi's brain.
Respond with one or more <tool_call> JSON blocks, in the exact order the user requests actions.
Each block calls the best tool for the user's request. Do not write stories yourself.
Do not summarize news yourself. Map synonyms to the tool argument enums.
If parameters are missing, pick sensible defaults. Keep outputs terse.
Available tools and enums:
- smile() -> {}
- cry() -> {}
- move_hands(direction ∈ {left,right,up,down,wave}, speed ∈ {slow,normal,fast})
- dance(style ∈ {hiphop,ballet,robot,random}, duration_sec ∈ [10..120])
- tell_news(topic: string)
- tell_story(topic: string, tone ∈ {wholesome,funny,dramatic,spooky,random}, length ∈ {short,medium,long})
"""
def parse_tool_calls(text):
    """Extract JSON payloads from <tool_call>...</tool_call> blocks."""
    pattern = r"<tool_call>\s*(\{.*?\})\s*</tool_call>"
    matches = re.findall(pattern, text, re.DOTALL)
    tools = []
    for m in matches:
        try:
            tools.append(json.loads(m))
        except json.JSONDecodeError:
            continue
    return tools
# Interactive loop
print("๐ค Orbi is ready! Type 'exit' to quit.\n")
while True:
user_input = input("You: ").strip()
if user_input.lower() in {"exit", "quit"}:
break
prompt = f"<|system|>\n{system_prompt}\n<|user|>\n{user_input}\n<|assistant|>\n"
output = llm(prompt, max_tokens=256, temperature=0.0)
response = output["choices"][0]["text"]
tools = parse_tool_calls(response)
print(f"Orbi: {json.dumps(tools, indent=2)}\n")
Expected Output Format
<tool_call>
{"name": "move_hands", "arguments": {"direction": "wave", "speed": "fast"}}
</tool_call>
<tool_call>
{"name": "smile", "arguments": {}}
</tool_call>
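As a sanity check, running the parse_tool_calls helper from orbi_controller.py over exactly that text yields plain Python dictionaries ready to act on:

# Sketch: the expected output above, parsed with the parse_tool_calls helper
# defined in orbi_controller.py.
raw = """<tool_call>
{"name": "move_hands", "arguments": {"direction": "wave", "speed": "fast"}}
</tool_call>
<tool_call>
{"name": "smile", "arguments": {}}
</tool_call>"""

print(parse_tool_calls(raw))
# [{'name': 'move_hands', 'arguments': {'direction': 'wave', 'speed': 'fast'}},
#  {'name': 'smile', 'arguments': {}}]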
Performance Benchmarks
Approximate inference speeds on different hardware:
| Hardware | Tokens/sec | Memory Usage |
|---|---|---|
| M1 MacBook Pro | ~45 t/s | 800MB |
| Intel i7-12700K | ~35 t/s | 750MB |
| Raspberry Pi 5 | ~8 t/s | 700MB |
| AMD Ryzen 7 5800X | ~40 t/s | 750MB |
Note: Actual performance may vary based on context length and system configuration.
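To reproduce a rough tokens/sec figure on your own hardware, timing a single completion is enough; the sketch below assumes the GGUF file is in the working directory and relies on the usage block that llama-cpp-python includes in its completion dict.

# Sketch: rough throughput measurement on the current machine.
import time
from llama_cpp import Llama

llm = Llama(model_path="orbi-1b-q4.gguf", n_ctx=4096, n_threads=8, verbose=False)

prompt = "<|system|>\nYou are Orbi's brain.\n<|user|>\nWave your hands quickly\n<|assistant|>\n"
start = time.perf_counter()
output = llm(prompt, max_tokens=128, temperature=0.0)
elapsed = time.perf_counter() - start

generated = output["usage"]["completion_tokens"]
print(f"{generated / elapsed:.1f} tokens/sec over {generated} tokens")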
Supported Tools
- Physical Actions: smile(), cry(), move_hands(), dance()
- Content Generation: tell_news(), tell_story()
- Information: whats_your_name(), who_am_i()
- Utilities: answer_arithmetic(), english_learning()
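On the robot side, a plain dispatch table from tool name to handler is usually enough to act on the parsed calls; the handler functions below are hypothetical stubs, only the tool names come from the list above.

# Sketch: dispatching parsed tool calls. Handler bodies are hypothetical stubs.
def handle_smile(**_):
    print("robot: smiling")

def handle_move_hands(direction="wave", speed="normal", **_):
    print(f"robot: moving hands ({direction}, {speed})")

HANDLERS = {
    "smile": handle_smile,
    "move_hands": handle_move_hands,
    # register the remaining tools the same way
}

def dispatch(tool_calls):
    for call in tool_calls:
        handler = HANDLERS.get(call.get("name"))
        if handler is None:
            print(f"unknown tool call: {call}")
            continue
        handler(**call.get("arguments", {}))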
Configuration Options
llama-cpp-python Parameters
llm = Llama(
    model_path="orbi-1b-q4.gguf",
    n_ctx=4096,          # Context window size
    n_threads=8,         # CPU threads (adjust based on your CPU)
    n_gpu_layers=0,      # Set > 0 for GPU offloading
    use_mlock=True,      # Lock model in RAM (prevents swapping)
    verbose=False,       # Disable verbose logging
    seed=42,             # Set seed for reproducibility
)
Generation Parameters
output = llm(
    prompt,
    max_tokens=256,        # Maximum tokens to generate
    temperature=0.0,       # 0.0 = greedy (recommended for tool calling)
    top_p=0.95,            # Nucleus sampling
    repeat_penalty=1.1,    # Penalize repetition
    stop=["</tool_call>"], # Stop sequence (omit if multiple tool calls are expected)
)
Use Cases
- Robotics: Control physical robots with natural language
- IoT Devices: Run on Raspberry Pi or similar edge devices
- Embedded Systems: Low-memory environments
- Offline Applications: No internet connection required
- Desktop Assistants: CPU-only machines without GPU
Limitations
- Quantization may result in slight quality degradation compared to full precision
- Best performance with greedy decoding (temperature=0.0)
- Limited to the predefined set of tools
- Context window is 4096 tokens (inherited from base model)
Model Details
Training
- Method: LoRA fine-tuning on TinyLlama-1.1B-Chat
- Dataset: Custom conversational dataset with tool calling examples
- Framework: Transformers + PEFT + TRL
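The exact training recipe is not published in this card. For orientation only, a LoRA setup on the base checkpoint with transformers + PEFT typically looks like the sketch below; every hyperparameter shown is an assumption, not the value used for Orbi-1B.

# Illustrative only: a typical LoRA setup on the base model.
# All hyperparameters here are assumptions, not Orbi-1B's actual configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_config = LoraConfig(
    r=16,                                # assumed rank
    lora_alpha=32,                       # assumed scaling
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()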
Quantization
- Method: Q4_K_M quantization via llama.cpp
- Benefits: ~75% size reduction with minimal quality loss
- Original Size: ~2.2GB → GGUF Size: ~650MB
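For readers who want to quantize their own fine-tune the same way, the usual llama.cpp workflow is a two-step convert-then-quantize; the sketch below drives the llama.cpp command-line tools from Python and assumes a recent llama.cpp checkout with the binaries built (all paths are assumptions, not taken from this card).

# Sketch: typical llama.cpp convert + quantize workflow.
import subprocess

# 1. Convert the fine-tuned Hugging Face checkpoint to a full-precision GGUF file.
subprocess.run(
    ["python", "llama.cpp/convert_hf_to_gguf.py", "path/to/orbi-1b",
     "--outtype", "f16", "--outfile", "orbi-1b-f16.gguf"],
    check=True,
)

# 2. Quantize to Q4_K_M.
subprocess.run(
    ["llama.cpp/build/bin/llama-quantize", "orbi-1b-f16.gguf", "orbi-1b-q4.gguf", "Q4_K_M"],
    check=True,
)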
Troubleshooting
Model loads slowly
- Enable use_mlock=True to keep the model in RAM
- Increase n_threads based on your CPU cores
Out of memory
- Reduce n_ctx (context window size)
- Close other applications
- Use a lower quantization (Q2_K or Q3_K_M)
Slow inference
- Increase n_threads to match your CPU cores
- Enable GPU offloading with n_gpu_layers
- Reduce max_tokens if generating long responses
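Putting these tips together, a leaner configuration might look like the sketch below; the specific values are illustrative, not tuned recommendations.

# Sketch: memory- and speed-conscious settings combining the tips above.
import os
from llama_cpp import Llama

llm = Llama(
    model_path="orbi-1b-q4.gguf",
    n_ctx=2048,                     # smaller context window to save memory
    n_threads=os.cpu_count() or 4,  # match available CPU cores
    n_gpu_layers=-1,                # offload all layers if a GPU-enabled build is installed
    use_mlock=True,                 # keep the model resident in RAM
)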
License
Apache 2.0 (inherited from TinyLlama base model)
Citation
@misc{orbi-1b-gguf,
title={Orbi-1B GGUF: Quantized Function Calling Model},
author={Arojit Ghosh},
year={2025},
howpublished={\url{https://huggingface.co/Arojit/orbi-1b-gguf}}
}
Related Models
- Full Precision: Arojit/orbi-1b
- Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
Acknowledgments
- Built on TinyLlama
- Quantized using llama.cpp
- Powered by llama-cpp-python