Orbi-1B GGUF

Quantized GGUF version of Orbi-1B, a fine-tuned TinyLlama-1.1B-Chat specialized for function calling and robotic assistant interactions. This model generates structured tool calls in response to natural language commands and is optimized for CPU inference with llama.cpp.

Model Description

  • Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
  • Model Size: 1.1B parameters
  • Format: GGUF (llama.cpp compatible)
  • Quantization: Q4_K_M (4-bit quantization)
  • Optimized for: CPU inference, low memory usage
  • License: Apache 2.0

Why GGUF?

GGUF (GPT-Generated Unified Format) offers several advantages:

  • Faster CPU Inference: Optimized for running on CPU without GPU
  • Lower Memory Usage: 4-bit quantization reduces model size by roughly 70% (~2.2GB → ~650MB; see the rough estimate after this list)
  • Cross-Platform: Works on Windows, Linux, macOS (including Apple Silicon)
  • No GPU Required: Perfect for edge devices and embedded systems
  • Efficient: Powered by llama.cpp's optimized C++ inference engine
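
The size reduction follows from simple arithmetic. A back-of-the-envelope sketch, assuming FP16 weights (2 bytes per parameter) for the original model and an average of roughly 4.8 bits per weight for Q4_K_M (typical values, not measured from this checkpoint):

# Rough size estimate for a 1.1B-parameter model (illustrative figures only)
params = 1.1e9

fp16_bytes = params * 2          # FP16: 2 bytes per weight   -> ~2.2 GB
q4_km_bytes = params * 4.8 / 8   # Q4_K_M: ~4.8 bits per weight -> ~0.66 GB

print(f"FP16:      {fp16_bytes / 1e9:.2f} GB")
print(f"Q4_K_M:    {q4_km_bytes / 1e9:.2f} GB")
print(f"Reduction: {1 - q4_km_bytes / fp16_bytes:.0%}")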

File Information

File              Quant    Size     Use Case
orbi-1b-q4.gguf   Q4_K_M   ~650MB   Recommended - Best balance of speed and quality
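
If you prefer to fetch the file programmatically rather than through the browser, the huggingface_hub client (pip install huggingface_hub) can download it directly. A minimal sketch; the repo id and filename are taken from this card:

from huggingface_hub import hf_hub_download

# Download the quantized weights from the Hugging Face Hub
model_path = hf_hub_download(
    repo_id="Arojit/orbi-1b-gguf",
    filename="orbi-1b-q4.gguf",
)
print(model_path)  # local path to pass to llama_cpp.Llama(model_path=...)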

Installation

Requirements

pip install llama-cpp-python

For GPU acceleration (optional):

# CUDA
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

# Metal (macOS)
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

Usage

Basic Inference

import json
import re
from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="orbi-1b-q4.gguf",
    n_ctx=4096,
    n_threads=8,
    use_mlock=True,
    verbose=False
)

# System prompt
system_prompt = """You are Orbi's brain.
Respond with one or more <tool_call> JSON blocks, in the exact order the user requests actions.
Call the best tool for each part of the user's request. Do not write stories yourself.
Do not summarize news yourself. Map synonyms to the tool argument enums.
If parameters are missing, pick sensible defaults. Keep outputs terse.

Available tools and enums:
- smile() -> {}
- cry() -> {}
- move_hands(direction ∈ {left,right,up,down,wave}, speed ∈ {slow,normal,fast})
- dance(style ∈ {hiphop,ballet,robot,random}, duration_sec ∈ [10..120])
- tell_news(topic: string)
- tell_story(topic: string, tone ∈ {wholesome,funny,dramatic,spooky,random}, length ∈ {short,medium,long})
"""

# Build prompt
user_input = "Wave your hands quickly and smile"
prompt = f"<|system|>\n{system_prompt}\n<|user|>\n{user_input}\n<|assistant|>\n"

# Generate
output = llm(
    prompt,
    max_tokens=256,
    temperature=0.0,
)

response = output["choices"][0]["text"]
print(response)
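
To receive tokens as they are generated instead of waiting for the full response, the same call supports streaming. A minimal sketch reusing llm and prompt from above:

# Stream the completion token by token
for chunk in llm(prompt, max_tokens=256, temperature=0.0, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()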

Interactive Controller

Save this as orbi_controller.py:

import json
import re
from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="orbi-1b-q4.gguf",
    n_ctx=4096,
    n_threads=8,
)

# System prompt (same as in the Basic Inference example above)
system_prompt = """You are Orbi's brain.
Respond with one or more <tool_call> JSON blocks, in the exact order the user requests actions.
Call the best tool for each part of the user's request. Do not write stories yourself.
Do not summarize news yourself. Map synonyms to the tool argument enums.
If parameters are missing, pick sensible defaults. Keep outputs terse.

Available tools and enums:
- smile() -> {}
- cry() -> {}
- move_hands(direction ∈ {left,right,up,down,wave}, speed ∈ {slow,normal,fast})
- dance(style ∈ {hiphop,ballet,robot,random}, duration_sec ∈ [10..120])
- tell_news(topic: string)
- tell_story(topic: string, tone ∈ {wholesome,funny,dramatic,spooky,random}, length ∈ {short,medium,long})
"""

def parse_tool_calls(text):
    pattern = r"<tool_call>\s*(\{.*?\})\s*</tool_call>"
    matches = re.findall(pattern, text, re.DOTALL)
    tools = []
    for m in matches:
        try:
            tools.append(json.loads(m))
        except json.JSONDecodeError:
            # Skip malformed JSON blocks
            continue
    return tools

# Interactive loop
print("๐Ÿค– Orbi is ready! Type 'exit' to quit.\n")
while True:
    user_input = input("You: ").strip()
    if user_input.lower() in {"exit", "quit"}:
        break
    
    prompt = f"<|system|>\n{system_prompt}\n<|user|>\n{user_input}\n<|assistant|>\n"
    output = llm(prompt, max_tokens=256, temperature=0.0)
    response = output["choices"][0]["text"]
    
    tools = parse_tool_calls(response)
    print(f"Orbi: {json.dumps(tools, indent=2)}\n")

Expected Output Format

<tool_call>
{"name": "move_hands", "arguments": {"direction": "wave", "speed": "fast"}}
</tool_call>
<tool_call>
{"name": "smile", "arguments": {}}
</tool_call>
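
To act on this output, the list returned by parse_tool_calls (from orbi_controller.py above) can be routed to ordinary Python functions. The sketch below is illustrative and not part of this repository; the print statements stand in for real robot actions:

# Hypothetical handlers standing in for real robot actions
def smile(**kwargs):
    print("Orbi smiles")

def move_hands(direction="wave", speed="normal", **kwargs):
    print(f"Orbi moves hands: {direction} ({speed})")

HANDLERS = {"smile": smile, "move_hands": move_hands}

def dispatch(tool_calls):
    """Run each parsed tool call in the order the model emitted it."""
    for call in tool_calls:
        handler = HANDLERS.get(call.get("name"))
        if handler is None:
            print(f"Unknown tool: {call.get('name')}")
            continue
        handler(**call.get("arguments", {}))

# The expected output above parses to this list
dispatch([
    {"name": "move_hands", "arguments": {"direction": "wave", "speed": "fast"}},
    {"name": "smile", "arguments": {}},
])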

Performance Benchmarks

Approximate inference speeds on different hardware:

Hardware            Tokens/sec   Memory Usage
M1 MacBook Pro      ~45 t/s      800MB
Intel i7-12700K     ~35 t/s      750MB
Raspberry Pi 5      ~8 t/s       700MB
AMD Ryzen 7 5800X   ~40 t/s      750MB

Note: Actual performance may vary based on context length and system configuration.

Supported Tools

  • Physical Actions: smile(), cry(), move_hands(), dance()
  • Content Generation: tell_news(), tell_story()
  • Information: whats_your_name(), who_am_i()
  • Utilities: answer_arithmetic(), english_learning()

Configuration Options

llama-cpp-python Parameters

llm = Llama(
    model_path="orbi-1b-q4.gguf",
    n_ctx=4096,              # Context window size
    n_threads=8,             # CPU threads (adjust based on your CPU)
    n_gpu_layers=0,          # Set > 0 for GPU offloading
    use_mlock=True,          # Lock model in RAM (prevents swapping)
    verbose=False,           # Disable verbose logging
    seed=42,                 # Set seed for reproducibility
)

Generation Parameters

output = llm(
    prompt,
    max_tokens=256,          # Maximum tokens to generate
    temperature=0.0,         # 0.0 = greedy (recommended for tool calling)
    top_p=0.95,             # Nucleus sampling
    repeat_penalty=1.1,      # Penalize repetition
    stop=["</tool_call>"],   # Stop sequences
)

Use Cases

  • Robotics: Control physical robots with natural language
  • IoT Devices: Run on Raspberry Pi or similar edge devices
  • Embedded Systems: Low-memory environments
  • Offline Applications: No internet connection required
  • Desktop Assistants: CPU-only machines without GPU

Limitations

  • Quantization may result in slight quality degradation compared to full precision
  • Best performance with greedy decoding (temperature=0.0)
  • Limited to the predefined set of tools
  • Context window in the examples above is set to 4096 tokens (n_ctx=4096); the TinyLlama base model was trained with a 2048-token context

Model Details

Training

  • Method: LoRA fine-tuning on TinyLlama-1.1B-Chat (a rough sketch of such a setup follows this list)
  • Dataset: Custom conversational dataset with tool calling examples
  • Framework: Transformers + PEFT + TRL
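
For context, the snippet below sketches what a LoRA setup on the TinyLlama base typically looks like with PEFT. The rank, alpha, dropout, and target modules are illustrative defaults, not the configuration actually used for Orbi-1B, and the dataset preparation and TRL training loop are omitted:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Illustrative LoRA hyperparameters (the real ones are not documented on this card)
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable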

Quantization

  • Method: Q4_K_M quantization via llama.cpp
  • Benefits: ~75% size reduction with minimal quality loss
  • Original Size: ~2.2GB → GGUF Size: ~650MB

Troubleshooting

Model loads slowly

  • Enable use_mlock=True to keep model in RAM
  • Increase n_threads based on your CPU cores

Out of memory

  • Reduce n_ctx (context window size)
  • Close other applications
  • Use a lower quantization (Q2_K or Q3_K_M)

Slow inference

  • Increase n_threads to match your CPU cores
  • Enable GPU offloading with n_gpu_layers (see the example after this list)
  • Reduce max_tokens if generating long responses
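
As a concrete example of the GPU tip, the sketch below (assuming llama-cpp-python was built with CUDA or Metal support as shown in Installation) offloads every layer and caps generation length:

from llama_cpp import Llama

llm = Llama(
    model_path="orbi-1b-q4.gguf",
    n_ctx=4096,
    n_threads=8,       # still used for any layers left on the CPU
    n_gpu_layers=-1,   # -1 offloads all layers; use a smaller number for partial offload
)

# Shorter max_tokens also shortens worst-case response time
prompt = "<|system|>\nYou are Orbi's brain.\n<|user|>\nSmile\n<|assistant|>\n"  # see Usage for the full system prompt
output = llm(prompt, max_tokens=128, temperature=0.0)
print(output["choices"][0]["text"])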

License

Apache 2.0 (inherited from TinyLlama base model)

Citation

@misc{orbi-1b-gguf,
  title={Orbi-1B GGUF: Quantized Function Calling Model},
  author={Arojit Ghosh},
  year={2025},
  howpublished={\url{https://huggingface.co/Arojit/orbi-1b-gguf}}
}
