Orbi-1B GGUF

Quantized GGUF version of Orbi-1B, a fine-tuned TinyLlama-1.1B-Chat specialized for function calling and robotic assistant interactions. This model generates structured tool calls in response to natural language commands and is optimized for CPU inference with llama.cpp.

Model Description

  • Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
  • Model Size: 1.1B parameters
  • Format: GGUF (llama.cpp compatible)
  • Quantization: Q4_K_M (4-bit quantization)
  • Optimized for: CPU inference, low memory usage
  • License: Apache 2.0

Why GGUF?

GGUF (GPT-Generated Unified Format) offers several advantages:

  • Faster CPU Inference: Optimized for running on CPU without GPU
  • Lower Memory Usage: 4-bit quantization reduces model size by roughly 70% (~2.2GB → ~650MB; see the rough estimate after this list)
  • Cross-Platform: Works on Windows, Linux, macOS (including Apple Silicon)
  • No GPU Required: Perfect for edge devices and embedded systems
  • Efficient: Powered by llama.cpp's optimized C++ inference engine
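
The size reduction follows from simple arithmetic. A back-of-the-envelope sketch, assuming FP16 weights (2 bytes per parameter) for the original model and an average of roughly 4.8 bits per weight for Q4_K_M (typical values, not measured from this checkpoint):

# Rough size estimate for a 1.1B-parameter model (illustrative figures only)
params = 1.1e9

fp16_bytes = params * 2          # FP16: 2 bytes per weight   -> ~2.2 GB
q4_km_bytes = params * 4.8 / 8   # Q4_K_M: ~4.8 bits per weight -> ~0.66 GB

print(f"FP16:      {fp16_bytes / 1e9:.2f} GB")
print(f"Q4_K_M:    {q4_km_bytes / 1e9:.2f} GB")
print(f"Reduction: {1 - q4_km_bytes / fp16_bytes:.0%}")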

File Information

File              Quant    Size     Use Case
orbi-1b-q4.gguf   Q4_K_M   ~650MB   Recommended - Best balance of speed and quality
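
If you prefer to fetch the file programmatically rather than through the browser, the huggingface_hub client (pip install huggingface_hub) can download it directly. A minimal sketch; the repo id and filename are taken from this card:

from huggingface_hub import hf_hub_download

# Download the quantized weights from the Hugging Face Hub
model_path = hf_hub_download(
    repo_id="Arojit/orbi-1b-gguf",
    filename="orbi-1b-q4.gguf",
)
print(model_path)  # local path to pass to llama_cpp.Llama(model_path=...)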

Installation

Requirements

pip install llama-cpp-python

For GPU acceleration (optional):

# CUDA
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

# Metal (macOS)
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

Usage

Basic Inference

import json
import re
from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="orbi-1b-q4.gguf",
    n_ctx=4096,
    n_threads=8,
    use_mlock=True,
    verbose=False
)

# System prompt
system_prompt = """You are Orbi's brain.
Respond with one or more <tool_call> JSON blocks, in the exact order the user requests actions.
Call the best tool for each part of the user's request. Do not write stories yourself.
Do not summarize news yourself. Map synonyms to the tool argument enums.
If parameters are missing, pick sensible defaults. Keep outputs terse.

Available tools and enums:
- smile() -> {}
- cry() -> {}
- move_hands(direction ∈ {left,right,up,down,wave}, speed ∈ {slow,normal,fast})
- dance(style ∈ {hiphop,ballet,robot,random}, duration_sec ∈ [10..120])
- tell_news(topic: string)
- tell_story(topic: string, tone ∈ {wholesome,funny,dramatic,spooky,random}, length ∈ {short,medium,long})
"""

# Build prompt
user_input = "Wave your hands quickly and smile"
prompt = f"<|system|>\n{system_prompt}\n<|user|>\n{user_input}\n<|assistant|>\n"

# Generate
output = llm(
    prompt,
    max_tokens=256,
    temperature=0.0,
)

response = output["choices"][0]["text"]
print(response)
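
To receive tokens as they are generated instead of waiting for the full response, the same call supports streaming. A minimal sketch reusing llm and prompt from above:

# Stream the completion token by token
for chunk in llm(prompt, max_tokens=256, temperature=0.0, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()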

Interactive Controller

Save this as orbi_controller.py:

import json
import re
from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="orbi-1b-q4.gguf",
    n_ctx=4096,
    n_threads=8,
)

# System prompt (same as in the Basic Inference example above)
system_prompt = """You are Orbi's brain.
Respond with one or more <tool_call> JSON blocks, in the exact order the user requests actions.
Call the best tool for each part of the user's request. Do not write stories yourself.
Do not summarize news yourself. Map synonyms to the tool argument enums.
If parameters are missing, pick sensible defaults. Keep outputs terse.

Available tools and enums:
- smile() -> {}
- cry() -> {}
- move_hands(direction ∈ {left,right,up,down,wave}, speed ∈ {slow,normal,fast})
- dance(style ∈ {hiphop,ballet,robot,random}, duration_sec ∈ [10..120])
- tell_news(topic: string)
- tell_story(topic: string, tone ∈ {wholesome,funny,dramatic,spooky,random}, length ∈ {short,medium,long})
"""

def parse_tool_calls(text):
    pattern = r"<tool_call>\s*(\{.*?\})\s*</tool_call>"
    matches = re.findall(pattern, text, re.DOTALL)
    tools = []
    for m in matches:
        try:
            tools.append(json.loads(m))
        except json.JSONDecodeError:
            # Skip malformed JSON blocks
            continue
    return tools

# Interactive loop
print("๐Ÿค– Orbi is ready! Type 'exit' to quit.\n")
while True:
    user_input = input("You: ").strip()
    if user_input.lower() in {"exit", "quit"}:
        break
    
    prompt = f"<|system|>\n{system_prompt}\n<|user|>\n{user_input}\n<|assistant|>\n"
    output = llm(prompt, max_tokens=256, temperature=0.0)
    response = output["choices"][0]["text"]
    
    tools = parse_tool_calls(response)
    print(f"Orbi: {json.dumps(tools, indent=2)}\n")

Expected Output Format

<tool_call>
{"name": "move_hands", "arguments": {"direction": "wave", "speed": "fast"}}
</tool_call>
<tool_call>
{"name": "smile", "arguments": {}}
</tool_call>
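
To act on this output, the list returned by parse_tool_calls (from orbi_controller.py above) can be routed to ordinary Python functions. The sketch below is illustrative and not part of this repository; the print statements stand in for real robot actions:

# Hypothetical handlers standing in for real robot actions
def smile(**kwargs):
    print("Orbi smiles")

def move_hands(direction="wave", speed="normal", **kwargs):
    print(f"Orbi moves hands: {direction} ({speed})")

HANDLERS = {"smile": smile, "move_hands": move_hands}

def dispatch(tool_calls):
    """Run each parsed tool call in the order the model emitted it."""
    for call in tool_calls:
        handler = HANDLERS.get(call.get("name"))
        if handler is None:
            print(f"Unknown tool: {call.get('name')}")
            continue
        handler(**call.get("arguments", {}))

# The expected output above parses to this list
dispatch([
    {"name": "move_hands", "arguments": {"direction": "wave", "speed": "fast"}},
    {"name": "smile", "arguments": {}},
])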

Performance Benchmarks

Approximate inference speeds on different hardware:

Hardware            Tokens/sec   Memory Usage
M1 MacBook Pro      ~45 t/s      800MB
Intel i7-12700K     ~35 t/s      750MB
Raspberry Pi 5      ~8 t/s       700MB
AMD Ryzen 7 5800X   ~40 t/s      750MB

Note: Actual performance may vary based on context length and system configuration.

Supported Tools

  • Physical Actions: smile(), cry(), move_hands(), dance()
  • Content Generation: tell_news(), tell_story()
  • Information: whats_your_name(), who_am_i()
  • Utilities: answer_arithmetic(), english_learning()

Configuration Options

llama-cpp-python Parameters

llm = Llama(
    model_path="orbi-1b-q4.gguf",
    n_ctx=4096,              # Context window size
    n_threads=8,             # CPU threads (adjust based on your CPU)
    n_gpu_layers=0,          # Set > 0 for GPU offloading
    use_mlock=True,          # Lock model in RAM (prevents swapping)
    verbose=False,           # Disable verbose logging
    seed=42,                 # Set seed for reproducibility
)

Generation Parameters

output = llm(
    prompt,
    max_tokens=256,          # Maximum tokens to generate
    temperature=0.0,         # 0.0 = greedy (recommended for tool calling)
    top_p=0.95,             # Nucleus sampling
    repeat_penalty=1.1,      # Penalize repetition
    stop=["</tool_call>"],   # Stop sequences
)

Use Cases

  • Robotics: Control physical robots with natural language
  • IoT Devices: Run on Raspberry Pi or similar edge devices
  • Embedded Systems: Low-memory environments
  • Offline Applications: No internet connection required
  • Desktop Assistants: CPU-only machines without GPU

Limitations

  • Quantization may result in slight quality degradation compared to full precision
  • Best performance with greedy decoding (temperature=0.0)
  • Limited to the predefined set of tools
  • Context window in the examples above is set to 4096 tokens (n_ctx=4096); the TinyLlama base model was trained with a 2048-token context

Model Details

Training

  • Method: LoRA fine-tuning on TinyLlama-1.1B-Chat (a rough sketch of such a setup follows this list)
  • Dataset: Custom conversational dataset with tool calling examples
  • Framework: Transformers + PEFT + TRL
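
For context, the snippet below sketches what a LoRA setup on the TinyLlama base typically looks like with PEFT. The rank, alpha, dropout, and target modules are illustrative defaults, not the configuration actually used for Orbi-1B, and the dataset preparation and TRL training loop are omitted:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Illustrative LoRA hyperparameters (the real ones are not documented on this card)
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable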

Quantization

  • Method: Q4_K_M quantization via llama.cpp
  • Benefits: ~75% size reduction with minimal quality loss
  • Original Size: ~2.2GB → GGUF Size: ~650MB

Troubleshooting

Model loads slowly

  • Enable use_mlock=True to keep model in RAM
  • Increase n_threads based on your CPU cores

Out of memory

  • Reduce n_ctx (context window size)
  • Close other applications
  • Use a lower quantization (Q2_K or Q3_K_M)

Slow inference

  • Increase n_threads to match your CPU cores
  • Enable GPU offloading with n_gpu_layers (see the example after this list)
  • Reduce max_tokens if generating long responses
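
As a concrete example of the GPU tip, the sketch below (assuming llama-cpp-python was built with CUDA or Metal support as shown in Installation) offloads every layer and caps generation length:

from llama_cpp import Llama

llm = Llama(
    model_path="orbi-1b-q4.gguf",
    n_ctx=4096,
    n_threads=8,       # still used for any layers left on the CPU
    n_gpu_layers=-1,   # -1 offloads all layers; use a smaller number for partial offload
)

# Shorter max_tokens also shortens worst-case response time
prompt = "<|system|>\nYou are Orbi's brain.\n<|user|>\nSmile\n<|assistant|>\n"  # see Usage for the full system prompt
output = llm(prompt, max_tokens=128, temperature=0.0)
print(output["choices"][0]["text"])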

License

Apache 2.0 (inherited from TinyLlama base model)

Citation

@misc{orbi-1b-gguf,
  title={Orbi-1B GGUF: Quantized Function Calling Model},
  author={Arojit Ghosh},
  year={2025},
  howpublished={\url{https://huggingface.co/Arojit/orbi-1b-gguf}}
}
