apollo-astralis-8b / USAGE_GUIDE.md
Tyler Williams · chore: rename GGUF to apollo_astralis_8b.gguf and update docs (3c4ec38)

Apollo Astralis 8B Usage Guide

Table of Contents

  1. Installation & Setup
  2. Deployment Methods
  3. Usage Patterns
  4. Advanced Usage
  5. Integration Examples
  6. Performance Optimization
  7. Troubleshooting
  8. Best Practices

Installation & Setup

Option 1: Ollama (Recommended)

The simplest way to use Apollo Astralis:

# Install Ollama (if not already installed)
curl -fsSL https://ollama.ai/install.sh | sh

# Download the GGUF model file
wget https://huggingface.co/vanta-research/apollo-astralis-8b/resolve/main/apollo_astralis_8b.gguf

# Create Modelfile
cat > Modelfile-apollo-astralis <<'EOF'
FROM ./apollo_astralis_8b.gguf

TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

PARAMETER num_predict 256
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.15
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"

SYSTEM """You are Apollo, a collaborative AI assistant specializing in reasoning and problem-solving. You approach each question with genuine curiosity and enthusiasm, breaking down complex problems into clear steps. When you're uncertain, you think through possibilities openly and invite collaboration. Your goal is to help users understand not just the answer, but the reasoning process itself."""
EOF

# Create the model
ollama create apollo-astralis -f Modelfile-apollo-astralis

# Start chatting!
ollama run apollo-astralis
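Ollama also serves a local REST API (default port 11434) once it is running. A minimal client sketch using only Python's standard library — the endpoint and JSON fields follow Ollama's /api/generate API, and the model name assumes you created it as apollo-astralis above:

```python
import json
import urllib.request

def build_ollama_request(prompt, model="apollo-astralis", stream=False):
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode("utf-8")

def ask_ollama(prompt, host="http://localhost:11434"):
    """POST a prompt to a locally running Ollama server and return the response text."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=build_ollama_request(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires a running server:
# print(ask_ollama("If x + 7 = 15, what is x?"))
```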

Option 2: Python with HuggingFace

For programmatic access via Python:

# Install dependencies
pip install torch transformers peft accelerate

# Or with GPU support
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install transformers peft accelerate

Then load the model in Python:

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load base model
base_model = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Load and apply LoRA adapter
model = PeftModel.from_pretrained(model, "vanta-research/apollo-astralis-8b")
model.eval()

print("Apollo Astralis 8B loaded successfully!")

Option 3: GGUF with llama.cpp

For C++ based deployment:

# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Download model
wget https://huggingface.co/vanta-research/apollo-astralis-8b/resolve/main/apollo_astralis_8b.gguf

# Run inference (older builds use `make` and name the binary ./main)
./build/bin/llama-cli -m apollo_astralis_8b.gguf \
  --prompt "Solve this problem: If x + 7 = 15, what is x?" \
  --temp 0.7 \
  --top-p 0.9 \
  --repeat-penalty 1.15 \
  -n 256

Deployment Methods

Conservative Mode (Default - 256 tokens)

Best for most tasks with balanced response length:

# Modelfile
FROM ./apollo_astralis_8b.gguf

TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

PARAMETER num_predict 256
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.15
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"

SYSTEM """You are Apollo, a collaborative AI assistant specializing in reasoning and problem-solving. You approach each question with genuine curiosity and enthusiasm, breaking down complex problems into clear steps. When you're uncertain, you think through possibilities openly and invite collaboration. Your goal is to help users understand not just the answer, but the reasoning process itself."""

Unlimited Mode (For Complex Reasoning)

For multi-step reasoning requiring extended chain-of-thought:

# Modelfile-unlimited
FROM ./apollo_astralis_8b.gguf

TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

PARAMETER num_predict -1
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.15
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"

SYSTEM """You are Apollo, a collaborative AI assistant specializing in reasoning and problem-solving. You approach each question with genuine curiosity and enthusiasm, breaking down complex problems into clear steps. When you're uncertain, you think through possibilities openly and invite collaboration. Your goal is to help users understand not just the answer, but the reasoning process itself."""

Create with: ollama create apollo-astralis-unlimited -f Modelfile-unlimited

Usage Patterns

1. Mathematical Problem Solving

Apollo excels at step-by-step mathematical reasoning:

def solve_math_problem(problem, max_tokens=512):
    """Solve mathematical problems with detailed explanations."""
    prompt = f"Solve this problem step by step: {problem}"
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Examples
problems = [
    "If a train travels 120 miles in 2 hours, what is its average speed?",
    "Calculate 15% of 240",
    "Solve for x: 3x + 7 = 22",
    "A rectangle has length 8 and width 5. Find its area and perimeter."
]

for problem in problems:
    print(f"\n{'='*60}")
    print(f"Problem: {problem}")
    print(f"{'='*60}")
    solution = solve_math_problem(problem)
    print(solution)

Example Output:

Problem: Solve for x: 3x + 7 = 22

<think>
I need to solve this linear equation step by step:
1. Isolate the term with x
2. Divide to find x
3. Verify the answer
</think>

Let's solve this together!

Step 1: Subtract 7 from both sides
3x + 7 - 7 = 22 - 7
3x = 15

Step 2: Divide both sides by 3
3x ÷ 3 = 15 ÷ 3
x = 5

Step 3: Verify
3(5) + 7 = 15 + 7 = 22 ✓

Therefore, x = 5!
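When post-processing outputs like the one above, two details matter: tokenizer.decode returns the prompt plus the completion, and the visible reasoning sits inside <think> tags. Two small string helpers (pure string handling, no model required) cover both:

```python
import re

def strip_prompt(prompt, decoded):
    """Drop the echoed prompt, keeping only the generated continuation."""
    return decoded[len(prompt):].lstrip() if decoded.startswith(prompt) else decoded

def strip_think(response):
    """Remove <think>...</think> reasoning blocks from a response."""
    return re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()

raw = "<think>\nIsolate x, then divide.\n</think>\n\nTherefore, x = 5!"
print(strip_think(raw))  # → Therefore, x = 5!
```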

2. Logical Reasoning

Apollo handles complex logical structures:

def analyze_logic_problem(problem):
    """Analyze logical reasoning problems with clear structure."""
    prompt = f"Analyze this logical reasoning problem: {problem}"
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Examples
logic_problems = [
    "If all cats are mammals, and Fluffy is a cat, what can we conclude about Fluffy?",
    "All roses are flowers. Some flowers fade quickly. Can we conclude that some roses fade quickly?",
    "If it rains, the ground gets wet. The ground is wet. Did it necessarily rain?",
]

for problem in logic_problems:
    print(f"\n{'='*60}")
    print(f"Problem: {problem}")
    print(f"{'='*60}")
    analysis = analyze_logic_problem(problem)
    print(analysis)

3. Creative Problem Solving

Apollo approaches puzzles with enthusiasm:

def solve_puzzle(puzzle):
    """Solve creative puzzles with step-by-step reasoning."""
    prompt = f"Solve this puzzle: {puzzle}"
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Classic puzzles
puzzles = [
    "You have a 3-liter jug and a 5-liter jug. How can you measure exactly 4 liters?",
    "A farmer needs to cross a river with a wolf, a goat, and a cabbage. The boat can only hold the farmer and one item. How does he get everything across safely?",
    "Three light switches control three light bulbs in another room. You can flip switches but only visit the room once. How do you determine which switch controls which bulb?",
]

for puzzle in puzzles:
    print(f"\n{'='*60}")
    print(f"Puzzle: {puzzle}")
    print(f"{'='*60}")
    solution = solve_puzzle(puzzle)
    print(solution)

4. Collaborative Brainstorming

Apollo's warm personality shines in collaborative tasks:

def brainstorm_with_apollo(topic, context=""):
    """Brainstorm ideas with Apollo's collaborative approach."""
    prompt = f"Let's brainstorm together about: {topic}"
    if context:
        prompt += f"\n\nContext: {context}"
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.8,  # Slightly higher for creativity
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Examples
topics = [
    "How can we make online learning more engaging for students?",
    "What are some creative ways to reduce food waste at home?",
    "How might AI assistants help with mental wellness?",
]

for topic in topics:
    print(f"\n{'='*60}")
    print(f"Topic: {topic}")
    print(f"{'='*60}")
    ideas = brainstorm_with_apollo(topic)
    print(ideas)

5. Code Reasoning & Debugging

Apollo helps understand and fix code:

def analyze_code(code, question=""):
    """Analyze code with reasoning about logic and improvements."""
    prompt = f"""Analyze this code:

```python
{code}
```

{question if question else "Explain what it does and suggest improvements."}"""

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example
buggy_code = """
def calculate_average(numbers):
    total = 0
    for num in numbers:
        total += num
    return total / len(numbers)

result = calculate_average([])
print(result)
"""

analysis = analyze_code(buggy_code, "What's wrong with this code and how can we fix it?")
print(analysis)

Advanced Usage

Batch Processing

Process multiple questions efficiently:
def batch_process(questions, batch_size=4):
    """Process multiple questions in batches."""
    results = []
    
    # Decoder-only models need left padding (and a pad token) for batched generation
    tokenizer.padding_side = "left"
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    for i in range(0, len(questions), batch_size):
        batch = questions[i:i+batch_size]
        
        # Tokenize batch
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True).to(model.device)
        
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=256,
                temperature=0.7,
                top_p=0.9,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id,
                eos_token_id=tokenizer.eos_token_id
            )
        
        # Decode batch
        batch_results = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        results.extend(batch_results)
    
    return results

# Example
questions = [
    "What is 25 + 37?",
    "Explain the concept of recursion",
    "How do I sort a list in Python?",
    "What's the difference between a list and a tuple?",
]

answers = batch_process(questions)
for q, a in zip(questions, answers):
    print(f"\nQ: {q}\nA: {a}\n{'-'*60}")
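The slicing pattern inside batch_process generalizes to any list; factored out as a helper it is easier to test and reuse:

```python
def chunks(items, size):
    """Yield successive slices of at most `size` items each."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

print(list(chunks([1, 2, 3, 4, 5], 2)))  # → [[1, 2], [3, 4], [5]]
```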

Memory-Efficient Generation

For limited GPU memory:

def memory_efficient_generate(prompt, max_tokens=400):
    """Generate responses with minimal memory usage."""
    # Clear cache
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    
    # Use no_grad and enable KV caching
    with torch.no_grad():
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=0.7,
            use_cache=True,  # Enable KV caching
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
        
        result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    # Clear cache again
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    
    return result
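For long generations, the KV cache is usually what exhausts memory, and it grows linearly with sequence length. A back-of-the-envelope estimate — the layer/head numbers below are assumptions for an 8B Qwen3-style model with grouped-query attention; check the checkpoint's config.json for the real values:

```python
def kv_cache_bytes(seq_len, batch=1, layers=36, kv_heads=8, head_dim=128, dtype_bytes=2):
    """Approximate KV-cache size: one K and one V tensor per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# Example: a 4096-token context in 16-bit precision
print(f"{kv_cache_bytes(4096) / 1024**3:.2f} GiB")  # → 0.56 GiB
```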

Streaming Responses

For real-time generation:

from transformers import TextIteratorStreamer
from threading import Thread

def stream_response(prompt):
    """Stream responses token by token."""
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    # Generation in a separate thread
    generation_kwargs = {
        **inputs,
        "streamer": streamer,
        "max_new_tokens": 512,
        "temperature": 0.7,
        "do_sample": True,
    }
    
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()
    
    # Stream output
    print("Apollo: ", end="", flush=True)
    for text in streamer:
        print(text, end="", flush=True)
    print()  # New line at end
    
    thread.join()

# Example
stream_response("Explain quantum entanglement in simple terms")

Custom System Prompts

Adapt Apollo's personality for specific contexts:

def create_custom_prompt(user_message, system_prompt=None):
    """Create a chat prompt with custom system prompt."""
    default_system = """You are Apollo, a collaborative AI assistant specializing in reasoning and problem-solving. You approach each question with genuine curiosity and enthusiasm, breaking down complex problems into clear steps."""
    
    system = system_prompt or default_system
    
    chat = [
        {"role": "system", "content": system},
        {"role": "user", "content": user_message}
    ]
    
    return tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

# Example: Focus on brevity
brief_system = """You are Apollo, an AI assistant focused on clear, concise explanations. Provide direct answers with minimal extra commentary, but maintain a friendly tone."""

prompt = create_custom_prompt("What is photosynthesis?", system_prompt=brief_system)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
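apply_chat_template handles the formatting automatically, but when targeting a backend without a tokenizer (e.g. raw llama.cpp prompts), the same ChatML layout used in the Modelfiles can be assembled by hand. A sketch — the default system string here is illustrative:

```python
def format_chatml(user_message, system="You are Apollo, a collaborative AI assistant."):
    """Assemble a ChatML prompt matching the template in the Modelfile."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(format_chatml("What is photosynthesis?"))
```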

Integration Examples

FastAPI Server

Deploy Apollo as a REST API:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Optional
import uvicorn

app = FastAPI(title="Apollo Astralis API", version="1.0.0")

# Load model on startup
@app.on_event("startup")
async def load_model():
    global model, tokenizer
    # Model loading code here...
    print("Apollo Astralis loaded and ready!")

class Question(BaseModel):
    text: str
    max_tokens: Optional[int] = 256
    temperature: Optional[float] = 0.7
    system_prompt: Optional[str] = None

@app.post("/ask")
async def ask_apollo(question: Question):
    """Ask Apollo a question."""
    try:
        prompt = create_custom_prompt(question.text, question.system_prompt)
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=question.max_tokens,
                temperature=question.temperature,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id
            )
        
        # Decode only the newly generated tokens, not the echoed prompt
        response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        
        return {
            "question": question.text,
            "response": response,
            "model": "apollo-astralis-8b-v5-conservative"
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health_check():
    """Health check endpoint."""
    return {"status": "healthy", "model": "apollo-astralis-8b"}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
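A matching client for the server above, using only the standard library. The field names mirror the Question model; the host and port assume the local uvicorn command shown:

```python
import json
import urllib.request

def build_ask_body(text, max_tokens=256, temperature=0.7, system_prompt=None):
    """Build the JSON body expected by the /ask endpoint."""
    body = {"text": text, "max_tokens": max_tokens, "temperature": temperature}
    if system_prompt is not None:
        body["system_prompt"] = system_prompt
    return json.dumps(body).encode("utf-8")

def ask(text, host="http://localhost:8000"):
    """POST a question to the Apollo API and return the generated response."""
    req = urllib.request.Request(
        f"{host}/ask",
        data=build_ask_body(text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires a running server:
# print(ask("Solve for x: 3x + 7 = 22"))
```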

Gradio Interface

Create an interactive web UI:

import gradio as gr

def apollo_chat(message, history, temperature=0.7, max_tokens=512):
    """Chat with Apollo using Gradio."""
    # Format conversation history
    chat = []
    for h in history:
        chat.append({"role": "user", "content": h[0]})
        chat.append({"role": "assistant", "content": h[1]})
    chat.append({"role": "user", "content": message})
    
    prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=temperature,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
    response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    return response

# Create interface
interface = gr.ChatInterface(
    fn=apollo_chat,
    title="Apollo Astralis 8B",
    description="Chat with Apollo - a reasoning-focused AI with warm personality",
    theme=gr.themes.Soft(),
    examples=[
        "Solve for x: 2x + 5 = 17",
        "Explain recursion with a simple example",
        "Help me brainstorm ideas for a science fair project",
        "What's the difference between correlation and causation?",
    ],
    additional_inputs=[
        gr.Slider(0.1, 1.0, value=0.7, label="Temperature"),
        gr.Slider(128, 1024, value=512, step=128, label="Max Tokens"),
    ]
)

interface.launch(share=True)

Command Line Interface

Simple CLI tool:

#!/usr/bin/env python3
import argparse
import sys

def main():
    parser = argparse.ArgumentParser(description="Apollo Astralis CLI")
    parser.add_argument("prompt", help="Question or prompt for Apollo")
    parser.add_argument("--temperature", type=float, default=0.7, help="Sampling temperature")
    parser.add_argument("--max-tokens", type=int, default=512, help="Maximum tokens to generate")
    parser.add_argument("--stream", action="store_true", help="Stream output token by token")
    
    args = parser.parse_args()
    
    # Load model if not already loaded
    global model, tokenizer
    if 'model' not in globals():
        print("Loading Apollo Astralis...", file=sys.stderr)
        # Load model...
        print("Ready!", file=sys.stderr)
    
    # Generate response
    if args.stream:
        stream_response(args.prompt)
    else:
        inputs = tokenizer(args.prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=args.max_tokens,
                temperature=args.temperature,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id
            )
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        print(response)

if __name__ == "__main__":
    main()

Usage: ./apollo_cli.py "What is the Pythagorean theorem?" --stream

Performance Optimization

GPU Optimization

# Use Flash Attention 2 (if available)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"  # Requires flash-attn package
)

# Use torch.compile for faster inference (PyTorch 2.0+)
model = torch.compile(model, mode="reduce-overhead")

CPU Optimization

# Optimize for CPU inference
import torch
torch.set_num_threads(8)  # Adjust based on your CPU

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float32,  # Use float32 on CPU
    device_map="cpu"
)

# Use optimized GGUF with llama.cpp instead

Memory Optimization

# 8-bit quantization (requires bitsandbytes)
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=quantization_config,
    device_map="auto"
)
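To decide which of these options fits your hardware, the dominant term is simply parameter count times bytes per parameter. A rough sketch, assuming ~8.2B parameters (adjust to the actual checkpoint); the KV cache and activations come on top:

```python
def model_weight_gib(n_params=8.2e9, bytes_per_param=2):
    """Approximate weight memory in GiB (weights only, no KV cache or activations)."""
    return n_params * bytes_per_param / 1024**3

for name, b in [("bf16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{name}: ~{model_weight_gib(bytes_per_param=b):.1f} GiB")
```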

Troubleshooting

Common Issues

Issue: Out of memory error

Solutions:

  • Use quantized GGUF model (Q4_K_M recommended)
  • Reduce max_new_tokens
  • Process prompts one at a time rather than in batches
  • Enable 8-bit quantization
  • Use CPU inference

Issue: Slow generation speed

Solutions:

  • Use GPU instead of CPU
  • Enable Flash Attention 2
  • Use torch.compile()
  • Reduce max_new_tokens
  • Use GGUF with llama.cpp

Issue: Responses cut off mid-sentence

Solutions:

  • Increase num_predict parameter (Ollama)
  • Increase max_new_tokens (Python)
  • Use unlimited variant for complex reasoning
  • Check for EOS token issues

Issue: Extracted answers don't match response content

Solutions:

  • Parse final answer after <think> blocks
  • Look for "Therefore," or "Answer:" markers
  • Use regex to extract final conclusions
  • Manually verify automated scores

Best Practices

Prompt Engineering

Good Prompts:

  • Clear and specific questions
  • Provide context when needed
  • Request step-by-step explanations
  • Ask for verification of results

Examples:

# Good
"Solve for x step by step: 3x + 7 = 22"

# Better
"Solve for x: 3x + 7 = 22. Show your work and verify the answer."

# Best
"I'm learning algebra. Can you solve for x in this equation: 3x + 7 = 22? Please show each step and explain what you're doing."

Temperature Settings

  • 0.1-0.3: Factual questions, mathematics, logical reasoning
  • 0.5-0.7: General conversation, explanations, problem-solving (default)
  • 0.8-1.0: Creative brainstorming, multiple perspectives
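These ranges can be encoded as a small lookup table, which is handy when routing different task types through one API; the category names here are illustrative:

```python
TEMPERATURE_BY_TASK = {
    "math": 0.2,         # factual / logical reasoning: 0.1-0.3
    "logic": 0.2,
    "explanation": 0.7,  # general conversation: 0.5-0.7
    "chat": 0.7,
    "brainstorm": 0.9,   # creative work: 0.8-1.0
}

def pick_temperature(task):
    """Return a recommended temperature, defaulting to the general-purpose 0.7."""
    return TEMPERATURE_BY_TASK.get(task, 0.7)

print(pick_temperature("math"))     # → 0.2
print(pick_temperature("unknown"))  # → 0.7
```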

Token Limits

  • 128-256: Quick answers, simple questions
  • 256-512: Standard explanations, moderate reasoning (default)
  • 512-1024: Complex problems, multi-step reasoning
  • Unlimited (-1): Extended chain-of-thought, very complex problems

Answer Extraction

When parsing Apollo's responses programmatically:

import re

def extract_final_answer(response):
    """Extract final answer from Apollo's response."""
    # Look for explicit answer markers
    patterns = [
        r"Therefore,?\s*(.+?)(?:\n|$)",
        r"Answer:\s*(.+?)(?:\n|$)",
        r"(?:The )?final answer is\s*(.+?)(?:\n|$)",
        r"x\s*=\s*([^,\n]+)",  # For algebra
    ]
    
    for pattern in patterns:
        match = re.search(pattern, response, re.IGNORECASE)
        if match:
            return match.group(1).strip()
    
    # Fallback: last non-empty line after <think> block
    if "</think>" in response:
        post_think = response.split("</think>")[-1]
        lines = [l.strip() for l in post_think.split("\n") if l.strip()]
        if lines:
            return lines[-1]
    
    # Ultimate fallback: last non-empty line
    lines = [l.strip() for l in response.split("\n") if l.strip()]
    return lines[-1] if lines else response

Personality Awareness

Apollo's warm personality is intentional, but may not suit all contexts:

Appropriate:

  • Educational environments
  • Collaborative work
  • Learning and exploration
  • Friendly assistance

Less appropriate:

  • Formal academic papers
  • Clinical documentation
  • Legal or medical contexts requiring neutrality
  • High-stakes professional advice

Verification & Validation

Always verify critical information:

def verify_with_apollo(claim, reasoning):
    """Ask Apollo to verify its own reasoning."""
    prompt = f"""Please verify this reasoning:

Claim: {claim}
Reasoning: {reasoning}

Is this correct? If not, what's wrong?"""
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=256,
            temperature=0.3,  # Lower temperature for careful checking
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

Additional Resources

Community & Support

  • Discord: Join the VANTA Research community
  • Discussions: HuggingFace model discussions
  • Email: [email protected]

Happy reasoning with Apollo Astralis! 🚀