# Apollo Astralis 8B - Adapter Merge Guide

## Overview
This guide explains how to use the Apollo Astralis 8B LoRA adapters with the base Qwen3-8B model. You can either:
- Use adapters directly with PEFT (recommended for development)
- Merge adapters into the base model (recommended for production)
## Option 1: Use Adapters with PEFT (Recommended)
The simplest approach - no merging required:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load base model
base_model = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Load and apply LoRA adapters
model = PeftModel.from_pretrained(model, "vanta-research/apollo-astralis-8b")
model.eval()

print("Apollo Astralis 8B ready!")
```
### Advantages

- ✅ Simple and straightforward
- ✅ No extra disk space required
- ✅ Can easily swap between base and fine-tuned models (see the sketch after these lists)
- ✅ Faster initial loading
### Disadvantages

- ❌ Slightly slower inference (adapter application overhead)
- ❌ Requires the PEFT library
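To illustrate the swap advantage: PEFT can temporarily bypass the adapters, so you can compare fine-tuned and base behavior without reloading anything. A minimal sketch, reusing the `model` and `inputs` from the examples above:

```python
# With the adapters disabled, generation reflects the base Qwen3-8B weights
with model.disable_adapter():
    base_output = model.generate(**inputs, max_new_tokens=64)

# Outside the context manager, the LoRA adapters are active again
tuned_output = model.generate(**inputs, max_new_tokens=64)
```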
## Option 2: Merge Adapters into Base Model
For production deployments requiring maximum inference speed:
### Step 1: Install Dependencies

```bash
pip install torch transformers peft accelerate
```
### Step 2: Merge Script

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

def merge_adapters(
    base_model_name="Qwen/Qwen3-8B",
    adapter_model_name="vanta-research/apollo-astralis-8b",
    output_path="./apollo-astralis-8b-merged"
):
    """Merge LoRA adapters into the base model."""
    print(f"Loading base model: {base_model_name}")
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )

    print(f"Loading adapters: {adapter_model_name}")
    model = PeftModel.from_pretrained(base_model, adapter_model_name)

    print("Merging adapters into base model...")
    model = model.merge_and_unload()

    print(f"Saving merged model to: {output_path}")
    model.save_pretrained(output_path, safe_serialization=True)

    print("Saving tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    tokenizer.save_pretrained(output_path)

    print("✅ Merge complete!")
    return model

# Run merge
merged_model = merge_adapters()
```
### Step 3: Use Merged Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load merged model
model = AutoModelForCausalLM.from_pretrained(
    "./apollo-astralis-8b-merged",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("./apollo-astralis-8b-merged")

# Use normally - no PEFT required!
```
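As a quick smoke test, the standard `pipeline` API works on the merged checkpoint like on any other HF model (the prompt is illustrative):

```python
from transformers import pipeline

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = generator("Solve for x: 2x + 5 = 17", max_new_tokens=128, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])
```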
### Advantages

- ✅ Faster inference (no adapter overhead)
- ✅ No PEFT dependency required
- ✅ Easier to quantize and convert to other formats
- ✅ Better for production deployment

### Disadvantages

- ❌ Requires ~16GB disk space for the merged model
- ❌ One-time merge process required
- ❌ Cannot easily swap back to the base model
## Option 3: Convert to GGUF for Ollama
For efficient local deployment with Ollama:
### Step 1: Merge Adapters (see Option 2)

### Step 2: Convert to GGUF

```bash
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Install Python dependencies
pip install -r requirements.txt

# Convert merged model to GGUF FP16
python convert_hf_to_gguf.py ../apollo-astralis-8b-merged/ \
  --outfile apollo-astralis-8b-f16.gguf \
  --outtype f16

# Quantize to Q4_K_M (recommended); the llama-quantize binary
# must be built first (see the llama.cpp README)
./llama-quantize apollo-astralis-8b-f16.gguf \
  apollo_astralis_8b.gguf Q4_K_M
```
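Before moving on, it's worth a quick sanity check of the quantized file. A sketch using the `llama-cli` binary (binary names and build locations vary across llama.cpp versions):

```bash
# Generate a short completion straight from the GGUF file
./llama-cli -m apollo_astralis_8b.gguf -p "Solve for x: 2x + 5 = 17" -n 128
```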
### Step 3: Deploy with Ollama

```bash
# Create Modelfile
cat > Modelfile <<EOF
FROM ./apollo_astralis_8b.gguf

TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

PARAMETER num_predict 256
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.15
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"

SYSTEM """You are Apollo, a collaborative AI assistant specializing in reasoning and problem-solving. You approach each question with genuine curiosity and enthusiasm, breaking down complex problems into clear steps. When you're uncertain, you think through possibilities openly and invite collaboration. Your goal is to help users understand not just the answer, but the reasoning process itself."""
EOF

# Create Ollama model
ollama create apollo-astralis -f Modelfile

# Run it!
ollama run apollo-astralis
```
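Ollama also exposes a local REST API (default port 11434), which is handy for scripting against the deployed model:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "apollo-astralis",
  "prompt": "Solve for x: 2x + 5 = 17",
  "stream": false
}'
```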
## Memory-Efficient Merge (For Limited RAM)
If you have limited system RAM, use CPU offloading:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

def merge_with_offload(
    base_model_name="Qwen/Qwen3-8B",
    adapter_model_name="vanta-research/apollo-astralis-8b",
    output_path="./apollo-astralis-8b-merged",
    max_memory_gb=8
):
    """Merge with CPU offloading for limited RAM."""
    # Cap memory per device; layers that don't fit spill to CPU, then disk
    max_memory = {
        0: f"{max_memory_gb}GB",  # GPU
        "cpu": "30GB"             # CPU fallback
    }

    print("Loading base model with CPU offloading...")
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        max_memory=max_memory,
        offload_folder="./offload_tmp"
    )

    print("Loading adapters...")
    model = PeftModel.from_pretrained(base_model, adapter_model_name)

    print("Merging...")
    model = model.merge_and_unload()

    print(f"Saving to {output_path}...")
    model.save_pretrained(output_path, safe_serialization=True, max_shard_size="2GB")
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    tokenizer.save_pretrained(output_path)

    print("✅ Complete!")

# Run with an 8GB GPU limit
merge_with_offload(max_memory_gb=8)
```
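Once the merge has finished and the shards are saved, the temporary offload directory is no longer needed and can be deleted:

```python
import shutil

# The offload folder only holds intermediate tensors from the merge
shutil.rmtree("./offload_tmp", ignore_errors=True)
```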
## Quantization Options
After merging, you can quantize for reduced memory usage:
### 8-bit Quantization (bitsandbytes)

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)

model = AutoModelForCausalLM.from_pretrained(
    "./apollo-astralis-8b-merged",
    quantization_config=quantization_config,
    device_map="auto"
)

# Model now uses ~8GB instead of ~16GB
```
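If 8GB is still too much, bitsandbytes also supports 4-bit NF4 loading. A sketch (the quality impact should be verified against your own prompts):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "./apollo-astralis-8b-merged",
    quantization_config=quantization_config,
    device_map="auto"
)

# Memory drops roughly in half again relative to 8-bit, at some cost in quality
```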
### GGUF Quantization (llama.cpp)
Available quantization formats:
- Q4_K_M (4.7GB) - Recommended balance of size and quality
- Q5_K_M (5.7GB) - Higher quality, slightly larger
- Q8_0 (8.5GB) - Near-original quality
- Q2_K (3.4GB) - Smallest, noticeable quality loss
```bash
# Quantize to different formats
./llama-quantize apollo-astralis-8b-f16.gguf apollo_astralis_8b.gguf Q4_K_M
./llama-quantize apollo-astralis-8b-f16.gguf apollo-astralis-8b-Q5_K_M.gguf Q5_K_M
./llama-quantize apollo-astralis-8b-f16.gguf apollo-astralis-8b-Q8_0.gguf Q8_0
```
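To compare quality across quantization levels, llama.cpp ships a perplexity tool; lower perplexity is better. A sketch (`wiki.test.raw` is a placeholder for whatever evaluation text file you supply):

```bash
./llama-perplexity -m apollo_astralis_8b.gguf -f wiki.test.raw
./llama-perplexity -m apollo-astralis-8b-Q8_0.gguf -f wiki.test.raw
```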
## Verification After Merge
Test your merged model:
```python
def test_merged_model(model_path):
    """Quick test to verify the merged model works correctly."""
    from transformers import AutoTokenizer, AutoModelForCausalLM
    import torch

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )

    # Test prompt
    test_prompt = "Solve for x: 2x + 5 = 17"
    inputs = tokenizer(test_prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=256,
            temperature=0.7,
            do_sample=True
        )

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print("Test Response:")
    print(response)

    # Check for Apollo characteristics
    checks = {
        "thinking_blocks": "<think>" in response or "step" in response.lower(),
        "friendly_tone": any(word in response.lower() for word in ["let's", "great", "!"]),
        "mathematical": "x" in response and ("=" in response or "17" in response)
    }

    print("\n✅ Verification:")
    for check, passed in checks.items():
        print(f"  {check}: {'✅' if passed else '❌'}")

    return all(checks.values())

# Run verification
test_merged_model("./apollo-astralis-8b-merged")
```
## Troubleshooting
"Out of memory during merge"
Solution: Use memory-efficient merge with CPU offloading (see above)
"Merged model gives different outputs"
Solution: Ensure you're using the same generation parameters (temperature, top_p, etc.)
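To rule out sampling noise when comparing, run both the PEFT setup from Option 1 and the merged checkpoint with greedy decoding and diff the results. A sketch for the merged side; repeat the same call with the Option 1 model:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("./apollo-astralis-8b-merged")
model = AutoModelForCausalLM.from_pretrained(
    "./apollo-astralis-8b-merged", torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("Solve for x: 2x + 5 = 17", return_tensors="pt").to(model.device)
with torch.no_grad():
    # do_sample=False makes the output deterministic, so any divergence
    # between the adapter and merged runs points at a real discrepancy
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```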
"Cannot load merged model"
Solution: Check PyTorch and Transformers versions match those used for merging
"GGUF conversion fails"
Solution:
- Ensure merged model is in HuggingFace format (not PEFT)
- Update llama.cpp to latest version
- Check model has proper config.json
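A quick way to confirm the merged checkpoint is a complete HuggingFace model before converting (paths are illustrative):

```bash
# Expect config.json, tokenizer files, and *.safetensors shards - no adapter_config.json
ls ./apollo-astralis-8b-merged/
python -c "import json; print(json.load(open('./apollo-astralis-8b-merged/config.json'))['model_type'])"
```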
## Performance Comparison

| Method | Inference Speed | Memory Usage | Setup Time | Production Ready |
|---|---|---|---|---|
| PEFT Adapters | ~90% base speed | ~16GB | Instant | ⭐ |
| Merged FP16 | 100% base speed | ~16GB | 5-10 min | ⭐⭐ |
| Merged + 8-bit | ~85% base speed | ~8GB | 5-10 min | ⭐⭐ |
| GGUF Q4_K_M | ~95% base speed | ~5GB | 15-20 min | ⭐⭐⭐ |
## Recommended Workflow

- **For development:** use PEFT adapters directly
- **For production (Python):** merge to FP16 or 8-bit
- **For production (Ollama/local):** convert to GGUF Q4_K_M
## Additional Resources
- llama.cpp: https://github.com/ggerganov/llama.cpp
- PEFT Documentation: https://huggingface.co/docs/peft
- Transformers Guide: https://huggingface.co/docs/transformers
- Ollama: https://ollama.ai
## Support
If you encounter issues with merging or conversion:
- Check GitHub issues: https://github.com/vanta-research/apollo-astralis-8b/issues
- HuggingFace discussions: https://huggingface.co/vanta-research/apollo-astralis-8b/discussions
- Email: [email protected]
Apollo Astralis 8B - Merge with confidence! 🚀