# Apollo Astralis 8B - Adapter Merge Guide
## Overview
This guide explains how to use the Apollo Astralis 8B LoRA adapters with the base Qwen3-8B model. You can either:
1. **Use adapters directly** with PEFT (recommended for development)
2. **Merge adapters** into the base model (recommended for production)
## Option 1: Use Adapters with PEFT (Recommended)
The simplest approach - no merging required:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
# Load base model
base_model = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
base_model,
torch_dtype=torch.bfloat16,
device_map="auto"
)
# Load and apply LoRA adapters
model = PeftModel.from_pretrained(model, "vanta-research/apollo-astralis-8b")
model.eval()
print("Apollo Astralis 8B ready!")
```
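To sanity-check the adapter-loaded model, you can run a short generation. The sketch below reuses the `model` and `tokenizer` objects from the snippet above; the prompt and sampling settings are illustrative, not tuned values.
```python
# Quick smoke test of the adapter-loaded model (illustrative settings)
messages = [{"role": "user", "content": "Solve for x: 2x + 5 = 17"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=256,
        temperature=0.7,
        do_sample=True
    )

# Decode only the newly generated tokens
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```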
### Advantages
- ✅ Simple and straightforward
- ✅ No extra disk space required
- ✅ Can easily swap between base and fine-tuned models
- ✅ Faster initial loading
### Disadvantages
- ❌ Slightly slower inference (adapter application overhead)
- ❌ Requires the PEFT library
## Option 2: Merge Adapters into Base Model
For production deployments requiring maximum inference speed:
### Step 1: Install Dependencies
```bash
pip install torch transformers peft accelerate
```
### Step 2: Merge Script
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
def merge_adapters(
    base_model_name="Qwen/Qwen3-8B",
    adapter_model_name="vanta-research/apollo-astralis-8b",
    output_path="./apollo-astralis-8b-merged"
):
    """Merge LoRA adapters into base model."""
    print(f"Loading base model: {base_model_name}")
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )

    print(f"Loading adapters: {adapter_model_name}")
    model = PeftModel.from_pretrained(base_model, adapter_model_name)

    print("Merging adapters into base model...")
    model = model.merge_and_unload()

    print(f"Saving merged model to: {output_path}")
    model.save_pretrained(output_path, safe_serialization=True)

    print("Saving tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    tokenizer.save_pretrained(output_path)

    print("✅ Merge complete!")
    return model

# Run merge
merged_model = merge_adapters()
```
### Step 3: Use Merged Model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Load merged model
model = AutoModelForCausalLM.from_pretrained(
"./apollo-astralis-8b-merged",
torch_dtype=torch.bfloat16,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("./apollo-astralis-8b-merged")
# Use normally - no PEFT required!
```
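As a quick check that the merged weights behave as expected, run a short generation against the merged checkpoint (the prompt and settings below are illustrative):
```python
# Minimal generation check against the merged model loaded above
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 15% of 240?"}],
    add_generation_prompt=True,
    tokenize=False
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```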
### Advantages
- ✅ Faster inference (no adapter overhead)
- ✅ No PEFT dependency required
- ✅ Easier to quantize and convert to other formats
- ✅ Better for production deployment
### Disadvantages
- ❌ Requires ~16GB disk space for the merged model
- ❌ One-time merge process required
- ❌ Cannot easily swap back to the base model
## Option 3: Convert to GGUF for Ollama
For efficient local deployment with Ollama:
### Step 1: Merge Adapters (see Option 2)
### Step 2: Convert to GGUF
```bash
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Install Python dependencies
pip install -r requirements.txt
# Convert merged model to GGUF FP16
python convert_hf_to_gguf.py ../apollo-astralis-8b-merged/ \
--outfile apollo-astralis-8b-f16.gguf \
--outtype f16
# Quantize to Q4_K_M (recommended)
# Note: llama-quantize is a compiled binary; build llama.cpp first (see the repo README)
./llama-quantize apollo-astralis-8b-f16.gguf \
  apollo_astralis_8b.gguf Q4_K_M
```
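Before moving on, you can optionally run the quantized file directly with llama.cpp's CLI as a smoke test (the binary name and path depend on how you built llama.cpp; newer builds produce `llama-cli`, older ones `main`):
```bash
# Optional: quick prompt against the quantized model (illustrative flags)
./llama-cli -m apollo_astralis_8b.gguf -p "Solve for x: 2x + 5 = 17" -n 128
```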
### Step 3: Deploy with Ollama
```bash
# Create Modelfile
cat > Modelfile <<EOF
FROM ./apollo_astralis_8b.gguf

TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

PARAMETER num_predict 256
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.15
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>

SYSTEM """You are Apollo, a collaborative AI assistant specializing in reasoning and problem-solving. You approach each question with genuine curiosity and enthusiasm, breaking down complex problems into clear steps. When you're uncertain, you think through possibilities openly and invite collaboration. Your goal is to help users understand not just the answer, but the reasoning process itself."""
EOF
# Create Ollama model
ollama create apollo-astralis -f Modelfile
# Run it!
ollama run apollo-astralis
```
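Once the model is created, it can also be queried through Ollama's local REST API (served on port 11434 by default); for example:
```bash
# Non-streaming request to the local Ollama server
curl http://localhost:11434/api/generate -d '{
  "model": "apollo-astralis",
  "prompt": "Solve for x: 2x + 5 = 17",
  "stream": false
}'
```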
## Memory-Efficient Merge (For Limited RAM)
If you have limited system RAM, use CPU offloading:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
def merge_with_offload(
    base_model_name="Qwen/Qwen3-8B",
    adapter_model_name="vanta-research/apollo-astralis-8b",
    output_path="./apollo-astralis-8b-merged",
    max_memory_gb=8
):
    """Merge with CPU offloading for limited RAM."""
    # Calculate max memory per device
    max_memory = {
        0: f"{max_memory_gb}GB",  # GPU
        "cpu": "30GB"             # CPU fallback
    }

    print("Loading base model with CPU offloading...")
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        max_memory=max_memory,
        offload_folder="./offload_tmp"
    )

    print("Loading adapters...")
    model = PeftModel.from_pretrained(base_model, adapter_model_name)

    print("Merging...")
    model = model.merge_and_unload()

    print(f"Saving to {output_path}...")
    model.save_pretrained(output_path, safe_serialization=True, max_shard_size="2GB")

    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    tokenizer.save_pretrained(output_path)

    print("✅ Complete!")

# Run with 8GB GPU limit
merge_with_offload(max_memory_gb=8)
```
## Quantization Options
After merging, you can quantize for reduced memory usage:
### 8-bit Quantization (bitsandbytes)
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
quantization_config = BitsAndBytesConfig(
load_in_8bit=True,
llm_int8_threshold=6.0
)
model = AutoModelForCausalLM.from_pretrained(
"./apollo-astralis-8b-merged",
quantization_config=quantization_config,
device_map="auto"
)
# Model now uses ~8GB instead of ~16GB
```
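### 4-bit Quantization (bitsandbytes)
If 8-bit is still too large for your hardware, bitsandbytes also supports 4-bit NF4 loading. The sketch below uses the same merged checkpoint path; the memory figure is a rough estimate.
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "./apollo-astralis-8b-merged",
    quantization_config=quantization_config,
    device_map="auto"
)
# Roughly ~5-6GB of GPU memory, with some additional quality loss vs. 8-bit
```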
### GGUF Quantization (llama.cpp)
Available quantization formats:
- **Q4_K_M** (4.7GB) - Recommended balance of size and quality
- **Q5_K_M** (5.7GB) - Higher quality, slightly larger
- **Q8_0** (8.5GB) - Near-original quality
- **Q2_K** (3.4GB) - Smallest, noticeable quality loss
```bash
# Quantize to different formats
./llama-quantize apollo-astralis-8b-f16.gguf apollo_astralis_8b.gguf Q4_K_M
./llama-quantize apollo-astralis-8b-f16.gguf apollo-astralis-8b-Q5_K_M.gguf Q5_K_M
./llama-quantize apollo-astralis-8b-f16.gguf apollo-astralis-8b-Q8_0.gguf Q8_0
```
## Verification After Merge
Test your merged model:
```python
def test_merged_model(model_path):
    """Quick test to verify merged model works correctly."""
    from transformers import AutoTokenizer, AutoModelForCausalLM
    import torch

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )

    # Test prompt
    test_prompt = "Solve for x: 2x + 5 = 17"
    inputs = tokenizer(test_prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=256,
            temperature=0.7,
            do_sample=True
        )

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print("Test Response:")
    print(response)

    # Check for Apollo characteristics
    checks = {
        "thinking_blocks": "<think>" in response or "step" in response.lower(),
        "friendly_tone": any(word in response.lower() for word in ["let's", "great", "!"]),
        "mathematical": "x" in response and ("=" in response or "17" in response)
    }

    print("\n✅ Verification:")
    for check, passed in checks.items():
        print(f"  {check}: {'✅' if passed else '❌'}")

    return all(checks.values())

# Run verification
test_merged_model("./apollo-astralis-8b-merged")
```
## Troubleshooting
### "Out of memory during merge"
**Solution**: Use memory-efficient merge with CPU offloading (see above)
### "Merged model gives different outputs"
**Solution**: Ensure you're using the same generation parameters (temperature, top_p, etc.)
### "Cannot load merged model"
**Solution**: Check that your PyTorch and Transformers versions match the ones used when creating the merge (a quick version check is shown below)
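A simple way to capture the relevant library versions on both the merge machine and the inference machine, so they can be compared:
```python
# Print the library versions relevant to merging/loading
import torch
import transformers
import peft

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)
```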
### "GGUF conversion fails"
**Solution**:
1. Ensure merged model is in HuggingFace format (not PEFT)
2. Update llama.cpp to latest version
3. Check model has proper config.json
## Performance Comparison
| Method | Inference Speed | Memory Usage | Setup Time | Production Ready |
|--------|----------------|--------------|------------|------------------|
| PEFT Adapters | ~90% base speed | ~16GB | Instant | ⭐ |
| Merged FP16 | 100% base speed | ~16GB | 5-10 min | ⭐⭐ |
| Merged + 8-bit | ~85% base speed | ~8GB | 5-10 min | ⭐⭐ |
| GGUF Q4_K_M | ~95% base speed | ~5GB | 15-20 min | ⭐⭐⭐ |
## Recommended Workflow
- **For Development**: Use PEFT adapters directly
- **For Production (Python)**: Merge to FP16 or 8-bit
- **For Production (Ollama/Local)**: Convert to GGUF Q4_K_M
## Additional Resources
- **llama.cpp**: https://github.com/ggerganov/llama.cpp
- **PEFT Documentation**: https://huggingface.co/docs/peft
- **Transformers Guide**: https://huggingface.co/docs/transformers
- **Ollama**: https://ollama.ai
## Support
If you encounter issues with merging or conversion:
- Check GitHub issues: https://github.com/vanta-research/apollo-astralis-8b/issues
- HuggingFace discussions: https://huggingface.co/vanta-research/apollo-astralis-8b/discussions
- Email: [email protected]
---
*Apollo Astralis 8B - Merge with confidence!*