# Apollo Astralis 8B - Adapter Merge Guide

## Overview

This guide explains how to use the Apollo Astralis 8B LoRA adapters with the base Qwen3-8B model. You can:

1. **Use adapters directly** with PEFT (recommended for development)
2. **Merge adapters** into the base model (recommended for production)
3. **Convert the merged model to GGUF** for local deployment with Ollama

## Option 1: Use Adapters with PEFT (Recommended)

The simplest approach - no merging required:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load base model
base_model = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Load and apply LoRA adapters
model = PeftModel.from_pretrained(model, "vanta-research/apollo-astralis-8b")
model.eval()

print("Apollo Astralis 8B ready!")
```

### Advantages
- βœ… Simple and straightforward
- βœ… No extra disk space required
- βœ… Can easily swap between base and fine-tuned models
- βœ… Faster initial loading

### Disadvantages
- ❌ Slightly slower inference (adapter application overhead)
- ❌ Requires PEFT library

## Option 2: Merge Adapters into Base Model

For production deployments requiring maximum inference speed:

### Step 1: Install Dependencies

```bash
pip install torch transformers peft accelerate
```

### Step 2: Merge Script

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

def merge_adapters(
    base_model_name="Qwen/Qwen3-8B",
    adapter_model_name="vanta-research/apollo-astralis-8b",
    output_path="./apollo-astralis-8b-merged"
):
    """Merge LoRA adapters into base model."""
    
    print(f"Loading base model: {base_model_name}")
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )
    
    print(f"Loading adapters: {adapter_model_name}")
    model = PeftModel.from_pretrained(base_model, adapter_model_name)
    
    print("Merging adapters into base model...")
    model = model.merge_and_unload()
    
    print(f"Saving merged model to: {output_path}")
    model.save_pretrained(output_path, safe_serialization=True)
    
    print("Saving tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    tokenizer.save_pretrained(output_path)
    
    print("βœ… Merge complete!")
    return model

# Run merge
merged_model = merge_adapters()
```

### Step 3: Use Merged Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load merged model
model = AutoModelForCausalLM.from_pretrained(
    "./apollo-astralis-8b-merged",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("./apollo-astralis-8b-merged")

# Use normally - no PEFT required!
```
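
As a quick sanity check, here is a minimal generation sketch that reuses the `model` and `tokenizer` loaded above. It assumes the merged model keeps the base model's chat template; the prompt is just an illustration:

```python
import torch

# Format the request with the tokenizer's chat template
messages = [{"role": "user", "content": "Solve for x: 2x + 5 = 17"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        do_sample=True
    )

# Print only the newly generated text
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```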

### Advantages
- βœ… Faster inference (no adapter overhead)
- βœ… No PEFT dependency required
- βœ… Easier to quantize and convert to other formats
- βœ… Better for production deployment

### Disadvantages
- ❌ Requires ~16GB disk space for merged model
- ❌ One-time merge process required
- ❌ Cannot easily swap back to base model

## Option 3: Convert to GGUF for Ollama

For efficient local deployment with Ollama:

### Step 1: Merge Adapters (see Option 2)

### Step 2: Convert to GGUF

```bash
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Install Python dependencies for the conversion script
pip install -r requirements.txt

# Build the llama.cpp binaries (needed for llama-quantize below;
# binary paths can vary by version, see the llama.cpp README)
cmake -B build
cmake --build build --config Release

# Convert merged model to GGUF FP16
python convert_hf_to_gguf.py ../apollo-astralis-8b-merged/ \
  --outfile apollo-astralis-8b-f16.gguf \
  --outtype f16

# Quantize to Q4_K_M (recommended)
./build/bin/llama-quantize apollo-astralis-8b-f16.gguf \
    apollo_astralis_8b.gguf Q4_K_M
```

### Step 3: Deploy with Ollama

```bash
# Create Modelfile
cat > Modelfile <<EOF
FROM ./apollo_astralis_8b.gguf

TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

PARAMETER num_predict 256
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.15
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>

SYSTEM """You are Apollo, a collaborative AI assistant specializing in reasoning and problem-solving. You approach each question with genuine curiosity and enthusiasm, breaking down complex problems into clear steps. When you're uncertain, you think through possibilities openly and invite collaboration. Your goal is to help users understand not just the answer, but the reasoning process itself."""
EOF

# Create Ollama model
ollama create apollo-astralis -f Modelfile

# Run it!
ollama run apollo-astralis
```
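
Once the model is created, you can also query it programmatically. The sketch below uses Ollama's local HTTP API, assuming the default endpoint at `http://localhost:11434` and the `requests` package:

```python
import requests

# Ask the locally running Ollama server for a single (non-streamed) completion
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "apollo-astralis",
        "prompt": "Solve for x: 2x + 5 = 17",
        "stream": False,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```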

## Memory-Efficient Merge (For Limited RAM)

If you have limited system RAM, use CPU offloading:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

def merge_with_offload(
    base_model_name="Qwen/Qwen3-8B",
    adapter_model_name="vanta-research/apollo-astralis-8b",
    output_path="./apollo-astralis-8b-merged",
    max_memory_gb=8
):
    """Merge with CPU offloading for limited RAM."""
    
    # Calculate max memory per device
    max_memory = {
        0: f"{max_memory_gb}GB",  # GPU
        "cpu": "30GB"  # CPU fallback
    }
    
    print("Loading base model with CPU offloading...")
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        max_memory=max_memory,
        offload_folder="./offload_tmp"
    )
    
    print("Loading adapters...")
    model = PeftModel.from_pretrained(base_model, adapter_model_name)
    
    print("Merging...")
    model = model.merge_and_unload()
    
    print(f"Saving to {output_path}...")
    model.save_pretrained(output_path, safe_serialization=True, max_shard_size="2GB")
    
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    tokenizer.save_pretrained(output_path)
    
    print("βœ… Complete!")

# Run with 8GB GPU limit
merge_with_offload(max_memory_gb=8)
```

## Quantization Options

After merging, you can quantize for reduced memory usage:

### 8-bit Quantization (bitsandbytes)

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)

model = AutoModelForCausalLM.from_pretrained(
    "./apollo-astralis-8b-merged",
    quantization_config=quantization_config,
    device_map="auto"
)

# Model now uses ~8GB instead of ~16GB
```

### GGUF Quantization (llama.cpp)

Available quantization formats:
- **Q4_K_M** (4.7GB) - Recommended balance of size and quality
- **Q5_K_M** (5.7GB) - Higher quality, slightly larger
- **Q8_0** (8.5GB) - Near-original quality
- **Q2_K** (3.4GB) - Smallest, noticeable quality loss

```bash
# Quantize to different formats (run from the llama.cpp directory after building)
./build/bin/llama-quantize apollo-astralis-8b-f16.gguf apollo_astralis_8b.gguf Q4_K_M
./build/bin/llama-quantize apollo-astralis-8b-f16.gguf apollo-astralis-8b-Q5_K_M.gguf Q5_K_M
./build/bin/llama-quantize apollo-astralis-8b-f16.gguf apollo-astralis-8b-Q8_0.gguf Q8_0
```

## Verification After Merge

Test your merged model:

```python
def test_merged_model(model_path):
    """Quick test to verify merged model works correctly."""
    from transformers import AutoTokenizer, AutoModelForCausalLM
    import torch
    
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )
    
    # Test prompt
    test_prompt = "Solve for x: 2x + 5 = 17"
    
    # Format with the chat template so the fine-tuned chat behavior is exercised
    messages = [{"role": "user", "content": test_prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=256,
            temperature=0.7,
            do_sample=True
        )
    
    # Decode only the newly generated tokens (exclude the echoed prompt)
    response = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    print("Test Response:")
    print(response)
    
    # Check for Apollo characteristics
    checks = {
        "thinking_blocks": "<think>" in response or "step" in response.lower(),
        "friendly_tone": any(word in response.lower() for word in ["let's", "great", "!"]),
        "mathematical": "x" in response and ("=" in response or "17" in response)
    }
    
    print("\nβœ… Verification:")
    for check, passed in checks.items():
        print(f"  {check}: {'βœ“' if passed else 'βœ—'}")
    
    return all(checks.values())

# Run verification
test_merged_model("./apollo-astralis-8b-merged")
```

## Troubleshooting

### "Out of memory during merge"
**Solution**: Use memory-efficient merge with CPU offloading (see above)

### "Merged model gives different outputs"
**Solution**: Ensure you're using the same generation parameters (temperature, top_p, etc.)
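
For a strict comparison between the adapter-based and merged models, greedy decoding removes sampling noise entirely. A minimal sketch, assuming a `model`, `tokenizer`, and tokenized `inputs` prepared as in the verification script above:

```python
import torch

# Greedy decoding (no sampling) makes outputs reproducible across runs
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```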

### "Cannot load merged model"
**Solution**: Check that your PyTorch and Transformers versions match the ones used for merging (a quick version check is sketched below)
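
A quick way to record the versions in the merge environment and compare them where the model is loaded (a minimal sketch; `peft` is only relevant to the merge step itself):

```python
import torch
import transformers
import peft

# Print the versions so the merge and inference environments can be compared
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)
```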

### "GGUF conversion fails"
**Solution**: 
1. Ensure merged model is in HuggingFace format (not PEFT)
2. Update llama.cpp to latest version
3. Check model has proper config.json

## Performance Comparison

| Method | Inference Speed | Memory Usage | Setup Time | Production Ready |
|--------|----------------|--------------|------------|------------------|
| PEFT Adapters | ~90% base speed | ~16GB | Instant | βœ“ |
| Merged FP16 | 100% base speed | ~16GB | 5-10 min | βœ“βœ“ |
| Merged + 8-bit | ~85% base speed | ~8GB | 5-10 min | βœ“βœ“ |
| GGUF Q4_K_M | ~95% base speed | ~5GB | 15-20 min | βœ“βœ“βœ“ |

## Recommended Workflow

- **For Development**: Use PEFT adapters directly
- **For Production (Python)**: Merge to FP16 or 8-bit
- **For Production (Ollama/Local)**: Convert to GGUF Q4_K_M

## Additional Resources

- **llama.cpp**: https://github.com/ggerganov/llama.cpp
- **PEFT Documentation**: https://huggingface.co/docs/peft
- **Transformers Guide**: https://huggingface.co/docs/transformers
- **Ollama**: https://ollama.ai

## Support

If you encounter issues with merging or conversion:
- Check GitHub issues: https://github.com/vanta-research/apollo-astralis-8b/issues
- HuggingFace discussions: https://huggingface.co/vanta-research/apollo-astralis-8b/discussions
- Email: [email protected]

---

*Apollo Astralis 8B - Merge with confidence! πŸš€*