---
license: mit
language:
- en
tags:
- mlx
- apple-silicon
- multimodal
- vision-language
- pixtral
- llava
- quantized
- 3bit
- 4bit
- 5bit
- 6bit
pipeline_tag: image-text-to-text
library_name: mlx
base_model:
- ServiceNow-AI/Apriel-1.5-15b-Thinker
---

# Apriel-1.5-15B-Thinker — **MLX 3-bit** (Apple Silicon)

**Format:** MLX (Mac, Apple Silicon)  
**Quantization:** **3-bit** (balanced footprint ↔ quality)  
**Base:** ServiceNow-AI/Apriel-1.5-15b-Thinker  
**Architecture:** Pixtral-style LLaVA (vision encoder → 2-layer projector → decoder)

This repository provides a **3-bit MLX** build of Apriel-1.5-15B-Thinker for **on-device** multimodal inference on Apple Silicon. In side-by-side tests, the **3-bit** variant often:
- uses **significantly less RAM** than 6-bit,
- decodes **faster**, and
- tends to produce **more direct answers** (less “thinking out loud”) at low temperature.

If RAM allows, we also suggest trying **4-bit/5-bit/6-bit** variants (guidance below) for tasks that demand more fidelity.

> Explore other Apriel MLX variants under the `mlx-community` namespace on the Hub.

---

## 🔎 Upstream → MLX summary

Apriel-1.5-15B-Thinker is a multimodal reasoning VLM built via **depth upscaling**, **two-stage multimodal continual pretraining**, and **SFT with explicit reasoning traces** (math, coding, science, tool-use).  
This MLX release converts the upstream checkpoint with **3-bit** quantization for a smaller memory footprint and fast startup on macOS.

---

## 📦 Contents

- `config.json` (MLX config for Pixtral-style VLM)  
- `mlx_model*.safetensors` (3-bit shards)  
- `tokenizer.json`, `tokenizer_config.json`  
- `processor_config.json` / `image_processor.json`  
- `model_index.json` and metadata
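
If you want these files on disk before going offline, one option (not required, since mlx-vlm can also pull from the Hub directly) is `huggingface_hub.snapshot_download`. A minimal sketch; `<this-repo-id>` is a placeholder for this repo's Hub id:

```python
# Sketch: fetch the files listed above for offline use (assumes `huggingface_hub` is installed).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="<this-repo-id>")  # placeholder for this repo's Hub id
print(local_dir)  # contains config.json, the 3-bit safetensors shards, tokenizer and processor files
```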

---

## 🚀 Quickstart (CLI)

**Single image caption**
```bash
python -m mlx_vlm.generate \
  --model <this-repo-id> \
  --image /path/to/image.jpg \
  --prompt "Describe this image in two concise sentences." \
  --max-tokens 128 --temperature 0.0 --device mps --seed 0
```
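
**Single image caption (Python API)**

If you prefer calling the model from Python, mlx-vlm also exposes a Python interface. The sketch below follows its documented `load`/`generate` pattern; exact signatures can differ between mlx-vlm versions, and `<this-repo-id>` is a placeholder for this repo's Hub id.

```python
# Minimal Python-API sketch (assumes `pip install mlx-vlm`; signatures may vary by version).
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "<this-repo-id>"  # placeholder for this repo's Hub id
model, processor = load(model_path)
config = load_config(model_path)

images = ["/path/to/image.jpg"]
prompt = "Describe this image in two concise sentences."

# Wrap the prompt in the model's chat template, inserting the image token(s).
formatted = apply_chat_template(processor, config, prompt, num_images=len(images))

output = generate(model, processor, formatted, images, max_tokens=128, verbose=False)
print(output)
```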

## 🔀 Model Family Comparison (2-bit → 6-bit)

> **TL;DR:** Start with **3-bit** for the best size↔quality trade-off. If you need finer OCR/diagram detail and have RAM, step up to **4-bit/5-bit**. Use **6-bit** only when you have headroom and you explicitly instruct concision.

### 📊 Quick Comparison

| Variant | 🧠 Peak RAM\* | ⚡ Speed (rel.) | 🗣️ Output Style (typical) | ✅ Best For | ⚠️ Watch Out For |
|---|---:|:---:|---|---|---|
| **2-bit** | ~7–8 GB | 🔥🔥🔥🔥 | Shortest, most lossy | Minimal RAM demos, quick triage | Detail loss on OCR/dense charts; more omissions |
| **3-bit** | **~9–10 GB** | **🔥🔥🔥🔥** | **Direct, concise** | Default on M1/M2/M3; day-to-day use | May miss tiny text; keep prompts precise |
| **4-bit** | ~11–12.5 GB | 🔥🔥🔥 | More detail retained | Docs/UIs with small text; charts | Slightly slower; still quantization artifacts |
| **5-bit** | ~13–14 GB | 🔥🔥☆ | Higher fidelity | Heavier document/diagram tasks | Needs more RAM; occasional verbose answers |
| **6-bit** | ~14.5–16 GB | 🔥🔥 | Highest MLX fidelity | Max quality under quant | Can “think aloud”; add *be concise* instruction |

\*Indicative for a ~15B VLM under MLX; exact numbers vary with device, image size, and context length.

---

### 🧪 Example (COCO `000000039769.jpg` — “two cats on a pink couch”)

| Variant | ⏱️ Prompt TPS | ⏱️ Gen TPS | 📈 Peak RAM | 📝 Notes |
|---|---:|---:|---:|---|
| **3-bit** | ~79 tok/s | **~9.79 tok/s** | **~9.57 GB** | Direct answer; minimal “reasoning” leakage |
| **6-bit** | ~78 tok/s | ~6.50 tok/s | ~14.81 GB | Sometimes prints “Here are my reasoning steps…” |

> Settings: `--temperature 0.0 --max-tokens 100 --device mps`. Results vary by Mac model and image resolution; trend is consistent.

---

### 🧭 Choosing the Right Precision

- **I just want it to work on my Mac:** 👉 **3-bit**  
- **Tiny fonts / invoices / UI text matter:** 👉 **4-bit**, then **5-bit** if RAM allows  
- **I need every drop of quality and have ≥16 GB free:** 👉 **6-bit** (add *“Answer directly; do not include reasoning.”*)  
- **I have very little RAM:** 👉 **2-bit** (expect noticeable quality loss)

---

### ⚙️ Suggested Settings (per variant)

| Variant | Max Tokens | Temp | Seed | Notes |
|---|---:|---:|---:|---|
| **2-bit** | 64–96 | 0.0 | 0 | Keep short; single image; expect omissions |
| **3-bit** | 96–128 | 0.0 | 0 | Great default; concise prompts help |
| **4-bit** | 128–192 | 0.0–0.2 | 0 | Better small-text recall; watch RAM |
| **5-bit** | 128–256 | 0.0–0.2 | 0 | Best OCR fidelity short of 6-bit |
| **6-bit** | 128–256 | 0.0 | 0 | Add anti-CoT phrasing (see below) |

**Anti-CoT prompt add-on (any bit-width):**  
> *“Answer directly. Do **not** include your reasoning steps.”*

(Optional) Add a stop string if your stack supports it (e.g., stop at `"\nHere are my reasoning steps:"`).
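
If your stack has no native stop-string option, the same effect can be approximated in post-processing. A hedged sketch building on the Python-API example above; `generate_answer`, `ANTI_COT`, and `STOP` are illustrative names, not part of mlx-vlm:

```python
# Sketch: prepend the anti-CoT instruction, then truncate at a stop string after generation.
from mlx_vlm import generate
from mlx_vlm.prompt_utils import apply_chat_template

ANTI_COT = "Answer directly. Do not include your reasoning steps.\n\n"
STOP = "\nHere are my reasoning steps:"

def generate_answer(model, processor, config, question, images, max_tokens=192):
    formatted = apply_chat_template(processor, config, ANTI_COT + question, num_images=len(images))
    out = generate(model, processor, formatted, images, max_tokens=max_tokens, verbose=False)
    text = out if isinstance(out, str) else getattr(out, "text", str(out))  # some versions return a result object
    # No native stop-string support? Cut the tail off here instead.
    return text.split(STOP, 1)[0].strip()
```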

---

### 🛠️ One-liners (swap model IDs)

```bash
# 2-bit
python -m mlx_vlm.generate --model <2bit-repo> --image img.jpg --prompt "Describe this image." \
  --max-tokens 96 --temperature 0.0 --device mps --seed 0

# 3-bit (recommended default)
python -m mlx_vlm.generate --model <3bit-repo> --image img.jpg --prompt "Describe this image in two sentences." \
  --max-tokens 128 --temperature 0.0 --device mps --seed 0

# 4-bit
python -m mlx_vlm.generate --model <4bit-repo> --image img.jpg --prompt "Summarize the document and read key totals." \
  --max-tokens 160 --temperature 0.1 --device mps --seed 0

# 5-bit
python -m mlx_vlm.generate --model <5bit-repo> --image img.jpg --prompt "Extract the fields (date, total, vendor) from this invoice." \
  --max-tokens 192 --temperature 0.1 --device mps --seed 0

# 6-bit
python -m mlx_vlm.generate --model <6bit-repo> --image img.jpg \
  --prompt "Answer directly. Do not include your reasoning steps.\n\nDescribe this image clearly." \
  --max-tokens 192 --temperature 0.0 --device mps --seed 0