---
license: mit
language:
- en
tags:
- mlx
- apple-silicon
- multimodal
- vision-language
- pixtral
- llava
- quantized
- 3bit
- 4bit
- 5bit
- 6bit
pipeline_tag: image-text-to-text
library_name: mlx
base_model:
- ServiceNow-AI/Apriel-1.5-15b-Thinker
---
# Apriel-1.5-15B-Thinker — **MLX 3-bit** (Apple Silicon)
**Format:** MLX (Mac, Apple Silicon)
**Quantization:** **3-bit** (balanced footprint ↔ quality)
**Base:** ServiceNow-AI/Apriel-1.5-15B-Thinker
**Architecture:** Pixtral-style LLaVA (vision encoder → 2-layer projector → decoder)
This repository provides a **3-bit MLX** build of Apriel-1.5-15B-Thinker for **on-device** multimodal inference on Apple Silicon. In side-by-side tests, the **3-bit** variant often:
- uses **significantly less RAM** than 6-bit,
- decodes **faster**, and
- tends to produce **more direct answers** (less “thinking out loud”) at low temperature.
If RAM allows, we also suggest trying **4-bit/5-bit/6-bit** variants (guidance below) for tasks that demand more fidelity.
> Explore other Apriel MLX variants under the `mlx-community` namespace on the Hub.
---
## 🔎 Upstream → MLX summary
Apriel-1.5-15B-Thinker is a multimodal reasoning VLM built via **depth upscaling**, **two-stage multimodal continual pretraining**, and **SFT with explicit reasoning traces** (math, coding, science, tool-use).
This MLX release converts the upstream checkpoint with **3-bit** quantization for smaller memory and quick startup on macOS.
---
## 📦 Contents
- `config.json` (MLX config for Pixtral-style VLM)
- `mlx_model*.safetensors` (3-bit shards)
- `tokenizer.json`, `tokenizer_config.json`
- `processor_config.json` / `image_processor.json`
- `model_index.json` and metadata
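To confirm which quantization a downloaded snapshot carries, you can read the quantization block that MLX conversions typically write into `config.json`. A minimal sketch, assuming the snapshot is already on disk and that the config contains a `quantization` entry (key names may differ between mlx_vlm versions):
```python
import json
from pathlib import Path

# Path to a local snapshot of this repo (assumed already downloaded,
# e.g. via `huggingface_hub.snapshot_download`).
snapshot = Path("./apriel-1.5-15b-thinker-3bit-mlx")

with open(snapshot / "config.json") as f:
    config = json.load(f)

# MLX conversions usually record bit-width and group size here;
# the exact key names are an assumption.
quant = config.get("quantization", {})
print(f"bits: {quant.get('bits')}, group_size: {quant.get('group_size')}")
```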
---
## 🚀 Quickstart (CLI)
**Single image caption**
```bash
python -m mlx_vlm.generate \
--model <this-repo-id> \
--image /path/to/image.jpg \
--prompt "Describe this image in two concise sentences." \
--max-tokens 128 --temperature 0.0 --device mps --seed 0
```
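**Python API (optional)**
The same call can be made from Python. This is a sketch only: argument names, order, and return types have shifted across `mlx_vlm` releases, so treat the parameter names below as assumptions and check your installed version.
```python
# Python-API sketch (parameter names/order are assumptions; verify against
# your installed mlx_vlm release).
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "<this-repo-id>"  # or a local snapshot directory
model, processor = load(model_path)
config = load_config(model_path)

# Wrap the user prompt in the model's chat template for a single image.
prompt = apply_chat_template(
    processor, config, "Describe this image in two concise sentences.", num_images=1
)

# verbose=True prints the decoded text plus speed/memory stats.
generate(
    model,
    processor,
    prompt,
    image=["/path/to/image.jpg"],
    max_tokens=128,
    temperature=0.0,
    verbose=True,
)
```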
## 🔀 Model Family Comparison (2-bit → 6-bit)
> **TL;DR:** Start with **3-bit** for the best size↔quality trade-off. If you need finer OCR/diagram detail and have RAM, step up to **4-bit/5-bit**. Use **6-bit** only when you have headroom and you explicitly instruct concision.
### 📊 Quick Comparison
| Variant | 🧠 Peak RAM\* | ⚡ Speed (rel.) | 🗣️ Output Style (typical) | ✅ Best For | ⚠️ Watch Out For |
|---|---:|:---:|---|---|---|
| **2-bit** | ~7–8 GB | 🔥🔥🔥🔥 | Shortest, most lossy | Minimal RAM demos, quick triage | Detail loss on OCR/dense charts; more omissions |
| **3-bit** | **~9–10 GB** | **🔥🔥🔥🔥** | **Direct, concise** | Default on M1/M2/M3; day-to-day use | May miss tiny text; keep prompts precise |
| **4-bit** | ~11–12.5 GB | 🔥🔥🔥 | More detail retained | Docs/UIs with small text; charts | Slightly slower; still quantization artifacts |
| **5-bit** | ~13–14 GB | 🔥🔥☆ | Higher fidelity | Heavier document/diagram tasks | Needs more RAM; occasional verbose answers |
| **6-bit** | ~14.5–16 GB | 🔥🔥 | Highest MLX fidelity | Max quality under quant | Can “think aloud”; add *be concise* instruction |
\*Indicative for a ~15B VLM under MLX; exact numbers vary with device, image size, and context length.
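These figures can be sanity-checked with back-of-the-envelope arithmetic: quantized weights take roughly `params × bits / 8` bytes, and the vision tower, KV cache, and runtime buffers add a few GB on top. A rough sketch (the parameter count and the ~3.5 GB overhead are assumptions for illustration, not measurements):
```python
# Back-of-the-envelope memory estimate for a ~15B-parameter VLM under MLX.
# Overhead (vision tower, KV cache, runtime buffers) is assumed ~3.5 GB and
# varies with image size and context length.
PARAMS = 15e9
OVERHEAD_GB = 3.5

for bits in (2, 3, 4, 5, 6):
    weights_gb = PARAMS * bits / 8 / 1e9  # bytes -> GB (decimal)
    print(f"{bits}-bit: ~{weights_gb:.1f} GB weights "
          f"+ ~{OVERHEAD_GB:.1f} GB overhead ≈ {weights_gb + OVERHEAD_GB:.1f} GB")
```
For 3-bit this lands around 9 GB, in line with the table above.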
---
### 🧪 Example (COCO `000000039769.jpg` — “two cats on a pink couch”)
| Variant | ⏱️ Prompt TPS | ⏱️ Gen TPS | 📈 Peak RAM | 📝 Notes |
|---|---:|---:|---:|---|
| **3-bit** | ~79 tok/s | **~9.79 tok/s** | **~9.57 GB** | Direct answer; minimal “reasoning” leakage |
| **6-bit** | ~78 tok/s | ~6.50 tok/s | ~14.81 GB | Sometimes prints “Here are my reasoning steps…” |
> Settings: `--temperature 0.0 --max-tokens 100 --device mps`. Results vary by Mac model and image resolution; trend is consistent.
---
### 🧭 Choosing the Right Precision
- **I just want it to work on my Mac:** 👉 **3-bit**
- **Tiny fonts / invoices / UI text matter:** 👉 **4-bit**, then **5-bit** if RAM allows
- **I need every drop of quality and have ≥16 GB free:** 👉 **6-bit** (add *“Answer directly; do not include reasoning.”*)
- **I have very little RAM:** 👉 **2-bit** (expect noticeable quality loss)
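If you script your setup, the decision list above collapses into a small helper. A sketch; the thresholds are the indicative RAM figures from the table, not hard limits:
```python
def pick_variant(free_ram_gb: float, needs_small_text: bool = False) -> str:
    """Rough variant picker based on the indicative RAM figures above."""
    if free_ram_gb < 9:
        return "2-bit"   # minimal RAM; expect noticeable quality loss
    if needs_small_text and free_ram_gb >= 14:
        return "5-bit"   # heavier document/diagram work
    if needs_small_text and free_ram_gb >= 12.5:
        return "4-bit"   # small fonts / invoices / UI text
    if free_ram_gb >= 16:
        return "6-bit"   # max fidelity; add anti-CoT phrasing
    return "3-bit"       # default sweet spot

print(pick_variant(10))                         # -> 3-bit
print(pick_variant(13, needs_small_text=True))  # -> 4-bit
```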
---
### ⚙️ Suggested Settings (per variant)
| Variant | Max Tokens | Temp | Seed | Notes |
|---|---:|---:|---:|---|
| **2-bit** | 64–96 | 0.0 | 0 | Keep short; single image; expect omissions |
| **3-bit** | 96–128 | 0.0 | 0 | Great default; concise prompts help |
| **4-bit** | 128–192 | 0.0–0.2 | 0 | Better small-text recall; watch RAM |
| **5-bit** | 128–256 | 0.0–0.2 | 0 | Best OCR fidelity below 6-bit |
| **6-bit** | 128–256 | 0.0 | 0 | Add anti-CoT phrasing (see below) |
**Anti-CoT prompt add-on (any bit-width):**
> *“Answer directly. Do **not** include your reasoning steps.”*
(Optional) Add a stop string if your stack supports it (e.g., stop at `"\nHere are my reasoning steps:"`).
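If your stack does not expose stop strings, the same effect is easy to get with post-processing in your own wrapper. A minimal sketch; the marker string is just an example of a reasoning preamble this model sometimes emits:
```python
def strip_reasoning(text: str, markers=("\nHere are my reasoning steps:",)) -> str:
    """Cut generated text at the first occurrence of any reasoning marker."""
    for marker in markers:
        idx = text.find(marker)
        if idx != -1:
            text = text[:idx]
    return text.rstrip()

raw = "The image shows two cats on a pink couch.\nHere are my reasoning steps: ..."
print(strip_reasoning(raw))  # -> "The image shows two cats on a pink couch."
```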
---
### 🛠️ One-liners (swap model IDs)
```bash
# 2-bit
python -m mlx_vlm.generate --model <2bit-repo> --image img.jpg --prompt "Describe this image." \
--max-tokens 96 --temperature 0.0 --device mps --seed 0
# 3-bit (recommended default)
python -m mlx_vlm.generate --model <3bit-repo> --image img.jpg --prompt "Describe this image in two sentences." \
--max-tokens 128 --temperature 0.0 --device mps --seed 0
# 4-bit
python -m mlx_vlm.generate --model <4bit-repo> --image img.jpg --prompt "Summarize the document and read key totals." \
--max-tokens 160 --temperature 0.1 --device mps --seed 0
# 5-bit
python -m mlx_vlm.generate --model <5bit-repo> --image img.jpg --prompt "Extract the fields (date, total, vendor) from this invoice." \
--max-tokens 192 --temperature 0.1 --device mps --seed 0
# 6-bit
python -m mlx_vlm.generate --model <6bit-repo> --image img.jpg \
--prompt "Answer directly. Do not include your reasoning steps.\n\nDescribe this image clearly." \
--max-tokens 192 --temperature 0.0 --device mps --seed 0
```