---
license: mit
language:
- en
tags:
- mlx
- apple-silicon
- multimodal
- vision-language
- pixtral
- llava
- quantized
- 3bit
- 4bit
- 5bit
- 6bit
pipeline_tag: image-text-to-text
library_name: mlx
base_model:
- ServiceNow-AI/Apriel-1.5-15b-Thinker
---

# Apriel-1.5-15B-Thinker — **MLX 3-bit** (Apple Silicon)

**Format:** MLX (Mac, Apple Silicon)  
**Quantization:** **3-bit** (balanced footprint ↔ quality)  
**Base:** ServiceNow-AI/Apriel-1.5-15b-Thinker  
**Architecture:** Pixtral-style LLaVA (vision encoder → 2-layer projector → decoder)

This repository provides a **3-bit MLX** build of Apriel-1.5-15B-Thinker for **on-device** multimodal inference on Apple Silicon. In side-by-side tests, the **3-bit** variant often:
- uses **significantly less RAM** than 6-bit,
- decodes **faster**, and
- tends to produce **more direct answers** (less “thinking out loud”) at low temperature.

If RAM allows, we also suggest trying **4-bit/5-bit/6-bit** variants (guidance below) for tasks that demand more fidelity.

> Explore other Apriel MLX variants under the `mlx-community` namespace on the Hub.

---

## 🔎 Upstream → MLX summary

Apriel-1.5-15B-Thinker is a multimodal reasoning VLM built via **depth upscaling**, **two-stage multimodal continual pretraining**, and **SFT with explicit reasoning traces** (math, coding, science, tool-use).  
This MLX release converts the upstream checkpoint with **3-bit** quantization for a smaller memory footprint and fast startup on macOS.

---

## 📦 Contents

- `config.json` (MLX config for Pixtral-style VLM)  
- `mlx_model*.safetensors` (3-bit shards)  
- `tokenizer.json`, `tokenizer_config.json`  
- `processor_config.json` / `image_processor.json`  
- `model_index.json` and metadata
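
If you want these files on disk before going offline, one option (not required, since mlx-vlm can also pull from the Hub directly) is `huggingface_hub.snapshot_download`. A minimal sketch; `<this-repo-id>` is a placeholder for this repo's Hub id:

```python
# Sketch: fetch the files listed above for offline use (assumes `huggingface_hub` is installed).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="<this-repo-id>")  # placeholder for this repo's Hub id
print(local_dir)  # contains config.json, the 3-bit safetensors shards, tokenizer and processor files
```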

---

## 🚀 Quickstart (CLI)

**Single image caption**
```bash
python -m mlx_vlm.generate \
  --model <this-repo-id> \
  --image /path/to/image.jpg \
  --prompt "Describe this image in two concise sentences." \
  --max-tokens 128 --temperature 0.0 --device mps --seed 0
```
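
**Single image caption (Python API)**

If you prefer calling the model from Python, mlx-vlm also exposes a Python interface. The sketch below follows its documented `load`/`generate` pattern; exact signatures can differ between mlx-vlm versions, and `<this-repo-id>` is a placeholder for this repo's Hub id.

```python
# Minimal Python-API sketch (assumes `pip install mlx-vlm`; signatures may vary by version).
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "<this-repo-id>"  # placeholder for this repo's Hub id
model, processor = load(model_path)
config = load_config(model_path)

images = ["/path/to/image.jpg"]
prompt = "Describe this image in two concise sentences."

# Wrap the prompt in the model's chat template, inserting the image token(s).
formatted = apply_chat_template(processor, config, prompt, num_images=len(images))

output = generate(model, processor, formatted, images, max_tokens=128, verbose=False)
print(output)
```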

## 🔀 Model Family Comparison (2-bit → 6-bit)

> **TL;DR:** Start with **3-bit** for the best size↔quality trade-off. If you need finer OCR/diagram detail and have RAM, step up to **4-bit/5-bit**. Use **6-bit** only when you have headroom and you explicitly instruct concision.

### 📊 Quick Comparison

| Variant | 🧠 Peak RAM\* | ⚡ Speed (rel.) | 🗣️ Output Style (typical) | ✅ Best For | ⚠️ Watch Out For |
|---|---:|:---:|---|---|---|
| **2-bit** | ~7–8 GB | 🔥🔥🔥🔥 | Shortest, most lossy | Minimal RAM demos, quick triage | Detail loss on OCR/dense charts; more omissions |
| **3-bit** | **~9–10 GB** | **🔥🔥🔥🔥** | **Direct, concise** | Default on M1/M2/M3; day-to-day use | May miss tiny text; keep prompts precise |
| **4-bit** | ~11–12.5 GB | 🔥🔥🔥 | More detail retained | Docs/UIs with small text; charts | Slightly slower; still quantization artifacts |
| **5-bit** | ~13–14 GB | 🔥🔥☆ | Higher fidelity | Heavier document/diagram tasks | Needs more RAM; occasional verbose answers |
| **6-bit** | ~14.5–16 GB | 🔥🔥 | Highest MLX fidelity | Max quality under quant | Can “think aloud”; add *be concise* instruction |

\*Indicative for a ~15B VLM under MLX; exact numbers vary with device, image size, and context length.

---

### 🧪 Example (COCO `000000039769.jpg` — “two cats on a pink couch”)

| Variant | ⏱️ Prompt TPS | ⏱️ Gen TPS | 📈 Peak RAM | 📝 Notes |
|---|---:|---:|---:|---|
| **3-bit** | ~79 tok/s | **~9.79 tok/s** | **~9.57 GB** | Direct answer; minimal “reasoning” leakage |
| **6-bit** | ~78 tok/s | ~6.50 tok/s | ~14.81 GB | Sometimes prints “Here are my reasoning steps…” |

> Settings: `--temperature 0.0 --max-tokens 100 --device mps`. Results vary by Mac model and image resolution; trend is consistent.

---

### 🧭 Choosing the Right Precision

- **I just want it to work on my Mac:** 👉 **3-bit**  
- **Tiny fonts / invoices / UI text matter:** 👉 **4-bit**, then **5-bit** if RAM allows  
- **I need every drop of quality and have ≥16 GB free:** 👉 **6-bit** (add *“Answer directly; do not include reasoning.”*)  
- **I have very little RAM:** 👉 **2-bit** (expect noticeable quality loss)

---

### ⚙️ Suggested Settings (per variant)

| Variant | Max Tokens | Temp | Seed | Notes |
|---|---:|---:|---:|---|
| **2-bit** | 64–96 | 0.0 | 0 | Keep short; single image; expect omissions |
| **3-bit** | 96–128 | 0.0 | 0 | Great default; concise prompts help |
| **4-bit** | 128–192 | 0.0–0.2 | 0 | Better small-text recall; watch RAM |
| **5-bit** | 128–256 | 0.0–0.2 | 0 | Best OCR fidelity short of 6-bit |
| **6-bit** | 128–256 | 0.0 | 0 | Add anti-CoT phrasing (see below) |

**Anti-CoT prompt add-on (any bit-width):**  
> *“Answer directly. Do **not** include your reasoning steps.”*

(Optional) Add a stop string if your stack supports it (e.g., stop at `"\nHere are my reasoning steps:"`).
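
If your stack has no native stop-string option, the same effect can be approximated in post-processing. A hedged sketch building on the Python-API example above; `generate_answer`, `ANTI_COT`, and `STOP` are illustrative names, not part of mlx-vlm:

```python
# Sketch: prepend the anti-CoT instruction, then truncate at a stop string after generation.
from mlx_vlm import generate
from mlx_vlm.prompt_utils import apply_chat_template

ANTI_COT = "Answer directly. Do not include your reasoning steps.\n\n"
STOP = "\nHere are my reasoning steps:"

def generate_answer(model, processor, config, question, images, max_tokens=192):
    formatted = apply_chat_template(processor, config, ANTI_COT + question, num_images=len(images))
    out = generate(model, processor, formatted, images, max_tokens=max_tokens, verbose=False)
    text = out if isinstance(out, str) else getattr(out, "text", str(out))  # some versions return a result object
    # No native stop-string support? Cut the tail off here instead.
    return text.split(STOP, 1)[0].strip()
```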

---

### 🛠️ One-liners (swap model IDs)

```bash
# 2-bit
python -m mlx_vlm.generate --model <2bit-repo> --image img.jpg --prompt "Describe this image." \
  --max-tokens 96 --temperature 0.0 --device mps --seed 0

# 3-bit (recommended default)
python -m mlx_vlm.generate --model <3bit-repo> --image img.jpg --prompt "Describe this image in two sentences." \
  --max-tokens 128 --temperature 0.0 --device mps --seed 0

# 4-bit
python -m mlx_vlm.generate --model <4bit-repo> --image img.jpg --prompt "Summarize the document and read key totals." \
  --max-tokens 160 --temperature 0.1 --device mps --seed 0

# 5-bit
python -m mlx_vlm.generate --model <5bit-repo> --image img.jpg --prompt "Extract the fields (date, total, vendor) from this invoice." \
  --max-tokens 192 --temperature 0.1 --device mps --seed 0

# 6-bit
python -m mlx_vlm.generate --model <6bit-repo> --image img.jpg \
  --prompt "Answer directly. Do not include your reasoning steps.\n\nDescribe this image clearly." \
  --max-tokens 192 --temperature 0.0 --device mps --seed 0