|
|
--- |
|
|
license: mit |
|
|
language: |
|
|
- en |
|
|
tags: |
|
|
- mlx |
|
|
- apple-silicon |
|
|
- multimodal |
|
|
- vision-language |
|
|
- pixtral |
|
|
- llava |
|
|
- quantized |
|
|
- 3bit |
|
|
- 4bit |
|
|
- 5bit |
|
|
- 6bit |
|
|
pipeline_tag: image-text-to-text |
|
|
library_name: mlx |
|
|
base_model: |
|
|
- ServiceNow-AI/Apriel-1.5-15b-Thinker |
|
|
--- |
|
|
|
|
|
# Apriel-1.5-15B-Thinker — **MLX 3-bit** (Apple Silicon) |
|
|
|
|
|
**Format:** MLX (Mac, Apple Silicon) |
|
|
**Quantization:** **3-bit** (balanced footprint ↔ quality) |
|
|
**Base:** ServiceNow-AI/Apriel-1.5-15B-Thinker |
|
|
**Architecture:** Pixtral-style LLaVA (vision encoder → 2-layer projector → decoder) |
|
|
|
|
|
This repository provides a **3-bit MLX** build of Apriel-1.5-15B-Thinker for **on-device** multimodal inference on Apple Silicon. In side-by-side tests, the **3-bit** variant often: |
|
|
- uses **significantly less RAM** than 6-bit, |
|
|
- decodes **faster**, and |
|
|
- tends to produce **more direct answers** (less “thinking out loud”) at low temperature. |
|
|
|
|
|
If RAM allows, we also suggest trying **4-bit/5-bit/6-bit** variants (guidance below) for tasks that demand more fidelity. |
|
|
|
|
|
> Explore other Apriel MLX variants under the `mlx-community` namespace on the Hub. |
|
|
|
|
|
--- |
|
|
|
|
|
## 🔎 Upstream → MLX summary |
|
|
|
|
|
Apriel-1.5-15B-Thinker is a multimodal reasoning VLM built via **depth upscaling**, **two-stage multimodal continual pretraining**, and **SFT with explicit reasoning traces** (math, coding, science, tool-use). |
|
|
This MLX release converts the upstream checkpoint with **3-bit** quantization for smaller memory and quick startup on macOS. |
|
|
|
|
|
--- |
|
|
|
|
|
## 📦 Contents |
|
|
|
|
|
- `config.json` (MLX config for Pixtral-style VLM) |
|
|
- `mlx_model*.safetensors` (3-bit shards) |
|
|
- `tokenizer.json`, `tokenizer_config.json` |
|
|
- `processor_config.json` / `image_processor.json` |
|
|
- `model_index.json` and metadata |
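
To sanity-check a local download against this card, you can peek at the quantization block in `config.json`. A minimal sketch, assuming the usual MLX convention of a top-level `quantization` entry (key names may differ slightly between converter versions):

```python
import json

# Inspect the quantization settings recorded by the MLX converter.
# Run from the snapshot directory that contains config.json.
with open("config.json") as f:
    config = json.load(f)

quant = config.get("quantization", {})
print("model_type :", config.get("model_type"))
print("bits       :", quant.get("bits"))        # expected: 3 for this repo
print("group_size :", quant.get("group_size"))
```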
|
|
|
|
|
--- |
|
|
|
|
|
## 🚀 Quickstart (CLI) |
|
|
|
|
|
**Single image caption** |
|
|
```bash
|
|
python -m mlx_vlm.generate \ |
|
|
--model <this-repo-id> \ |
|
|
--image /path/to/image.jpg \ |
|
|
--prompt "Describe this image in two concise sentences." \ |
|
|
--max-tokens 128 --temperature 0.0 --device mps --seed 0 |
|
|
``` |
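
If you prefer the Python API to the CLI, the sketch below is a rough equivalent of the command above. It assumes a recent `mlx-vlm` release exposing `load`, `load_config`, `apply_chat_template`, and `generate`; argument names and ordering have shifted between versions, so verify against your installed release.

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_id = "<this-repo-id>"  # same placeholder as the CLI example
model, processor = load(model_id)
config = load_config(model_id)

prompt = "Describe this image in two concise sentences."
formatted = apply_chat_template(processor, config, prompt, num_images=1)

# Greedy decoding (temperature 0.0) keeps answers short and repeatable.
output = generate(model, processor, formatted, ["/path/to/image.jpg"],
                  max_tokens=128, temperature=0.0, verbose=False)
print(output)
```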
|
|
|
|
|
## 🔀 Model Family Comparison (2-bit → 6-bit) |
|
|
|
|
|
> **TL;DR:** Start with **3-bit** for the best size↔quality trade-off. If you need finer OCR/diagram detail and have RAM, step up to **4-bit/5-bit**. Use **6-bit** only when you have headroom and you explicitly instruct concision. |
|
|
|
|
|
### 📊 Quick Comparison |
|
|
|
|
|
| Variant | 🧠 Peak RAM\* | ⚡ Speed (rel.) | 🗣️ Output Style (typical) | ✅ Best For | ⚠️ Watch Out For | |
|
|
|---|---:|:---:|---|---|---| |
|
|
| **2-bit** | ~7–8 GB | 🔥🔥🔥🔥 | Shortest, most lossy | Minimal RAM demos, quick triage | Detail loss on OCR/dense charts; more omissions | |
|
|
| **3-bit** | **~9–10 GB** | **🔥🔥🔥🔥** | **Direct, concise** | Default on M1/M2/M3; day-to-day use | May miss tiny text; keep prompts precise | |
|
|
| **4-bit** | ~11–12.5 GB | 🔥🔥🔥 | More detail retained | Docs/UIs with small text; charts | Slightly slower; still quantization artifacts | |
|
|
| **5-bit** | ~13–14 GB | 🔥🔥☆ | Higher fidelity | Heavier document/diagram tasks | Needs more RAM; occasional verbose answers | |
|
|
| **6-bit** | ~14.5–16 GB | 🔥🔥 | Highest MLX fidelity | Max quality under quant | Can “think aloud”; add *be concise* instruction | |
|
|
|
|
|
\*Indicative for a ~15B VLM under MLX; exact numbers vary with device, image size, and context length. |
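
As a rough sanity check on the Peak RAM column, a back-of-envelope sketch: the quantized weights alone come to roughly `params × bits / 8` bytes, and the vision tower, per-group quantization scales, KV cache, and image features add several more GB on top.

```python
# Back-of-envelope weight footprint for a ~15B-parameter model.
# Actual peak RAM is higher: vision tower, quantization scales/biases,
# KV cache, and image embeddings all add overhead (hence ~9-10 GB at 3-bit).
params = 15e9

for bits in (2, 3, 4, 5, 6):
    weight_gb = params * bits / 8 / 1024**3
    print(f"{bits}-bit: ~{weight_gb:.1f} GB of weights + runtime overhead")
```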
|
|
|
|
|
--- |
|
|
|
|
|
### 🧪 Example (COCO `000000039769.jpg` — “two cats on a pink couch”) |
|
|
|
|
|
| Variant | ⏱️ Prompt TPS | ⏱️ Gen TPS | 📈 Peak RAM | 📝 Notes | |
|
|
|---|---:|---:|---:|---| |
|
|
| **3-bit** | ~79 tok/s | **~9.79 tok/s** | **~9.57 GB** | Direct answer; minimal “reasoning” leakage | |
|
|
| **6-bit** | ~78 tok/s | ~6.50 tok/s | ~14.81 GB | Sometimes prints “Here are my reasoning steps…” | |
|
|
|
|
|
> Settings: `--temperature 0.0 --max-tokens 100 --device mps`. Results vary by Mac model and image resolution, but the trend is consistent. |
|
|
|
|
|
--- |
|
|
|
|
|
### 🧭 Choosing the Right Precision |
|
|
|
|
|
- **I just want it to work on my Mac:** 👉 **3-bit** |
|
|
- **Tiny fonts / invoices / UI text matter:** 👉 **4-bit**, then **5-bit** if RAM allows |
|
|
- **I need every drop of quality and have ≥16 GB free:** 👉 **6-bit** (add *“Answer directly; do not include reasoning.”*) |
|
|
- **I have very little RAM:** 👉 **2-bit** (expect noticeable quality loss) |
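
If you want to automate that choice, here is a small sketch that maps the rule of thumb above onto available memory (assumes the third-party `psutil` package; the thresholds mirror the indicative figures in the comparison table and are not hard limits):

```python
import psutil  # third-party: pip install psutil

def suggest_variant(needs_small_text: bool = False) -> str:
    """Pick a bit-width from currently available RAM and the small-text flag."""
    free_gb = psutil.virtual_memory().available / 1024**3
    if free_gb >= 16:
        return "6-bit"
    if needs_small_text and free_gb >= 13:
        return "5-bit"
    if needs_small_text and free_gb >= 11:
        return "4-bit"
    if free_gb >= 9:
        return "3-bit"
    return "2-bit"

print(suggest_variant(needs_small_text=True))
```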
|
|
|
|
|
--- |
|
|
|
|
|
### ⚙️ Suggested Settings (per variant) |
|
|
|
|
|
| Variant | Max Tokens | Temp | Seed | Notes | |
|
|
|---|---:|---:|---:|---| |
|
|
| **2-bit** | 64–96 | 0.0 | 0 | Keep short; single image; expect omissions | |
|
|
| **3-bit** | 96–128 | 0.0 | 0 | Great default; concise prompts help | |
|
|
| **4-bit** | 128–192 | 0.0–0.2 | 0 | Better small-text recall; watch RAM | |
|
|
| **5-bit** | 128–256 | 0.0–0.2 | 0 | Best OCR fidelity short of 6-bit | |
|
|
| **6-bit** | 128–256 | 0.0 | 0 | Add anti-CoT phrasing (see below) | |
|
|
|
|
|
**Anti-CoT prompt add-on (any bit-width):** |
|
|
> *“Answer directly. Do **not** include your reasoning steps.”* |
|
|
|
|
|
(Optional) Add a stop string if your stack supports it (e.g., stop at `"\nHere are my reasoning steps:"`). |
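
If your runtime has no stop-string support, the same effect can be approximated after generation. A minimal sketch, assuming the reasoning preamble looks like the marker quoted above:

```python
def trim_reasoning(text: str, marker: str = "\nHere are my reasoning steps:") -> str:
    """Cut the output at the reasoning marker, keeping only the direct answer."""
    idx = text.find(marker)
    return text[:idx].rstrip() if idx != -1 else text

print(trim_reasoning("Two cats on a pink couch.\nHere are my reasoning steps: ..."))
```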
|
|
|
|
|
--- |
|
|
|
|
|
### 🛠️ One-liners (swap model IDs) |
|
|
|
|
|
```bash |
|
|
# 2-bit |
|
|
python -m mlx_vlm.generate --model <2bit-repo> --image img.jpg --prompt "Describe this image." \ |
|
|
--max-tokens 96 --temperature 0.0 --device mps --seed 0 |
|
|
|
|
|
# 3-bit (recommended default) |
|
|
python -m mlx_vlm.generate --model <3bit-repo> --image img.jpg --prompt "Describe this image in two sentences." \ |
|
|
--max-tokens 128 --temperature 0.0 --device mps --seed 0 |
|
|
|
|
|
# 4-bit |
|
|
python -m mlx_vlm.generate --model <4bit-repo> --image img.jpg --prompt "Summarize the document and read key totals." \ |
|
|
--max-tokens 160 --temperature 0.1 --device mps --seed 0 |
|
|
|
|
|
# 5-bit |
|
|
python -m mlx_vlm.generate --model <5bit-repo> --image img.jpg --prompt "Extract the fields (date, total, vendor) from this invoice." \ |
|
|
--max-tokens 192 --temperature 0.1 --device mps --seed 0 |
|
|
|
|
|
# 6-bit |
|
|
python -m mlx_vlm.generate --model <6bit-repo> --image img.jpg \ |
|
|
--prompt "Answer directly. Do not include your reasoning steps.\n\nDescribe this image clearly." \ |
|
|
--max-tokens 192 --temperature 0.0 --device mps --seed 0 |