---
license: mit
language:
- en
tags:
- mlx
- apple-silicon
- multimodal
- vision-language
- pixtral
- llava
- quantized
- 3bit
- 4bit
- 5bit
- 6bit
pipeline_tag: image-text-to-text
library_name: mlx
base_model:
- ServiceNow-AI/Apriel-1.5-15b-Thinker
---
# Apriel-1.5-15B-Thinker — **MLX 3-bit** (Apple Silicon)
**Format:** MLX (Mac, Apple Silicon)
**Quantization:** **3-bit** (balanced footprint ↔ quality)
**Base:** ServiceNow-AI/Apriel-1.5-15B-Thinker
**Architecture:** Pixtral-style LLaVA (vision encoder → 2-layer projector → decoder)
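For orientation, images pass through the vision encoder, a 2-layer MLP projector lifts the patch features into the decoder's embedding space, and the language decoder consumes the combined sequence. Below is a minimal, illustrative MLX sketch of the projector step only; the class name and dimensions are hypothetical and are not read from this repo's `config.json`.

```python
import mlx.core as mx
import mlx.nn as nn

# Toy 2-layer projector: vision-encoder features -> decoder embedding space.
# Dimensions are placeholders, not the actual Apriel/Pixtral values.
class Projector(nn.Module):
    def __init__(self, vision_dim: int = 1024, text_dim: int = 4096):
        super().__init__()
        self.fc1 = nn.Linear(vision_dim, text_dim)
        self.fc2 = nn.Linear(text_dim, text_dim)

    def __call__(self, x: mx.array) -> mx.array:
        return self.fc2(nn.gelu(self.fc1(x)))

# (batch, num_patches, vision_dim) -> (batch, num_patches, text_dim)
patch_features = mx.random.normal((1, 256, 1024))
print(Projector()(patch_features).shape)
```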
This repository provides a **3-bit MLX** build of Apriel-1.5-15B-Thinker for **on-device** multimodal inference on Apple Silicon. In side-by-side tests, the **3-bit** variant often:
- uses **significantly less RAM** than 6-bit,
- decodes **faster**, and
- tends to produce **more direct answers** (less “thinking out loud”) at low temperature.
If RAM allows, we also suggest trying **4-bit/5-bit/6-bit** variants (guidance below) for tasks that demand more fidelity.
> Explore other Apriel MLX variants under the `mlx-community` namespace on the Hub.
---
## 🔎 Upstream → MLX summary
Apriel-1.5-15B-Thinker is a multimodal reasoning VLM built via **depth upscaling**, **two-stage multimodal continual pretraining**, and **SFT with explicit reasoning traces** (math, coding, science, tool-use).
This MLX release converts the upstream checkpoint to **3-bit** quantization for a smaller memory footprint and faster startup on macOS.
---
## 📦 Contents
- `config.json` (MLX config for Pixtral-style VLM)
- `mlx_model*.safetensors` (3-bit shards)
- `tokenizer.json`, `tokenizer_config.json`
- `processor_config.json` / `image_processor.json`
- `model_index.json` and metadata
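To pre-fetch all of the above (e.g., for offline use), `huggingface_hub` works as usual; the repo id below is a placeholder:

```python
from huggingface_hub import snapshot_download

# Downloads config, tokenizer/processor files, and the 3-bit safetensors shards
# into the local Hugging Face cache; returns the local directory path.
local_dir = snapshot_download(repo_id="<this-repo-id>")
print(local_dir)
```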
---
## 🚀 Quickstart (CLI)
**Single image caption**
```bash
python -m mlx_vlm.generate \
--model <this-repo-id> \
--image /path/to/image.jpg \
--prompt "Describe this image in two concise sentences." \
--max-tokens 128 --temperature 0.0 --device mps --seed 0
```
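**Same thing from Python** (a sketch against the `mlx-vlm` Python API; the helper names and the exact `generate` signature have shifted between releases, so adjust to your installed version):

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "<this-repo-id>"          # placeholder repo id
model, processor = load(model_path)
config = load_config(model_path)

images = ["/path/to/image.jpg"]
prompt = apply_chat_template(
    processor, config,
    "Describe this image in two concise sentences.",
    num_images=len(images),
)

output = generate(
    model, processor, prompt, images,
    max_tokens=128, temperature=0.0, verbose=False,
)
print(output)
```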
## 🔀 Model Family Comparison (2-bit → 6-bit)
> **TL;DR:** Start with **3-bit** for the best size↔quality trade-off. If you need finer OCR/diagram detail and have RAM, step up to **4-bit/5-bit**. Use **6-bit** only when you have headroom and you explicitly instruct concision.
### 📊 Quick Comparison
| Variant | 🧠 Peak RAM\* | ⚡ Speed (rel.) | 🗣️ Output Style (typical) | ✅ Best For | ⚠️ Watch Out For |
|---|---:|:---:|---|---|---|
| **2-bit** | ~7–8 GB | 🔥🔥🔥🔥 | Shortest, most lossy | Minimal RAM demos, quick triage | Detail loss on OCR/dense charts; more omissions |
| **3-bit** | **~9–10 GB** | **🔥🔥🔥🔥** | **Direct, concise** | Default on M1/M2/M3; day-to-day use | May miss tiny text; keep prompts precise |
| **4-bit** | ~11–12.5 GB | 🔥🔥🔥 | More detail retained | Docs/UIs with small text; charts | Slightly slower; still quantization artifacts |
| **5-bit** | ~13–14 GB | 🔥🔥☆ | Higher fidelity | Heavier document/diagram tasks | Needs more RAM; occasional verbose answers |
| **6-bit** | ~14.5–16 GB | 🔥🔥 | Highest MLX fidelity | Max quality under quant | Can “think aloud”; add *be concise* instruction |
\*Indicative for a ~15B VLM under MLX; exact numbers vary with device, image size, and context length.
---
### 🧪 Example (COCO `000000039769.jpg` — “two cats on a pink couch”)
| Variant | ⏱️ Prompt TPS | ⏱️ Gen TPS | 📈 Peak RAM | 📝 Notes |
|---|---:|---:|---:|---|
| **3-bit** | ~79 tok/s | **~9.79 tok/s** | **~9.57 GB** | Direct answer; minimal “reasoning” leakage |
| **6-bit** | ~78 tok/s | ~6.50 tok/s | ~14.81 GB | Sometimes prints “Here are my reasoning steps…” |
> Settings: `--temperature 0.0 --max-tokens 100 --device mps`. Results vary by Mac model and image resolution; trend is consistent.
---
### 🧭 Choosing the Right Precision
- **I just want it to work on my Mac:** 👉 **3-bit**
- **Tiny fonts / invoices / UI text matter:** 👉 **4-bit**, then **5-bit** if RAM allows
- **I need every drop of quality and have ≥16 GB free:** 👉 **6-bit** (add *“Answer directly; do not include reasoning.”*)
- **I have very little RAM:** 👉 **2-bit** (expect noticeable quality loss)
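The same decision logic as a rough, hypothetical helper; the thresholds mirror the indicative peak-RAM table above plus a little headroom, so tune them for your machine:

```python
def suggest_variant(free_ram_gb: float) -> str:
    """Map free unified memory (GB) to a suggested quantization tier."""
    if free_ram_gb >= 16:
        return "6-bit"
    if free_ram_gb >= 14:
        return "5-bit"
    if free_ram_gb >= 12.5:
        return "4-bit"
    if free_ram_gb >= 10:
        return "3-bit"
    return "2-bit"

print(suggest_variant(12.0))  # -> 3-bit
```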
---
### ⚙️ Suggested Settings (per variant)
| Variant | Max Tokens | Temp | Seed | Notes |
|---|---:|---:|---:|---|
| **2-bit** | 64–96 | 0.0 | 0 | Keep short; single image; expect omissions |
| **3-bit** | 96–128 | 0.0 | 0 | Great default; concise prompts help |
| **4-bit** | 128–192 | 0.0–0.2 | 0 | Better small-text recall; watch RAM |
| **5-bit** | 128–256 | 0.0–0.2 | 0 | Best small-text/OCR recall below the 6-bit tier |
| **6-bit** | 128–256 | 0.0 | 0 | Add anti-CoT phrasing (see below) |
**Anti-CoT prompt add-on (any bit-width):**
> *“Answer directly. Do **not** include your reasoning steps.”*
(Optional) Add a stop string if your stack supports it (e.g., stop at `"\nHere are my reasoning steps:"`).
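If your stack has no stop-string support, a post-hoc trim is just as effective; the marker below mirrors the example string above:

```python
def strip_reasoning(text: str, marker: str = "\nHere are my reasoning steps:") -> str:
    """Drop everything from the reasoning marker onward, if the model emitted it."""
    idx = text.find(marker)
    return text if idx == -1 else text[:idx].rstrip()

print(strip_reasoning("Two cats on a pink couch.\nHere are my reasoning steps: ..."))
# -> Two cats on a pink couch.
```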
---
### 🛠️ One-liners (swap model IDs)
```bash
# 2-bit
python -m mlx_vlm.generate --model <2bit-repo> --image img.jpg --prompt "Describe this image." \
--max-tokens 96 --temperature 0.0 --device mps --seed 0
# 3-bit (recommended default)
python -m mlx_vlm.generate --model <3bit-repo> --image img.jpg --prompt "Describe this image in two sentences." \
--max-tokens 128 --temperature 0.0 --device mps --seed 0
# 4-bit
python -m mlx_vlm.generate --model <4bit-repo> --image img.jpg --prompt "Summarize the document and read key totals." \
--max-tokens 160 --temperature 0.1 --device mps --seed 0
# 5-bit
python -m mlx_vlm.generate --model <5bit-repo> --image img.jpg --prompt "Extract the fields (date, total, vendor) from this invoice." \
--max-tokens 192 --temperature 0.1 --device mps --seed 0
# 6-bit
python -m mlx_vlm.generate --model <6bit-repo> --image img.jpg \
--prompt $'Answer directly. Do not include your reasoning steps.\n\nDescribe this image clearly.' \
--max-tokens 192 --temperature 0.0 --device mps --seed 0