---
license: mit
language:
- en
tags:
- mlx
- apple-silicon
- multimodal
- vision-language
- pixtral
- llava
- quantized
- 3bit
- 4bit
- 5bit
- 6bit
pipeline_tag: image-text-to-text
library_name: mlx
base_model:
- ServiceNow-AI/Apriel-1.5-15b-Thinker
---

# Apriel-1.5-15B-Thinker — **MLX 3-bit** (Apple Silicon)

**Format:** MLX (Mac, Apple Silicon)
**Quantization:** **3-bit** (balanced footprint ↔ quality)
**Base:** ServiceNow-AI/Apriel-1.5-15B-Thinker
**Architecture:** Pixtral-style LLaVA (vision encoder → 2-layer projector → decoder)

This repository provides a **3-bit MLX** build of Apriel-1.5-15B-Thinker for **on-device** multimodal inference on Apple Silicon. In side-by-side tests, the **3-bit** variant often:

- uses **significantly less RAM** than 6-bit,
- decodes **faster**, and
- tends to produce **more direct answers** (less “thinking out loud”) at low temperature.

If RAM allows, we also suggest trying the **4-bit/5-bit/6-bit** variants (guidance below) for tasks that demand more fidelity.

> Explore other Apriel MLX variants under the `mlx-community` namespace on the Hub.

---

## 🔎 Upstream → MLX summary

Apriel-1.5-15B-Thinker is a multimodal reasoning VLM built via **depth upscaling**, **two-stage multimodal continual pretraining**, and **SFT with explicit reasoning traces** (math, coding, science, tool use). This MLX release converts the upstream checkpoint with **3-bit** quantization for a smaller memory footprint and quick startup on macOS.

---

## 📦 Contents

- `config.json` (MLX config for the Pixtral-style VLM)
- `mlx_model*.safetensors` (3-bit shards)
- `tokenizer.json`, `tokenizer_config.json`
- `processor_config.json` / `image_processor.json`
- `model_index.json` and metadata

---

## 🚀 Quickstart (CLI)

**Single image caption**

```bash
python -m mlx_vlm.generate \
  --model <3bit-repo> \
  --image /path/to/image.jpg \
  --prompt "Describe this image in two concise sentences." \
  --max-tokens 128 --temperature 0.0 --device mps --seed 0
```

## 🔀 Model Family Comparison (2-bit → 6-bit)

> **TL;DR:** Start with **3-bit** for the best size↔quality trade-off. If you need finer OCR/diagram detail and have RAM, step up to **4-bit/5-bit**. Use **6-bit** only when you have headroom and you explicitly instruct concision.

### 📊 Quick Comparison

| Variant | 🧠 Peak RAM\* | ⚡ Speed (rel.) | 🗣️ Output Style (typical) | ✅ Best For | ⚠️ Watch Out For |
|---|---:|:---:|---|---|---|
| **2-bit** | ~7–8 GB | 🔥🔥🔥🔥 | Shortest, most lossy | Minimal-RAM demos, quick triage | Detail loss on OCR/dense charts; more omissions |
| **3-bit** | **~9–10 GB** | **🔥🔥🔥🔥** | **Direct, concise** | Default on M1/M2/M3; day-to-day use | May miss tiny text; keep prompts precise |
| **4-bit** | ~11–12.5 GB | 🔥🔥🔥 | More detail retained | Docs/UIs with small text; charts | Slightly slower; still quantization artifacts |
| **5-bit** | ~13–14 GB | 🔥🔥☆ | Higher fidelity | Heavier document/diagram tasks | Needs more RAM; occasional verbose answers |
| **6-bit** | ~14.5–16 GB | 🔥🔥 | Highest MLX fidelity | Max quality under quantization | Can “think aloud”; add a *be concise* instruction |

\*Indicative for a ~15B VLM under MLX; exact numbers vary with device, image size, and context length.
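To sanity-check throughput on your own Mac (as in the example below), the same generation can also be driven from Python instead of the CLI. The sketch below is illustrative only: it assumes the `mlx-vlm` Python helpers (`load`, `load_config`, `apply_chat_template`, `generate`), whose signatures and return types have shifted between releases, and `<3bit-repo>` is a placeholder for the actual repo ID.

```python
# Minimal sketch, assuming `pip install mlx-vlm`.
# NOTE: load/apply_chat_template/generate details vary across mlx-vlm releases;
# treat this as a starting point, not a pinned API.
import time

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

MODEL_ID = "<3bit-repo>"          # placeholder: substitute the MLX repo ID you downloaded
IMAGES = ["/path/to/image.jpg"]   # one local image path (or URL)
PROMPT = "Describe this image in two concise sentences."

model, processor = load(MODEL_ID)
config = load_config(MODEL_ID)

# Wrap the user prompt in the model's chat template, reserving one image slot.
formatted = apply_chat_template(processor, config, PROMPT, num_images=len(IMAGES))

start = time.perf_counter()
# Sampling options (e.g., temperature) can also be passed as keyword arguments,
# but their names depend on the mlx-vlm version installed.
output = generate(model, processor, formatted, IMAGES, max_tokens=128)
elapsed = time.perf_counter() - start

# Older versions return a plain string; newer ones return a result object with `.text`.
text = output if isinstance(output, str) else getattr(output, "text", str(output))
print(f"[{elapsed:.1f}s] {text}")
```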
---

### 🧪 Example (COCO `000000039769.jpg` — “two cats on a pink couch”)

| Variant | ⏱️ Prompt TPS | ⏱️ Gen TPS | 📈 Peak RAM | 📝 Notes |
|---|---:|---:|---:|---|
| **3-bit** | ~79 tok/s | **~9.79 tok/s** | **~9.57 GB** | Direct answer; minimal “reasoning” leakage |
| **6-bit** | ~78 tok/s | ~6.50 tok/s | ~14.81 GB | Sometimes prints “Here are my reasoning steps…” |

> Settings: `--temperature 0.0 --max-tokens 100 --device mps`. Results vary by Mac model and image resolution; the trend is consistent.

---

### 🧭 Choosing the Right Precision

- **I just want it to work on my Mac:** 👉 **3-bit**
- **Tiny fonts / invoices / UI text matter:** 👉 **4-bit**, then **5-bit** if RAM allows
- **I need every drop of quality and have ≥16 GB free:** 👉 **6-bit** (add *“Answer directly; do not include reasoning.”*)
- **I have very little RAM:** 👉 **2-bit** (expect noticeable quality loss)

---

### ⚙️ Suggested Settings (per variant)

| Variant | Max Tokens | Temp | Seed | Notes |
|---|---:|---:|---:|---|
| **2-bit** | 64–96 | 0.0 | 0 | Keep it short; single image; expect omissions |
| **3-bit** | 96–128 | 0.0 | 0 | Great default; concise prompts help |
| **4-bit** | 128–192 | 0.0–0.2 | 0 | Better small-text recall; watch RAM |
| **5-bit** | 128–256 | 0.0–0.2 | 0 | Highest OCR fidelity below 6-bit |
| **6-bit** | 128–256 | 0.0 | 0 | Add anti-CoT phrasing (see below) |

**Anti-CoT prompt add-on (any bit-width):**

> *“Answer directly. Do **not** include your reasoning steps.”*

(Optional) Add a stop string if your stack supports it (e.g., stop at `"\nHere are my reasoning steps:"`); a post-processing fallback is sketched after the one-liners below.

---

### 🛠️ One-liners (swap model IDs)

```bash
# 2-bit
python -m mlx_vlm.generate --model <2bit-repo> --image img.jpg --prompt "Describe this image." \
  --max-tokens 96 --temperature 0.0 --device mps --seed 0

# 3-bit (recommended default)
python -m mlx_vlm.generate --model <3bit-repo> --image img.jpg --prompt "Describe this image in two sentences." \
  --max-tokens 128 --temperature 0.0 --device mps --seed 0

# 4-bit
python -m mlx_vlm.generate --model <4bit-repo> --image img.jpg --prompt "Summarize the document and read key totals." \
  --max-tokens 160 --temperature 0.1 --device mps --seed 0

# 5-bit
python -m mlx_vlm.generate --model <5bit-repo> --image img.jpg --prompt "Extract the fields (date, total, vendor) from this invoice." \
  --max-tokens 192 --temperature 0.1 --device mps --seed 0

# 6-bit
python -m mlx_vlm.generate --model <6bit-repo> --image img.jpg \
  --prompt "Answer directly. Do not include your reasoning steps.\n\nDescribe this image clearly." \
  --max-tokens 192 --temperature 0.0 --device mps --seed 0
```
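---

### 🧹 Stripping leaked reasoning (optional)

If your stack has no native stop-string support, a small post-processing step can approximate it by truncating generated text at the first reasoning marker. This is a plain-Python sketch; the marker strings are examples based on the leakage noted above and should be adapted to whatever preamble your runs actually produce.

```python
# Hypothetical helper: cut off leaked chain-of-thought at known marker strings.
REASONING_MARKERS = (
    "\nHere are my reasoning steps:",
    "Here are my reasoning steps:",
)

def strip_reasoning(text: str, markers: tuple[str, ...] = REASONING_MARKERS) -> str:
    """Return text truncated at the earliest reasoning marker, if any is present."""
    cut = len(text)
    for marker in markers:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].rstrip()

print(strip_reasoning("Two cats rest on a pink couch.\nHere are my reasoning steps: ..."))
# -> "Two cats rest on a pink couch."
```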