Qwen3 Split Core ML Deployment

This is a Qwen3-1.7B-shaped decoder (28 layers, GQA 16/8 heads, tied embedding, RoPE 1e6, vocab 151,936) re-authored as Core ML MLPrograms split into IO + left/mid/right shards + an on-device sampler. It is not a vanilla Transformers checkpoint; the value is in the ANE-friendly deployment stack: multi-MLProgram state topology, reversed KV ring buffers, mixed quantization (OmniQuant embedding/head + GS128 LUT projections), safe RMSNorm, and a sampler MLProgram with penalty/temperature/Möbius/noise.

Why this exists (industry context)

  • Datacenter stacks (Google TPU Ironwood, NVIDIA Hopper/Blackwell) chase higher throughput via ever-lower precision (FP8/FP4/INTx, palettization, sparsity) while preserving semantic quality for chat/code/tooling.
  • Apple’s ANE is already low-precision and tightly constrained, but Core ML exposes a generic graph executor—no TPU-style kernels or dtypes—so the model itself must absorb the hardware constraints.
  • This repo shows the on-device analogue of that trend: GS128 LUT projections + OmniQuant embeddings + SLANC scales + split graphs + strict KV topology to make a 1.7B Qwen-class decoder behave “datacenter-precise” on phone-class hardware without new ops or dtypes.

What’s here

  • llm_io.mil, llm_left.mil, llm_mid.mil, llm_right.mil: MIL graphs for each shard.
  • scripts/: builders for each shard + sampler, SLANC scale generation, Python runner.
  • PAPER.md: system design and rationale.
  • LICENSE: Apache 2.0 license for this repository’s code and build scripts.
  • models/: placeholder for compiled .mlpackage outputs. Currently contains upstream Qwen3-1.7B references for local quantization only—include them on HF only if you comply with upstream license and attribution.
  • weights/: placeholder for quantized weight artifacts.

What you must supply (not bundled)

  • Compiled MLPrograms (llm_io/left/mid/right/sampler.mlpackage), optionally zipped for HF upload. This repo only contains MIL and build scripts.
  • Tokenizer compatible with vocab 151,936 (e.g., from Qwen/Qwen3-1.7B) and its chat template (see Tokenizer & chat template below).
  • Upstream config/weights (config.json, model.safetensors) for the exact Qwen3 variant. These remain under the upstream license (e.g., Apache 2.0 for Qwen/Qwen3-1.7B); follow that license if you redistribute them.
  • Quantized packs: OmniQuant embedding/LM head outputs and GS128 palettized projection packs (metadata_coreml_pg.json[.gz] + LUT/indices) plus slanc_scales.npy.

Prereqs

  • Python 3.10+ on macOS with Core ML toolchain.
  • Packages: coremltools (iOS 18 target), torch, transformers, tokenizers, safetensors, numpy.
  • A device or simulator that can load iOS 18+ MLProgram models if you intend to run on-device.

Getting started (env + quick smoke test)

python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install coremltools torch transformers tokenizers safetensors numpy

# build (or download) mlpackages, then run a short prompt
python scripts/inference.py \
  --models-dir models/mlpackage \
  --tokenizer models/Qwen3-1.7B/tokenizer.json \
  --prompt "Hello" \
  --max-new-tokens 8 \
  --prefill-progress 4

Differences vs vanilla Qwen3

  • Multi-MLProgram partitioning with per-shard KV states to stay within ANE provisioning limits.
  • Reversed ring-buffer KV layout + scatter-free mask blending for updates; static causal masks (see the sketch after this list).
  • Mixed quantization regimes: OmniQuant blockwise for embedding/LM head (weight tying kept), GS128 grouped LUT for Q/K/V/O and MLP with per-group scalars; SLANC pre-scales + safe RMSNorm for fp16 stability.
  • Attention expressed as explicit per-head Core ML ops (GQA mapping via integer arithmetic) instead of fused kernels.
  • Sampler is its own MLProgram (top-k=1, repetition penalty, temperature, Levy noise, Möbius modulation) to keep the loop on-device.
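
The mask-blended KV update can be pictured with a few lines of NumPy. Tensor shapes and the exact write-index convention below are illustrative assumptions; in the shipped shards the same arithmetic is expressed as static Core ML multiply/add ops over each shard's KV state rather than a scatter.

import numpy as np

S, H, D = 8, 2, 4                      # context slots, KV heads, head dim (toy sizes)
k_cache = np.zeros((H, S, D), np.float16)

def write_kv(k_cache, k_new, step):
    """Blend k_new into the cache with a one-hot mask instead of a scatter op."""
    slot = S - 1 - (step % S)          # reversed ring buffer: newest entry walks backward
    mask = np.zeros((1, S, 1), np.float16)
    mask[:, slot, :] = 1.0
    # k_cache * (1 - mask) keeps every other slot; k_new * mask writes the new entry
    return k_cache * (1.0 - mask) + k_new[:, None, :] * mask

for step in range(3):
    k_new = np.random.randn(H, D).astype(np.float16)
    k_cache = write_kv(k_cache, k_new, step)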

Build order

  1. SLANC scales (can point to the same LUT dir used later):
python scripts/slanc_lut.py \
  --config ./config.json \
  --weights-glob "./model.safetensors" \
  --lut-dir ./qwen_mixed \
  --output slanc_scales.npy
  2. IO model (embed/logits, conditional):
python scripts/build_model_split_io.py \
  --config ./config.json \
  --omniq-dir ./out_ft_int4 \
  --output models/mlpackage/llm_io.mlpackage
  3. Decoder shards (hidden size/seq length pulled from config.json; override with --seq-len if needed):
python scripts/build_model_split_left.py  --config ./config.json --weights ./model.safetensors --lut-dir ./qwen_mixed --scales slanc_scales.npy --output models/mlpackage/llm_left.mlpackage
python scripts/build_model_split_mid.py   --config ./config.json --weights ./model.safetensors --lut-dir ./qwen_mixed --scales slanc_scales.npy --output models/mlpackage/llm_mid.mlpackage
python scripts/build_model_split_right.py --config ./config.json --weights ./model.safetensors --lut-dir ./qwen_mixed --scales slanc_scales.npy --output models/mlpackage/llm_right.mlpackage
  4. Sampler (top-k=1, fp16 temp/penalty/möbius):
python scripts/build_sampler.py  # writes llm_sampler.mlpackage
  5. Smoke test (requires tokenizer JSON and built packages):
python scripts/inference.py \
  --models-dir models/mlpackage \
  --tokenizer path/to/tokenizer.json \
  --prompt "Hello" \
  --max-new-tokens 16 \
  --prefill-progress 10 \
  --stats

Tokenizer & chat template

  • Tokenizer: use the upstream tokenizer.json from Qwen/Qwen3-1.7B (151,936 vocab). Point scripts/inference.py to that path or bundle a copy at tokenizer.json in your HF repo.
  • Chat template (ChatML-style) expected by the runner:
    <|im_start|>system
    {system}
    <|im_end|>
    <|im_start|>user
    {user}
    <|im_end|>
    <|im_start|>assistant
    
    Provide {system} as a short instruction (e.g., “You are a helpful assistant.”) and {user} as the prompt; the sketch below shows one way to assemble and encode it.
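
A minimal sketch of assembling and encoding that template with the tokenizers package; the exact encode options used by scripts/inference.py are an assumption here.

from tokenizers import Tokenizer

tok = Tokenizer.from_file("models/Qwen3-1.7B/tokenizer.json")  # upstream tokenizer path

system = "You are a helpful assistant."
user = "Hello"
prompt = (
    f"<|im_start|>system\n{system}\n<|im_end|>\n"
    f"<|im_start|>user\n{user}\n<|im_end|>\n"
    f"<|im_start|>assistant\n"
)

ids = tok.encode(prompt).ids   # token IDs handed to the prefill loop
print(len(ids), ids[:8])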

HF usage snippet (download + inference)

from huggingface_hub import hf_hub_download
from pathlib import Path

# repo_id for your HF repo (public or gated, provided you comply with upstream licenses).
repo_id = "your-org/qwen3-coreml"
models_dir = Path("./models_cache")
models_dir.mkdir(exist_ok=True)

for name in ["llm_io", "llm_left", "llm_mid", "llm_right", "llm_sampler"]:
    # if you upload zipped bundles, set filename=f"{name}.mlpackage.zip" and unzip here
    hf_hub_download(repo_id=repo_id, filename=f"{name}.mlpackage", repo_type="model", local_dir=models_dir)

tokenizer_path = hf_hub_download(repo_id=repo_id, filename="tokenizer.json", repo_type="model")

# Run the provided runner (prefers ANE for blocks and GPU for IO/sampler)
import subprocess
subprocess.run([
    "python", "scripts/inference.py",
    "--models-dir", str(models_dir),
    "--tokenizer", tokenizer_path,
    "--prompt", "Hello",
    "--max-new-tokens", "16",
    "--prefill-progress", "10",
])

Notes

  • llm_io has a mode input (0 = embedding path, non-zero = logits) so the runner avoids redundant matmul work (see the sketch after this list).
  • Builder scripts accept --config/--weights/--lut-dir/--scales/--output (and --seq-len for context length) instead of fixed paths; defaults match the original layout.
  • Sampler inputs are fp16 for penalty/temp/mobius_strength to match the MLProgram input spec.
  • Batch size is fixed at 1; context window is set by the builders (--seq-len), and shard ranges are fixed to layers 0–10 / 11–19 / 20–27.
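
For reference, a hypothetical snippet for exercising the llm_io mode switch from Python with coremltools. The input/output names, shapes, and dtypes below are placeholders, not the actual spec; read the real ones from the compiled package before wiring up a host loop.

import coremltools as ct
import numpy as np

io_model = ct.models.MLModel("models/mlpackage/llm_io.mlpackage")
print(io_model.get_spec().description)   # shows the actual input/output names and dtypes

# Feed every declared input on each call; `mode` decides which path does real work.
feed = {
    "token_ids": np.array([[9707]], dtype=np.int32),            # placeholder token id
    "hidden_states": np.zeros((1, 1, 2048), dtype=np.float16),  # hidden size assumed
}
embeddings = io_model.predict({**feed, "mode": np.array([0], dtype=np.int32)})  # embedding path
logits = io_model.predict({**feed, "mode": np.array([1], dtype=np.int32)})      # logits path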

Hugging Face hosting checklist

  • Repo may be public under the Apache 2.0 license for this code. Respect upstream licenses for any redistributed weights/tokenizers (e.g., Qwen is Apache 2.0) and carry over their LICENSE/NOTICE when publishing.
  • Upload the compiled .mlpackage artifacts (or zipped bundles) for llm_io, llm_left_pal, llm_mid_pal, llm_right_pal, and llm_sampler plus tokenizer.json. Include upstream config.json/model.safetensors only if allowed by their license and your policy; bundle upstream LICENSE/NOTICE alongside them. See the upload sketch after this checklist.
  • Keep this README.md front matter; HF will render it as the model card (with license: apache-2.0).
  • Add .gitattributes rules (see repository root) so .mlpackage/**, *.mlpackage.zip, *.zip, *.safetensors, *.bin, *.idx*, *.lut_scalar, *.npy stay on LFS/Xet.
  • Provide download instructions (huggingface-cli download or hf_hub_download) and your chosen chat template/tokenizer path.
  • The left/mid/right shards below are palettized (KV mask compression) to stay within ≈320 MiB per shard; IO and sampler remain unchanged.
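
One way to push the artifacts with huggingface_hub, assuming the compiled packages and tokenizer.json sit in a single local folder; the repo id and folder layout below are placeholders.

from huggingface_hub import HfApi

api = HfApi()
api.create_repo("your-org/qwen3-coreml", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="models/mlpackage",   # llm_io / llm_*_pal / llm_sampler packages + tokenizer.json
    repo_id="your-org/qwen3-coreml",
    repo_type="model",
    allow_patterns=["*.mlpackage/**", "*.mlpackage.zip", "tokenizer.json"],
)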

Download (palettized shards)

from huggingface_hub import hf_hub_download
from pathlib import Path

repo_id = "your-org/qwen3-coreml-pal"
dst = Path("models_pal")
dst.mkdir(exist_ok=True)

for name in ["llm_io", "llm_left_pal", "llm_mid_pal", "llm_right_pal", "llm_sampler"]:
    hf_hub_download(repo_id=repo_id, filename=f"{name}.mlpackage", repo_type="model", local_dir=dst)

tokenizer_path = hf_hub_download(repo_id=repo_id, filename="tokenizer.json", repo_type="model", local_dir=dst)

The provided runner auto-resolves _pal suffixes, so you can point --models-dir at the download directory.

File manifest (current build)

| Artifact | Path in repo | Notes | SHA256 | Size (bytes / approx) |
|---|---|---|---|---|
| IO shard | llm_io.mlpackage | Embedding + logits | f3cd5f11d032418cb8a41f04fd827947d304029a93ede8458cbc1ed2d39900dc | 194,791,328 (≈186 MiB) |
| Left shard (palettized) | llm_left_pal.mlpackage | Layers 0–10, mask palettized | 7fa4015560fdf28c3d77263214873ac0586aee5d2efa8c012cdcabb28c98b0b7 | 330,017,715 (≈315 MiB) |
| Mid shard (palettized) | llm_mid_pal.mlpackage | Layers 11–19, mask palettized | 31aa8b9e16bf71a611d3460eeecef842bc15db014c384dfbba4290412c917bdc | 338,366,286 (≈323 MiB) |
| Right shard (palettized) | llm_right_pal.mlpackage | Layers 20–27 + final norm, mask palettized | a8c0657a8ddad4bad4ceae86c583bb5af45ea71f77398a20fae7fd7a3c3afa8c | 334,256,382 (≈319 MiB) |
| Sampler | llm_sampler.mlpackage | Temperature/penalty/Möbius/Levy noise | 64f303bf0d1cd45f8ce573836cae5b566955090b661343d579f60a9e2e4ced2e | 671,585 (≈0.64 MiB) |
| Tokenizer | tokenizer.json | From Qwen/Qwen3-1.7B | use upstream file | — |

If you rebuild or zip the packages, recompute the hashes/sizes and update the table; keep zipped artifacts on LFS as well.
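
A small helper for refreshing that table, assuming single-file artifacts such as zipped bundles (for a raw .mlpackage directory, zip it first or hash its member files individually); the artifact name below is a placeholder.

import hashlib
from pathlib import Path

def manifest_entry(path: str) -> tuple[str, int]:
    """Return (sha256 hex digest, size in bytes) for one artifact file."""
    p = Path(path)
    h = hashlib.sha256()
    with p.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest(), p.stat().st_size

digest, size = manifest_entry("llm_io.mlpackage.zip")   # placeholder artifact name
print(digest, f"{size:,} bytes (~{size / 2**20:.0f} MiB)")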

Evaluation / results

  • Pending: log perplexity on a small WikiText subset and latency on A17/Ultra at seq 128/512. Record the numbers here and in model_index.json for HF rendering.

Licensing

The repository’s code and docs are under the Apache License 2.0 (see LICENSE).

Upstream assets (weights, tokenizer, config) remain under their own licenses (e.g., Apache 2.0 for Qwen/Qwen3). If you redistribute them—on HF or elsewhere—include the upstream LICENSE/NOTICE and follow any attribution/terms specified there.
