Qwen3 Split Core ML Deployment
This is a Qwen3-1.7B-shaped decoder (28 layers, GQA 16/8 heads, tied embedding, RoPE 1e6, vocab 151,936) re-authored as Core ML MLPrograms split into IO + left/mid/right shards + an on-device sampler. It is not a vanilla Transformers checkpoint; the value is in the ANE-friendly deployment stack: multi-MLProgram state topology, reversed KV ring buffers, mixed quantization (OmniQuant embedding/head + GS128 LUT projections), safe RMSNorm, and a sampler MLProgram with penalty/temperature/Möbius/noise.
Why this exists (industry context)
- Datacenter stacks (Google TPU Ironwood, NVIDIA Hopper/Blackwell) chase higher throughput via ever-lower precision (FP8/FP4/INTx, palettization, sparsity) while preserving semantic quality for chat/code/tooling.
- Apple’s ANE is already low-precision and tightly constrained, but Core ML exposes a generic graph executor—no TPU-style kernels or dtypes—so the model itself must absorb the hardware constraints.
- This repo shows the on-device analogue of that trend: GS128 LUT projections + OmniQuant embeddings + SLANC scales + split graphs + strict KV topology to make a 1.7B Qwen-class decoder behave “datacenter-precise” on phone-class hardware without new ops or dtypes.
What’s here
- `llm_io.mil`, `llm_left.mil`, `llm_mid.mil`, `llm_right.mil`: MIL graphs for each shard.
- `scripts/`: builders for each shard + sampler, SLANC scale generation, Python runner.
- `PAPER.md`: system design and rationale.
- `LICENSE`: Apache 2.0 license for this repository's code and build scripts.
- `models/`: placeholder for compiled `.mlpackage` outputs. Currently contains upstream `Qwen3-1.7B` references for local quantization only—include them on HF only if you comply with the upstream license and attribution.
- `weights/`: placeholder for quantized weight artifacts.
What you must supply (not bundled)
- Compiled MLPrograms (`llm_io/left/mid/right/sampler.mlpackage`), optionally zipped for HF upload. This repo contains only MIL and build scripts.
- A tokenizer compatible with vocab 151,936 (e.g., from `Qwen/Qwen3-1.7B`) and its chat template (see Tokenizer & chat template below).
- Upstream config/weights (`config.json`, `model.safetensors`) for the exact Qwen3 variant. These remain under the upstream license (e.g., Apache 2.0 for `Qwen/Qwen3-1.7B`); follow that license if you redistribute them.
- Quantized packs: OmniQuant embedding/LM-head outputs and GS128 palettized projection packs (`metadata_coreml_pg.json[.gz]` + LUT/indices) plus `slanc_scales.npy`. A quick pre-flight check follows this list.
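A minimal pre-flight sketch against the default paths used by the build commands below (`out_ft_int4` for OmniQuant packs, `qwen_mixed` for the GS128 LUT dir are the README defaults, not fixed requirements); adjust to your own layout:

```python
from pathlib import Path

# Defaults from the build commands in this README; not fixed requirements.
# slanc_scales.npy is produced by the SLANC step, so it may not exist yet.
artifacts = ["config.json", "model.safetensors", "out_ft_int4", "qwen_mixed",
             "slanc_scales.npy"]
missing = [p for p in artifacts if not Path(p).exists()]
print("missing:", missing or "none")
```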
Prereqs
- Python 3.10+ on macOS with the Core ML toolchain.
- Packages: `coremltools` (iOS 18 target), `torch`, `transformers`, `tokenizers`, `safetensors`, `numpy`.
- A device or simulator that can load iOS 18+ MLPrograms if you intend to run on-device.
Getting started (env + quick smoke test)
```bash
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install coremltools torch transformers tokenizers safetensors numpy

# build (or download) mlpackages, then run a short prompt
python scripts/inference.py \
  --models-dir models/mlpackage \
  --tokenizer models/Qwen3-1.7B/tokenizer.json \
  --prompt "Hello" \
  --max-new-tokens 8 \
  --prefill-progress 4
```
Differences vs vanilla Qwen3
- Multi-MLProgram partitioning with per-shard KV states to stay within ANE provisioning limits.
- Reversed ring-buffer KV layout + scatter-free mask blending for updates; static causal masks (see the sketch after this list).
- Mixed quantization regimes: OmniQuant blockwise for embedding/LM head (weight tying kept), GS128 grouped LUT for Q/K/V/O and MLP with per-group scalars; SLANC pre-scales + safe RMSNorm for fp16 stability.
- Attention expressed as explicit per-head Core ML ops (GQA mapping via integer arithmetic) instead of fused kernels.
- Sampler is its own MLProgram (top-k=1, repetition penalty, temperature, Lévy noise, Möbius modulation) to keep the loop on-device.
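A toy numpy sketch of the ring-buffer and GQA points above, under one plausible reading of the design (shapes and function names here are illustrative; the actual MIL graphs may lay things out differently):

```python
import numpy as np

N_Q_HEADS, N_KV_HEADS, SEQ, DIM = 16, 8, 512, 128  # Qwen3-1.7B-shaped GQA

def ring_push(cache: np.ndarray, new_kv: np.ndarray) -> np.ndarray:
    """Scatter-free KV update: keep the buffer 'reversed' so the newest entry
    is always at index 0. Each step is one concat plus one static slice — no
    dynamic write index, which is the ANE-hostile part of a conventional cache."""
    # cache: (N_KV_HEADS, SEQ, DIM), new_kv: (N_KV_HEADS, 1, DIM)
    return np.concatenate([new_kv, cache[:, :-1, :]], axis=1)

def kv_head_for(q_head: int) -> int:
    """GQA mapping via integer arithmetic: 16 query heads share 8 KV heads."""
    return q_head // (N_Q_HEADS // N_KV_HEADS)

cache = np.zeros((N_KV_HEADS, SEQ, DIM), dtype=np.float16)
cache = ring_push(cache, np.ones((N_KV_HEADS, 1, DIM), dtype=np.float16))
assert cache[:, 0].all() and not cache[:, 1].any()  # newest slot is index 0
assert [kv_head_for(q) for q in (0, 1, 14, 15)] == [0, 0, 7, 7]
```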
Build order
- SLANC scales (can point to the same LUT dir used later):
```bash
python scripts/slanc_lut.py \
  --config ./config.json \
  --weights-glob "./model.safetensors" \
  --lut-dir ./qwen_mixed \
  --output slanc_scales.npy
```
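Optionally, a quick sanity check on the generated scales before the shard builds (a sketch assuming the builder writes a plain float array; the exact shape is builder-defined):

```python
import numpy as np

scales = np.load("slanc_scales.npy")
print(scales.shape, scales.dtype)  # layout is builder-defined
assert np.isfinite(scales).all() and (scales != 0).all()
```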
- IO model (embed/logits, conditional):
```bash
python scripts/build_model_split_io.py \
  --config ./config.json \
  --omniq-dir ./out_ft_int4 \
  --output models/mlpackage/llm_io.mlpackage
```
- Decoder shards (hidden size/seq length pulled from `config.json`; override with `--seq-len` if needed):
```bash
python scripts/build_model_split_left.py  --config ./config.json --weights ./model.safetensors --lut-dir ./qwen_mixed --scales slanc_scales.npy --output models/mlpackage/llm_left.mlpackage
python scripts/build_model_split_mid.py   --config ./config.json --weights ./model.safetensors --lut-dir ./qwen_mixed --scales slanc_scales.npy --output models/mlpackage/llm_mid.mlpackage
python scripts/build_model_split_right.py --config ./config.json --weights ./model.safetensors --lut-dir ./qwen_mixed --scales slanc_scales.npy --output models/mlpackage/llm_right.mlpackage
```
- Sampler (top-k=1, fp16 temp/penalty/Möbius):

```bash
python scripts/build_sampler.py   # writes llm_sampler.mlpackage
```
- Smoke test (requires tokenizer JSON and built packages):
```bash
python scripts/inference.py \
  --models-dir models/mlpackage \
  --tokenizer path/to/tokenizer.json \
  --prompt "Hello" \
  --max-new-tokens 16 \
  --prefill-progress 10 \
  --stats
```
Tokenizer & chat template
- Tokenizer: use the upstream `tokenizer.json` from `Qwen/Qwen3-1.7B` (151,936 vocab). Point `scripts/inference.py` to that path, or bundle a copy as `tokenizer.json` in your HF repo.
- Chat template (ChatML-style) expected by the runner:

```
<|im_start|>system
{system}<|im_end|>
<|im_start|>user
{user}<|im_end|>
<|im_start|>assistant
```

Provide `system` as a short instruction ("You are a helpful assistant.") and `user` as the prompt.
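A minimal helper mirroring that template (the exact newline placement is an assumption on my part; compare against the upstream tokenizer config if decoding looks off):

```python
def format_chatml(system: str, user: str) -> str:
    # ChatML-style prompt string matching the template above.
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt = format_chatml("You are a helpful assistant.", "Hello")
```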
HF usage snippet (download + inference)
```python
import subprocess
from pathlib import Path

from huggingface_hub import hf_hub_download

# repo_id for your HF repo (public or gated, provided you comply with upstream licenses).
repo_id = "your-org/qwen3-coreml"
models_dir = Path("./models_cache")
models_dir.mkdir(exist_ok=True)
for name in ["llm_io", "llm_left", "llm_mid", "llm_right", "llm_sampler"]:
    # if you upload zipped bundles, set filename=f"{name}.mlpackage.zip" and unzip here
    hf_hub_download(repo_id=repo_id, filename=f"{name}.mlpackage",
                    repo_type="model", local_dir=models_dir)
tokenizer_path = hf_hub_download(repo_id=repo_id, filename="tokenizer.json", repo_type="model")

# Run the provided runner (prefers ANE for blocks and GPU for IO/sampler)
subprocess.run([
    "python", "scripts/inference.py",
    "--models-dir", str(models_dir),
    "--tokenizer", tokenizer_path,
    "--prompt", "Hello",
    "--max-new-tokens", "16",
    "--prefill-progress", "10",
])
```
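If you host zipped bundles instead, as the in-code comment suggests, a variant that downloads and extracts them (the `{name}.mlpackage.zip` filenames are a hypothetical convention, not something this repo fixes):

```python
import zipfile
from pathlib import Path

from huggingface_hub import hf_hub_download

repo_id = "your-org/qwen3-coreml"   # your HF repo
models_dir = Path("./models_cache")
models_dir.mkdir(exist_ok=True)
for name in ["llm_io", "llm_left", "llm_mid", "llm_right", "llm_sampler"]:
    zpath = hf_hub_download(repo_id=repo_id, filename=f"{name}.mlpackage.zip",
                            repo_type="model")
    with zipfile.ZipFile(zpath) as zf:
        zf.extractall(models_dir)   # should yield {name}.mlpackage inside models_dir
```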
Notes
llm_iohas amodeinput (0 = embedding path, non-zero = logits) so the runner avoids redundant matmul work.- Builder scripts accept
--config/--weights/--lut-dir/--scales/--output(and--seq-lenfor context length) instead of fixed paths; defaults match the original layout. - Sampler inputs are fp16 for
penalty/temp/mobius_strengthto match the MLProgram input spec. - Batch size is fixed at 1; context window is set by the builders (
--seq-len), and shard ranges are fixed to layers 0–10 / 11–19 / 20–27.
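To confirm input names and dtypes (e.g., the `mode` flag or the fp16 sampler inputs) against your own build, a read-only spec check with coremltools — no prediction involved, and input names may vary by build:

```python
import coremltools as ct

# skip_model_load avoids compiling the package just to read its interface
mlmodel = ct.models.MLModel("models/mlpackage/llm_io.mlpackage", skip_model_load=True)
for inp in mlmodel.get_spec().description.input:
    print(inp.name, inp.type.WhichOneof("Type"))
```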
Hugging Face hosting checklist
- Repo may be public under the Apache 2.0 license for this code. Respect upstream licenses for any redistributed weights/tokenizers (e.g., Qwen is Apache 2.0) and carry over their LICENSE/NOTICE when publishing.
- Upload the compiled
.mlpackageartifacts (or zipped bundles) forllm_io,llm_left_pal,llm_mid_pal,llm_right_pal, andllm_samplerplustokenizer.json. Include upstreamconfig.json/model.safetensorsonly if allowed by their license and your policy; bundle upstreamLICENSE/NOTICEalongside them. - Keep this
README.mdfront matter; HF will render it as the model card (withlicense: apache-2.0). - Add
.gitattributesrules (see repository root) so.mlpackage/**,*.mlpackage.zip,*.zip,*.safetensors,*.bin,*.idx*,*.lut_scalar,*.npystay on LFS/Xet. - Provide download instructions (
huggingface-cli downloadorhf_hub_download) and your chosen chat template/tokenizer path. - The left/mid/right shards below are palettized (KV mask compression) to stay within ≈320 MiB per shard; IO and sampler remain unchanged.
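For reference, LFS rules matching the patterns above might look like this in `.gitattributes` (the copy at the repository root is authoritative):

```
*.mlpackage/**  filter=lfs diff=lfs merge=lfs -text
*.mlpackage.zip filter=lfs diff=lfs merge=lfs -text
*.zip           filter=lfs diff=lfs merge=lfs -text
*.safetensors   filter=lfs diff=lfs merge=lfs -text
*.bin           filter=lfs diff=lfs merge=lfs -text
*.idx*          filter=lfs diff=lfs merge=lfs -text
*.lut_scalar    filter=lfs diff=lfs merge=lfs -text
*.npy           filter=lfs diff=lfs merge=lfs -text
```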
Download (palettized shards)
```python
from pathlib import Path

from huggingface_hub import hf_hub_download

repo_id = "your-org/qwen3-coreml-pal"
dst = Path("models_pal")
dst.mkdir(exist_ok=True)
for name in ["llm_io", "llm_left_pal", "llm_mid_pal", "llm_right_pal", "llm_sampler"]:
    hf_hub_download(repo_id=repo_id, filename=f"{name}.mlpackage", repo_type="model", local_dir=dst)
tokenizer_path = hf_hub_download(repo_id=repo_id, filename="tokenizer.json", repo_type="model", local_dir=dst)
```
The provided runner auto-resolves `_pal` suffixes, so you can point `--models-dir` at the download directory.
File manifest (current build)
| Artifact | Path in repo | Notes | SHA256 | Size (bytes / approx) |
|---|---|---|---|---|
| IO shard | `llm_io.mlpackage` | Embedding + logits | f3cd5f11d032418cb8a41f04fd827947d304029a93ede8458cbc1ed2d39900dc | 194,791,328 (≈186 MiB) |
| Left shard (palettized) | `llm_left_pal.mlpackage` | Layers 0–10, mask palettized | 7fa4015560fdf28c3d77263214873ac0586aee5d2efa8c012cdcabb28c98b0b7 | 330,017,715 (≈315 MiB) |
| Mid shard (palettized) | `llm_mid_pal.mlpackage` | Layers 11–19, mask palettized | 31aa8b9e16bf71a611d3460eeecef842bc15db014c384dfbba4290412c917bdc | 338,366,286 (≈323 MiB) |
| Right shard (palettized) | `llm_right_pal.mlpackage` | Layers 20–27 + final norm, mask palettized | a8c0657a8ddad4bad4ceae86c583bb5af45ea71f77398a20fae7fd7a3c3afa8c | 334,256,382 (≈319 MiB) |
| Sampler | `llm_sampler.mlpackage` | Temperature/penalty/Möbius/Lévy noise | 64f303bf0d1cd45f8ce573836cae5b566955090b661343d579f60a9e2e4ced2e | 671,585 (≈0.64 MiB) |
| Tokenizer | `tokenizer.json` | From `Qwen/Qwen3-1.7B` | use upstream file | — |
If you rebuild or zip the packages, recompute the hashes/sizes and update the table; keep zipped artifacts on LFS as well.
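A sketch for recomputing those columns; note that `.mlpackage` is a directory bundle, so hash the zipped artifact (or define a stable per-file convention) rather than the bundle path itself — the filename below is illustrative:

```python
import hashlib
from pathlib import Path

def sha256_and_size(path: Path) -> tuple[str, int]:
    """SHA256 and byte size for one artifact file (e.g., a zipped bundle)."""
    h, size = hashlib.sha256(), 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
            size += len(chunk)
    return h.hexdigest(), size

digest, nbytes = sha256_and_size(Path("llm_io.mlpackage.zip"))
print(digest, f"{nbytes:,} bytes")
```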
Evaluation / results
- Pending: log perplexity on a small WikiText subset and latency on A17/Ultra at seq 128/512. Record the numbers here and in `model_index.json` for HF rendering.
Licensing
The repository’s code and docs are under the Apache License 2.0 (see LICENSE).
Upstream assets (weights, tokenizer, config) remain under their own licenses (e.g., Apache 2.0 for Qwen/Qwen3). If you redistribute them—on HF or elsewhere—include the upstream LICENSE/NOTICE and follow any attribution/terms specified there.