Qwen3 Split Core ML Deployment

This is a Qwen3-1.7B-shaped decoder (28 layers, GQA 16/8 heads, tied embedding, RoPE 1e6, vocab 151,936) re-authored as Core ML MLPrograms split into IO + left/mid/right shards + an on-device sampler. It is not a vanilla Transformers checkpoint; the value is in the ANE-friendly deployment stack: multi-MLProgram state topology, reversed KV ring buffers, mixed quantization (OmniQuant embedding/head + GS128 LUT projections), safe RMSNorm, and a sampler MLProgram with penalty/temperature/Möbius/noise.

Why this exists (industry context)

  • Datacenter stacks (Google TPU Ironwood, NVIDIA Hopper/Blackwell) chase higher throughput via ever-lower precision (FP8/FP4/INTx, palettization, sparsity) while preserving semantic quality for chat/code/tooling.
  • Apple’s ANE is already low-precision and tightly constrained, but Core ML exposes a generic graph executor—no TPU-style kernels or dtypes—so the model itself must absorb the hardware constraints.
  • This repo shows the on-device analogue of that trend: GS128 LUT projections + OmniQuant embeddings + SLANC scales + split graphs + strict KV topology to make a 1.7B Qwen-class decoder behave “datacenter-precise” on phone-class hardware without new ops or dtypes.

What’s here

  • llm_io.mil, llm_left.mil, llm_mid.mil, llm_right.mil: MIL graphs for each shard.
  • scripts/: builders for each shard + sampler, SLANC scale generation, Python runner.
  • PAPER.md: system design and rationale.
  • LICENSE: Apache 2.0 license for this repository’s code and build scripts.
  • models/: placeholder for compiled .mlpackage outputs. Currently contains upstream Qwen3-1.7B references for local quantization only—include them on HF only if you comply with upstream license and attribution.
  • weights/: placeholder for quantized weight artifacts.

What you must supply (not bundled)

  • Compiled MLPrograms (llm_io/left/mid/right/sampler.mlpackage), optionally zipped for HF upload. This repo only contains MIL and build scripts.
  • Tokenizer compatible with vocab 151,936 (e.g., from Qwen/Qwen3-1.7B) and its chat template (see Tokenizer & chat template below).
  • Upstream config/weights (config.json, model.safetensors) for the exact Qwen3 variant. These remain under the upstream license (e.g., Apache 2.0 for Qwen/Qwen3-1.7B); follow that license if you redistribute them.
  • Quantized packs: OmniQuant embedding/LM head outputs and GS128 palettized projection packs (metadata_coreml_pg.json[.gz] + LUT/indices) plus slanc_scales.npy.

Prereqs

  • Python 3.10+ on macOS with Core ML toolchain.
  • Packages: coremltools (iOS 18 target), torch, transformers, tokenizers, safetensors, numpy.
  • A device or simulator that can load iOS 18+ MLProgram models if you intend to run on-device.

Getting started (env + quick smoke test)

python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install coremltools torch transformers tokenizers safetensors numpy

# build (or download) mlpackages, then run a short prompt
python scripts/inference.py \
  --models-dir models/mlpackage \
  --tokenizer models/Qwen3-1.7B/tokenizer.json \
  --prompt "Hello" \
  --max-new-tokens 8 \
  --prefill-progress 4

Differences vs vanilla Qwen3

  • Multi-MLProgram partitioning with per-shard KV states to stay within ANE provisioning limits.
  • Reversed ring-buffer KV layout + scatter-free mask blending for updates; static causal masks (see the sketch after this list).
  • Mixed quantization regimes: OmniQuant blockwise for embedding/LM head (weight tying kept), GS128 grouped LUT for Q/K/V/O and MLP with per-group scalars; SLANC pre-scales + safe RMSNorm for fp16 stability.
  • Attention expressed as explicit per-head Core ML ops (GQA mapping via integer arithmetic) instead of fused kernels.
  • Sampler is its own MLProgram (top-k=1, repetition penalty, temperature, Levy noise, Möbius modulation) to keep the loop on-device.
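
The mask-blended KV update can be pictured with a few lines of NumPy. Tensor shapes and the exact write-index convention below are illustrative assumptions; in the shipped shards the same arithmetic is expressed as static Core ML multiply/add ops over each shard's KV state rather than a scatter.

import numpy as np

S, H, D = 8, 2, 4                      # context slots, KV heads, head dim (toy sizes)
k_cache = np.zeros((H, S, D), np.float16)

def write_kv(k_cache, k_new, step):
    """Blend k_new into the cache with a one-hot mask instead of a scatter op."""
    slot = S - 1 - (step % S)          # reversed ring buffer: newest entry walks backward
    mask = np.zeros((1, S, 1), np.float16)
    mask[:, slot, :] = 1.0
    # k_cache * (1 - mask) keeps every other slot; k_new * mask writes the new entry
    return k_cache * (1.0 - mask) + k_new[:, None, :] * mask

for step in range(3):
    k_new = np.random.randn(H, D).astype(np.float16)
    k_cache = write_kv(k_cache, k_new, step)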

Build order

  1. SLANC scales (can point to the same LUT dir used later):
python scripts/slanc_lut.py \
  --config ./config.json \
  --weights-glob "./model.safetensors" \
  --lut-dir ./qwen_mixed \
  --output slanc_scales.npy
  2. IO model (embed/logits, conditional):
python scripts/build_model_split_io.py \
  --config ./config.json \
  --omniq-dir ./out_ft_int4 \
  --output models/mlpackage/llm_io.mlpackage
  3. Decoder shards (hidden size/seq length pulled from config.json; override with --seq-len if needed):
python scripts/build_model_split_left.py  --config ./config.json --weights ./model.safetensors --lut-dir ./qwen_mixed --scales slanc_scales.npy --output models/mlpackage/llm_left.mlpackage
python scripts/build_model_split_mid.py   --config ./config.json --weights ./model.safetensors --lut-dir ./qwen_mixed --scales slanc_scales.npy --output models/mlpackage/llm_mid.mlpackage
python scripts/build_model_split_right.py --config ./config.json --weights ./model.safetensors --lut-dir ./qwen_mixed --scales slanc_scales.npy --output models/mlpackage/llm_right.mlpackage
  4. Sampler (top-k=1, fp16 temp/penalty/möbius):
python scripts/build_sampler.py  # writes llm_sampler.mlpackage
  5. Smoke test (requires tokenizer JSON and built packages):
python scripts/inference.py \
  --models-dir models/mlpackage \
  --tokenizer path/to/tokenizer.json \
  --prompt "Hello" \
  --max-new-tokens 16 \
  --prefill-progress 10 \
  --stats

Tokenizer & chat template

  • Tokenizer: use the upstream tokenizer.json from Qwen/Qwen3-1.7B (151,936 vocab). Point scripts/inference.py to that path or bundle a copy at tokenizer.json in your HF repo.
  • Chat template (ChatML-style) expected by the runner:
    <|im_start|>system
    {system}
    <|im_end|>
    <|im_start|>user
    {user}
    <|im_end|>
    <|im_start|>assistant
    
    Provide {system} as a short instruction (e.g., “You are a helpful assistant.”) and {user} as the prompt; the sketch below shows one way to assemble and encode it.
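
A minimal sketch of assembling and encoding that template with the tokenizers package; the exact encode options used by scripts/inference.py are an assumption here.

from tokenizers import Tokenizer

tok = Tokenizer.from_file("models/Qwen3-1.7B/tokenizer.json")  # upstream tokenizer path

system = "You are a helpful assistant."
user = "Hello"
prompt = (
    f"<|im_start|>system\n{system}\n<|im_end|>\n"
    f"<|im_start|>user\n{user}\n<|im_end|>\n"
    f"<|im_start|>assistant\n"
)

ids = tok.encode(prompt).ids   # token IDs handed to the prefill loop
print(len(ids), ids[:8])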

HF usage snippet (download + inference)

from huggingface_hub import hf_hub_download
from pathlib import Path

# repo_id for your HF repo (public or gated, provided you comply with upstream licenses).
repo_id = "your-org/qwen3-coreml"
models_dir = Path("./models_cache")
models_dir.mkdir(exist_ok=True)

for name in ["llm_io", "llm_left", "llm_mid", "llm_right", "llm_sampler"]:
    # if you upload zipped bundles, set filename=f"{name}.mlpackage.zip" and unzip here
    hf_hub_download(repo_id=repo_id, filename=f"{name}.mlpackage", repo_type="model", local_dir=models_dir)

tokenizer_path = hf_hub_download(repo_id=repo_id, filename="tokenizer.json", repo_type="model")

# Run the provided runner (prefers ANE for blocks and GPU for IO/sampler)
import subprocess
subprocess.run([
    "python", "scripts/inference.py",
    "--models-dir", str(models_dir),
    "--tokenizer", tokenizer_path,
    "--prompt", "Hello",
    "--max-new-tokens", "16",
    "--prefill-progress", "10",
])

Notes

  • llm_io has a mode input (0 = embedding path, non-zero = logits) so the runner avoids redundant matmul work (see the sketch after this list).
  • Builder scripts accept --config/--weights/--lut-dir/--scales/--output (and --seq-len for context length) instead of fixed paths; defaults match the original layout.
  • Sampler inputs are fp16 for penalty/temp/mobius_strength to match the MLProgram input spec.
  • Batch size is fixed at 1; context window is set by the builders (--seq-len), and shard ranges are fixed to layers 0–10 / 11–19 / 20–27.
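
For reference, a hypothetical snippet for exercising the llm_io mode switch from Python with coremltools. The input/output names, shapes, and dtypes below are placeholders, not the actual spec; read the real ones from the compiled package before wiring up a host loop.

import coremltools as ct
import numpy as np

io_model = ct.models.MLModel("models/mlpackage/llm_io.mlpackage")
print(io_model.get_spec().description)   # shows the actual input/output names and dtypes

# Feed every declared input on each call; `mode` decides which path does real work.
feed = {
    "token_ids": np.array([[9707]], dtype=np.int32),            # placeholder token id
    "hidden_states": np.zeros((1, 1, 2048), dtype=np.float16),  # hidden size assumed
}
embeddings = io_model.predict({**feed, "mode": np.array([0], dtype=np.int32)})  # embedding path
logits = io_model.predict({**feed, "mode": np.array([1], dtype=np.int32)})      # logits path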

Hugging Face hosting checklist

  • Repo may be public under the Apache 2.0 license for this code. Respect upstream licenses for any redistributed weights/tokenizers (e.g., Qwen is Apache 2.0) and carry over their LICENSE/NOTICE when publishing.
  • Upload the compiled .mlpackage artifacts (or zipped bundles) for llm_io, llm_left_pal, llm_mid_pal, llm_right_pal, and llm_sampler plus tokenizer.json. Include upstream config.json/model.safetensors only if allowed by their license and your policy; bundle upstream LICENSE/NOTICE alongside them. See the upload sketch after this checklist.
  • Keep this README.md front matter; HF will render it as the model card (with license: apache-2.0).
  • Add .gitattributes rules (see repository root) so .mlpackage/**, *.mlpackage.zip, *.zip, *.safetensors, *.bin, *.idx*, *.lut_scalar, *.npy stay on LFS/Xet.
  • Provide download instructions (huggingface-cli download or hf_hub_download) and your chosen chat template/tokenizer path.
  • The left/mid/right shards below are palettized (KV mask compression) to stay within ≈320 MiB per shard; IO and sampler remain unchanged.
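
One way to push the artifacts with huggingface_hub, assuming the compiled packages and tokenizer.json sit in a single local folder; the repo id and folder layout below are placeholders.

from huggingface_hub import HfApi

api = HfApi()
api.create_repo("your-org/qwen3-coreml", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="models/mlpackage",   # llm_io / llm_*_pal / llm_sampler packages + tokenizer.json
    repo_id="your-org/qwen3-coreml",
    repo_type="model",
    allow_patterns=["*.mlpackage/**", "*.mlpackage.zip", "tokenizer.json"],
)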

Download (palettized shards)

from huggingface_hub import hf_hub_download
from pathlib import Path

repo_id = "your-org/qwen3-coreml-pal"
dst = Path("models_pal")
dst.mkdir(exist_ok=True)

for name in ["llm_io", "llm_left_pal", "llm_mid_pal", "llm_right_pal", "llm_sampler"]:
    hf_hub_download(repo_id=repo_id, filename=f"{name}.mlpackage", repo_type="model", local_dir=dst)

tokenizer_path = hf_hub_download(repo_id=repo_id, filename="tokenizer.json", repo_type="model", local_dir=dst)

The provided runner auto-resolves _pal suffixes, so you can point --models-dir at the download directory.

File manifest (current build)

| Artifact | Path in repo | Notes | SHA256 | Size (bytes / approx) |
|---|---|---|---|---|
| IO shard | llm_io.mlpackage | Embedding + logits | f3cd5f11d032418cb8a41f04fd827947d304029a93ede8458cbc1ed2d39900dc | 194,791,328 (≈186 MiB) |
| Left shard (palettized) | llm_left_pal.mlpackage | Layers 0–10, mask palettized | 7fa4015560fdf28c3d77263214873ac0586aee5d2efa8c012cdcabb28c98b0b7 | 330,017,715 (≈315 MiB) |
| Mid shard (palettized) | llm_mid_pal.mlpackage | Layers 11–19, mask palettized | 31aa8b9e16bf71a611d3460eeecef842bc15db014c384dfbba4290412c917bdc | 338,366,286 (≈323 MiB) |
| Right shard (palettized) | llm_right_pal.mlpackage | Layers 20–27 + final norm, mask palettized | a8c0657a8ddad4bad4ceae86c583bb5af45ea71f77398a20fae7fd7a3c3afa8c | 334,256,382 (≈319 MiB) |
| Sampler | llm_sampler.mlpackage | Temperature/penalty/Möbius/Levy noise | 64f303bf0d1cd45f8ce573836cae5b566955090b661343d579f60a9e2e4ced2e | 671,585 (≈0.64 MiB) |
| Tokenizer | tokenizer.json | From Qwen/Qwen3-1.7B | use upstream file | — |

If you rebuild or zip the packages, recompute the hashes/sizes and update the table; keep zipped artifacts on LFS as well.
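
A small helper for refreshing that table, assuming single-file artifacts such as zipped bundles (for a raw .mlpackage directory, zip it first or hash its member files individually); the artifact name below is a placeholder.

import hashlib
from pathlib import Path

def manifest_entry(path: str) -> tuple[str, int]:
    """Return (sha256 hex digest, size in bytes) for one artifact file."""
    p = Path(path)
    h = hashlib.sha256()
    with p.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest(), p.stat().st_size

digest, size = manifest_entry("llm_io.mlpackage.zip")   # placeholder artifact name
print(digest, f"{size:,} bytes (~{size / 2**20:.0f} MiB)")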

Evaluation / results

  • Pending: log perplexity on a small WikiText subset and latency on A17/Ultra at seq 128/512. Record the numbers here and in model_index.json for HF rendering.

Licensing

The repository’s code and docs are under the Apache License 2.0 (see LICENSE).

Upstream assets (weights, tokenizer, config) remain under their own licenses (e.g., Apache 2.0 for Qwen/Qwen3). If you redistribute them—on HF or elsewhere—include the upstream LICENSE/NOTICE and follow any attribution/terms specified there.
