# Omni-ASR CTC CoreML Models
CoreML-optimized versions of Meta's Omni-ASR CTC models for on-device speech recognition on Apple platforms (iOS 17+, macOS 14+).
These models run entirely on-device using Apple's Neural Engine (ANE), with no cloud dependency.
## Available Models
| Model | Parameters | Precision | Size | Recommended |
|---|---|---|---|---|
| OmniASR_CTC_300M_int8 | 300M | INT8 | 312 MB | Yes |
| OmniASR_CTC_300M_fp16 | 300M | FP16 | 621 MB | |
| OmniASR_CTC_1B_int8 | 1B | INT8 | 933 MB | |
| OmniASR_CTC_1B_fp16 | 1B | FP16 | 1.8 GB | |
The 300M INT8 variant offers the best trade-off between accuracy and latency for real-time use on iPhone.
## Architecture
- Backbone: wav2vec2 Conformer encoder (fairseq2)
- Head: CTC (Connectionist Temporal Classification)
- Feature extractor: Convolutional, stride 320 (20ms per frame at 16kHz)
- Vocabulary: 9,813 multilingual SentencePiece tokens (shared across all variants)
- Training: Dynamic Chunk Training with ~10% full-context passes
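Given the 320-sample stride of the feature extractor, the time dimension of the output follows directly from the input length. A quick sanity check in Python (the helper name is mine, not from the repository):

```python
SAMPLE_RATE = 16_000   # Hz, the model's expected input rate
STRIDE = 320           # samples per output frame (20 ms at 16 kHz)

def num_frames(num_samples: int) -> int:
    """Time dimension of the logits for a given number of input samples."""
    return num_samples // STRIDE

# 10 s of audio -> 500 frames; each frame covers 20 ms
frame_ms = STRIDE / SAMPLE_RATE * 1000
```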
## Input / Output
| | Description |
|---|---|
| Input | `audio`: Float16 MultiArray `[1, T]` → raw 16 kHz mono audio samples |
| Output | `logits`: Float16 MultiArray `[1, T/320, 9813]` → CTC log-probabilities |
Supported input lengths (enumerated shapes):
- `[1, 160000]` → 10 seconds
- `[1, 320000]` → 20 seconds
- `[1, 640000]` → 40 seconds
Shorter audio is zero-padded to the nearest shape; the CTC decoder trims to actual length.
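The padding step can be sketched as follows; this is an illustrative helper under the enumerated shapes above, not code from the repository:

```python
ENUM_LENGTHS = [160_000, 320_000, 640_000]  # 10 s / 20 s / 40 s at 16 kHz

def pad_to_enumerated(samples: list[float]) -> tuple[list[float], int]:
    """Zero-pad to the smallest enumerated length that fits.

    Returns the padded buffer plus the original length, which the
    CTC decoder uses to trim trailing padding frames.
    """
    n = len(samples)
    for target in ENUM_LENGTHS:
        if n <= target:
            return samples + [0.0] * (target - n), n
    raise ValueError("clip exceeds 40 s; split it before inference")
```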
## Performance (iPhone 15 Pro, ANE)
| Model | 4s audio | 20s audio | 40s audio |
|---|---|---|---|
| 300M INT8 | ~100 ms | ~500 ms | ~1.2 s |
| 1B INT8 | ~300 ms | ~1.5 s | ~3.5 s |
## Usage
### Download a model
```bash
pip install huggingface_hub

# Download 300M INT8 (recommended)
huggingface-cli download ChipCracker/omni-asr-coreml \
  OmniASR_CTC_300M_int8.mlmodelc --local-dir ./models
```
### Load in Swift
```swift
import CoreML

let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

let model = try await MLModel.load(
    contentsOf: modelURL,
    configuration: config
)
```
### Decode with greedy CTC
After `model.prediction(from: features)` returns the logits:

1. Argmax over the vocabulary dimension
2. Remove consecutive duplicates
3. Remove the blank token (index 0)
4. Map indices to vocabulary tokens
5. Join and replace the SentencePiece boundary marker (`▁`, U+2581) with a space
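The steps above can be sketched in Python as a minimal reference implementation; the function name and toy vocabulary are mine, and only the blank index (0) and the `▁` boundary marker come from this document:

```python
BLANK_ID = 0  # blank token index, per step 3

def greedy_ctc_decode(logits: list[list[float]], vocab: list[str]) -> str:
    # 1. Argmax over the vocabulary dimension, one token id per frame
    ids = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    # 2. Remove consecutive duplicates
    collapsed = [t for i, t in enumerate(ids) if i == 0 or t != ids[i - 1]]
    # 3. Remove blanks, then 4. map indices to vocabulary tokens
    tokens = [vocab[t] for t in collapsed if t != BLANK_ID]
    # 5. Join and replace the SentencePiece boundary (U+2581) with a space
    return "".join(tokens).replace("\u2581", " ").strip()
```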
## iOS App
These models are used by the omni-asr iOS app, which provides:
- Live transcription with growing context
- On-demand model download from this repository
- Full offline operation after download
## Export
Models were exported from PyTorch using coremltools 9.0:
```bash
omni-asr-export \
  --model-card omniASR_CTC_300M \
  --output OmniASR_CTC_300M_int8.mlpackage
# INT8 quantization is applied by default
```
INT8 variants use post-training linear symmetric weight quantization, reducing size ~2x with minimal accuracy loss.
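Post-training linear symmetric weight quantization can be illustrated as below. This is a per-tensor sketch of the general technique under stated assumptions, not the coremltools implementation:

```python
def quantize_symmetric_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map weights to int8 via w ~= scale * q, with q in [-127, 127].

    Symmetric: a single scale, no zero-point offset.
    """
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid div-by-zero
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [scale * v for v in q]
```

Storing one int8 per weight instead of one float16 is what yields the roughly 2x size reduction seen in the table above (e.g. 621 MB → 312 MB for the 300M model).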
## File Structure
Each .mlmodelc directory contains:
```
OmniASR_CTC_300M_int8.mlmodelc/
├── coremldata.bin            # Model graph serialization
├── metadata.json             # CoreML metadata
├── model.mil                 # ML Intermediate Language
├── analytics/coremldata.bin
└── weights/weight.bin        # Model weights (largest file)
```
## Citation
```bibtex
@article{pratap2023scaling,
  title={Scaling Speech Technology to 1,000+ Languages},
  author={Pratap, Vineel and others},
  journal={arXiv preprint arXiv:2305.13516},
  year={2023}
}
```
## License
The CoreML conversion and app code are provided under CC-BY-NC-4.0. The original Omni-ASR model weights are subject to Meta's license terms.