Multilingual E5 Small — Core ML Embedder (256/512)
This repository provides a Core ML conversion of the Hugging Face model:
- Upstream model: `intfloat/multilingual-e5-small`
- Purpose: on-device text embeddings for semantic search / retrieval / RAG
- Output: a 384-dim L2-normalized embedding vector
- Deployment target: iOS 16+
This is a community conversion for on-device use. It is not an official release from the upstream author.
What’s inside
- A Core ML `.mlpackage` model that outputs embeddings using:
  - Mean pooling over the last hidden states (masked by `attention_mask`)
  - L2 normalization to unit length
Supported sequence lengths
This model is designed for the common “fast recall + deep rerank” strategy:
- 256 tokens (default, fast / energy-friendly)
- 512 tokens (deep / higher quality for long passages)
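As a rough sketch (not code shipped in this repo), the strategy is: embed everything at 256 tokens for fast recall, then re-score only the top candidates at 512 tokens. The `embed(text, max_len)` callable below is a hypothetical wrapper around the model that returns a unit-length 384-dim vector; the prefixes are explained in the next section.

```python
import numpy as np

def recall_then_rerank(query: str, passages: list[str], embed, top_k: int = 10):
    # Fast recall: score all passages with 256-token embeddings
    # (in practice the passage embeddings would be precomputed and indexed).
    q256 = embed("query: " + query, 256)
    scores = [float(np.dot(q256, embed("passage: " + p, 256))) for p in passages]
    candidates = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)[:top_k]

    # Deep rerank: re-embed only the shortlisted passages at 512 tokens.
    q512 = embed("query: " + query, 512)
    return sorted(
        candidates,
        key=lambda i: float(np.dot(q512, embed("passage: " + passages[i], 512))),
        reverse=True,
    )
```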
Important: E5 prefix rules (do this or quality drops)
E5 retrieval is trained with prefixes:
- Document chunk / indexed text: `passage: ...`
- User query: `query: ...`

Examples:

- `passage: Invoice total is $199.00 due on Dec 17, 2025.`
- `query: What is the total amount due?`
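In code, keeping the prefixes in one place avoids silent quality regressions; a tiny helper (names are illustrative, not part of this repo) might look like:

```python
def as_passage(text: str) -> str:
    # Prefix applied to every indexed document chunk.
    return "passage: " + text.strip()

def as_query(text: str) -> str:
    # Prefix applied to every user query.
    return "query: " + text.strip()

print(as_passage("Invoice total is $199.00 due on Dec 17, 2025."))
print(as_query("What is the total amount due?"))
```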
Inputs / Outputs
Output
- `embedding`: `float32` (or `float16`, depending on Core ML execution), shape `[1, 384]`
- The vector is already L2-normalized, so cosine similarity is just the dot product.
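Because the vectors are unit-length, scoring reduces to a dot product, as in this small sketch (placeholder vectors stand in for real model outputs):

```python
import numpy as np

# Placeholder unit-length (1, 384) vectors standing in for model outputs.
q_emb = np.random.randn(1, 384).astype(np.float32)
q_emb /= np.linalg.norm(q_emb)
p_emb = np.random.randn(1, 384).astype(np.float32)
p_emb /= np.linalg.norm(p_emb)

score = float(np.dot(q_emb[0], p_emb[0]))  # equals cosine similarity, in [-1, 1]
```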
Input format (packed single-input variant)
To support iOS 16 with fixed enumerated shapes (256 / 512) using one flexible input, the model expects:
- `packed`: `int32` multiarray of shape `[1, 2, T]`
  - `T` is 256 or 512
  - `packed[0, 0, :]` = `input_ids`
  - `packed[0, 1, :]` = `attention_mask`
If the `.mlpackage` in this repo uses separate inputs (`input_ids`, `attention_mask`), refer to the conversion script you used. The DocSnap-recommended packaging is the packed single-input model described above.
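A minimal sketch of building the packed array with numpy and the upstream tokenizer (the example text and sequence length are arbitrary; actually running the Core ML model is shown at the end of this README):

```python
import numpy as np
from transformers import AutoTokenizer

T = 256  # or 512
tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-small", use_fast=True)
batch = tokenizer(
    "query: What is the total amount due?",
    truncation=True,
    padding="max_length",
    max_length=T,
    return_tensors="np",
)

# Row 0 = input_ids, row 1 = attention_mask, stacked into one [1, 2, T] int32 array.
packed = np.stack(
    [batch["input_ids"][0], batch["attention_mask"][0]]
).astype(np.int32)[None, ...]
print(packed.shape)  # (1, 2, 256)
```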
How to tokenize (recommended settings)
Use the upstream tokenizer:
- `truncation=True`
- `padding="max_length"`
- `max_length=256` (recall) or `512` (rerank)
- Ensure `input_ids` and `attention_mask` have the same length `T`.
Usage examples
Python (sanity check embeddings vs upstream)

    from transformers import AutoTokenizer, AutoModel
    import torch
    import numpy as np

    MODEL_ID = "intfloat/multilingual-e5-small"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
    model = AutoModel.from_pretrained(MODEL_ID).eval()

    def e5_embed(text: str, max_len: int):
        batch = tokenizer(
            text,
            return_tensors="pt",
            truncation=True,
            padding="max_length",
            max_length=max_len,
        )
        with torch.no_grad():
            out = model(**batch)
        x = out.last_hidden_state
        # Masked mean pooling over the last hidden states.
        mask = batch["attention_mask"].unsqueeze(-1).to(x.dtype)
        x = x * mask
        pooled = x.sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)
        # L2-normalize to unit length (matches the Core ML model's output).
        emb = pooled / torch.linalg.norm(pooled, dim=1, keepdim=True).clamp(min=1e-12)
        return emb.cpu().numpy()

    print(e5_embed("passage: Hello world", 256).shape)  # (1, 384)
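Continuing from the snippet above, here is a hedged sketch of comparing the Core ML model against this PyTorch reference using `coremltools` (prediction requires macOS; the `.mlpackage` file name is hypothetical and the call assumes the packed single-input variant):

```python
import coremltools as ct

text, T = "passage: Hello world", 256
batch = tokenizer(text, truncation=True, padding="max_length", max_length=T, return_tensors="np")
packed = np.stack([batch["input_ids"][0], batch["attention_mask"][0]]).astype(np.int32)[None, ...]

mlmodel = ct.models.MLModel("MultilingualE5Small.mlpackage")  # hypothetical file name
coreml_emb = mlmodel.predict({"packed": packed})["embedding"]  # (1, 384)

ref_emb = e5_embed(text, T)
# Both vectors are unit-length, so the dot product is the cosine similarity;
# expect ~1.0 (small deviations from float16 execution are normal).
print(float(np.dot(coreml_emb[0], ref_emb[0])))
```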