Multilingual E5 Small — Core ML Embedder (256/512)

This repository provides a Core ML conversion of the Hugging Face model:

  • Upstream model: intfloat/multilingual-e5-small
  • Purpose: on-device text embeddings for semantic search / retrieval / RAG
  • Output: a 384-dim L2-normalized embedding vector
  • Deployment target: iOS 16+

This is a community conversion for on-device use. It is not an official release from the upstream author.


What’s inside

  • A Core ML .mlpackage model that outputs embeddings using:
    • Mean pooling over the last hidden states (masked by attention_mask)
    • L2 normalization to unit length

Supported sequence lengths

This model is designed for the common “fast recall + deep rerank” strategy:

  • 256 tokens (default, fast / energy-friendly)
  • 512 tokens (deep / higher quality for long passages)

Important: E5 prefix rules (do this or quality drops)

E5 models are trained with task prefixes, so always apply them:

  • Document chunk / indexed text: passage: ...
  • User query: query: ...

Examples:

  • passage: Invoice total is $199.00 due on Dec 17, 2025.
  • query: What is the total amount due?
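
For example, a minimal helper that applies the prefixes before tokenization (the function names are illustrative, not part of any API):

def as_passage(text: str) -> str:
    # Indexed text must carry the "passage: " prefix
    return f"passage: {text}"

def as_query(text: str) -> str:
    # User queries must carry the "query: " prefix
    return f"query: {text}"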

Inputs / Outputs

Output

  • embedding: float32 (or float16, depending on the Core ML compute unit), shape [1, 384]
  • The vector is already L2-normalized, so cosine similarity reduces to a dot product (sketched below).
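
Because every vector is unit length, scoring a query against a set of stored document vectors is a single matrix-vector product. A minimal NumPy sketch (the variable names are illustrative):

import numpy as np

def cosine_scores(query_emb: np.ndarray, doc_embs: np.ndarray) -> np.ndarray:
    # query_emb: [384]; doc_embs: [N, 384]; both already L2-normalized,
    # so the dot product equals cosine similarity
    return doc_embs @ query_emb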

Input format (packed single-input variant)

To support iOS 16 with fixed enumerated shapes (256 / 512) using one flexible input, the model expects:

  • packed: int32 multiarray of shape [1, 2, T]
    • T is 256 or 512
    • packed[0, 0, :] = input_ids
    • packed[0, 1, :] = attention_mask

If the .mlpackage in this repo instead uses separate inputs (input_ids, attention_mask), refer to the conversion script that produced it. The DocSnap-recommended packaging is the packed single-input model described above.
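
Assuming the packed single-input variant, the [1, 2, T] array can be assembled from tokenizer output with NumPy (a sketch; the helper name is illustrative):

import numpy as np

def pack_inputs(input_ids, attention_mask, T: int) -> np.ndarray:
    # Row 0 holds input_ids, row 1 holds attention_mask; both must
    # already be padded to length T (256 or 512)
    packed = np.zeros((1, 2, T), dtype=np.int32)
    packed[0, 0, :] = np.asarray(input_ids, dtype=np.int32)
    packed[0, 1, :] = np.asarray(attention_mask, dtype=np.int32)
    return packed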


How to tokenize (recommended settings)

Use the upstream tokenizer:

  • truncation=True
  • padding="max_length"
  • max_length = 256 (recall) or 512 (rerank)
  • Ensure input_ids and attention_mask are both padded to the same length T (see the sketch below).
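
With the Hugging Face tokenizer, those settings look like this (a sketch; switch max_length to 512 for the rerank pass):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-small", use_fast=True)

enc = tokenizer(
    "query: What is the total amount due?",
    truncation=True,
    padding="max_length",
    max_length=256,
    return_tensors="np",
)
# Both arrays have shape [1, 256]; take row 0 and pack as shown above
input_ids = enc["input_ids"][0]
attention_mask = enc["attention_mask"][0]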

Usage examples

Python (sanity check embeddings vs upstream)

from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

MODEL_ID = "intfloat/multilingual-e5-small"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
model = AutoModel.from_pretrained(MODEL_ID).eval()

def e5_embed(text: str, max_len: int):
    batch = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding="max_length",
        max_length=max_len,
    )
    with torch.no_grad():
        out = model(**batch)
        x = out.last_hidden_state
        # Masked mean pooling: zero out padding positions, then average
        mask = batch["attention_mask"].unsqueeze(-1).to(x.dtype)
        x = x * mask
        pooled = x.sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)
        # L2-normalize to unit length, matching the Core ML model's output
        emb = pooled / torch.linalg.norm(pooled, dim=1, keepdim=True).clamp(min=1e-12)
    return emb.cpu().numpy()

print(e5_embed("passage: Hello world", 256).shape)  # (1, 384)
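
On macOS, coremltools can run the converted package directly, which makes it easy to compare against the upstream embedding. A sketch that continues the script above (the .mlpackage file name here is an assumption; the packed / embedding names follow the I/O description earlier, so verify both against the model in this repo):

import coremltools as ct

# Assumed file name; check the actual package name in this repo
mlmodel = ct.models.MLModel("multilingual-e5-small.mlpackage")

enc = tokenizer(
    "passage: Hello world",
    truncation=True,
    padding="max_length",
    max_length=256,
    return_tensors="np",
)
packed = np.zeros((1, 2, 256), dtype=np.int32)
packed[0, 0, :] = enc["input_ids"][0]
packed[0, 1, :] = enc["attention_mask"][0]

coreml_emb = mlmodel.predict({"packed": packed})["embedding"]  # shape [1, 384]
ref_emb = e5_embed("passage: Hello world", 256)
print(float(coreml_emb[0] @ ref_emb[0]))  # cosine similarity; should be close to 1.0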