Multilingual E5 Small — Core ML Embedder (256/512)

This repository provides a Core ML conversion of the Hugging Face model:

  • Upstream model: intfloat/multilingual-e5-small
  • Purpose: on-device text embeddings for semantic search / retrieval / RAG
  • Output: a 384-dim L2-normalized embedding vector
  • Deployment target: iOS 16+

This is a community conversion for on-device use. It is not an official release from the upstream author.


What’s inside

  • A Core ML .mlpackage model that outputs embeddings using:
    • Mean pooling over the last hidden states (masked by attention_mask)
    • L2 normalization to unit length

Supported sequence lengths

This model is designed for the common “fast recall + deep rerank” strategy:

  • 256 tokens (default, fast / energy-friendly)
  • 512 tokens (deep / higher quality for long passages)

Important: E5 prefix rules (do this or quality drops)

E5 models are trained with task prefixes, so always apply them:

  • Document chunk / indexed text: passage: ...
  • User query: query: ...

Examples:

  • passage: Invoice total is $199.00 due on Dec 17, 2025.
  • query: What is the total amount due?
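
For example, a minimal helper that applies the prefixes before tokenization (the function names are illustrative, not part of any API):

def as_passage(text: str) -> str:
    # Indexed text must carry the "passage: " prefix
    return f"passage: {text}"

def as_query(text: str) -> str:
    # User queries must carry the "query: " prefix
    return f"query: {text}"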

Inputs / Outputs

Output

  • embedding: float32 (or float16, depending on the Core ML compute unit), shape [1, 384]
  • The vector is already L2-normalized, so cosine similarity reduces to a dot product (sketched below).
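
Because every vector is unit length, scoring a query against a set of stored document vectors is a single matrix-vector product. A minimal NumPy sketch (the variable names are illustrative):

import numpy as np

def cosine_scores(query_emb: np.ndarray, doc_embs: np.ndarray) -> np.ndarray:
    # query_emb: [384]; doc_embs: [N, 384]; both already L2-normalized,
    # so the dot product equals cosine similarity
    return doc_embs @ query_emb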

Input format (packed single-input variant)

To support iOS 16 with fixed enumerated shapes (256 / 512) using one flexible input, the model expects:

  • packed: int32 multiarray of shape [1, 2, T]
    • T is 256 or 512
    • packed[0, 0, :] = input_ids
    • packed[0, 1, :] = attention_mask

If the .mlpackage in this repo instead uses separate inputs (input_ids, attention_mask), refer to the conversion script that produced it. The DocSnap-recommended packaging is the packed single-input model described above.
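
Assuming the packed single-input variant, the [1, 2, T] array can be assembled from tokenizer output with NumPy (a sketch; the helper name is illustrative):

import numpy as np

def pack_inputs(input_ids, attention_mask, T: int) -> np.ndarray:
    # Row 0 holds input_ids, row 1 holds attention_mask; both must
    # already be padded to length T (256 or 512)
    packed = np.zeros((1, 2, T), dtype=np.int32)
    packed[0, 0, :] = np.asarray(input_ids, dtype=np.int32)
    packed[0, 1, :] = np.asarray(attention_mask, dtype=np.int32)
    return packed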


How to tokenize (recommended settings)

Use the upstream tokenizer:

  • truncation=True
  • padding="max_length"
  • max_length = 256 (recall) or 512 (rerank)
  • Ensure input_ids and attention_mask are both padded to the same length T (see the sketch below).
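
With the Hugging Face tokenizer, those settings look like this (a sketch; switch max_length to 512 for the rerank pass):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-small", use_fast=True)

enc = tokenizer(
    "query: What is the total amount due?",
    truncation=True,
    padding="max_length",
    max_length=256,
    return_tensors="np",
)
# Both arrays have shape [1, 256]; take row 0 and pack as shown above
input_ids = enc["input_ids"][0]
attention_mask = enc["attention_mask"][0]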

Usage examples

Python (sanity check embeddings vs upstream)

from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

MODEL_ID = "intfloat/multilingual-e5-small"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
model = AutoModel.from_pretrained(MODEL_ID).eval()

def e5_embed(text: str, max_len: int):
    batch = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding="max_length",
        max_length=max_len,
    )
    with torch.no_grad():
        out = model(**batch)
        x = out.last_hidden_state
        # Masked mean pooling: zero out padding positions, then average
        mask = batch["attention_mask"].unsqueeze(-1).to(x.dtype)
        x = x * mask
        pooled = x.sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)
        # L2-normalize to unit length, matching the Core ML model's output
        emb = pooled / torch.linalg.norm(pooled, dim=1, keepdim=True).clamp(min=1e-12)
    return emb.cpu().numpy()

print(e5_embed("passage: Hello world", 256).shape)  # (1, 384)
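
On macOS, coremltools can run the converted package directly, which makes it easy to compare against the upstream embedding. A sketch that continues the script above (the .mlpackage file name here is an assumption; the packed / embedding names follow the I/O description earlier, so verify both against the model in this repo):

import coremltools as ct

# Assumed file name; check the actual package name in this repo
mlmodel = ct.models.MLModel("multilingual-e5-small.mlpackage")

enc = tokenizer(
    "passage: Hello world",
    truncation=True,
    padding="max_length",
    max_length=256,
    return_tensors="np",
)
packed = np.zeros((1, 2, 256), dtype=np.int32)
packed[0, 0, :] = enc["input_ids"][0]
packed[0, 1, :] = enc["attention_mask"][0]

coreml_emb = mlmodel.predict({"packed": packed})["embedding"]  # shape [1, 384]
ref_emb = e5_embed("passage: Hello world", 256)
print(float(coreml_emb[0] @ ref_emb[0]))  # cosine similarity; should be close to 1.0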