Distilling Tiny Embeddings
This article introduces a new series of BERT Hash Embeddings models, building on the previously released BERT Hash model series. These models generate fixed-dimensional vectors that can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering and more.
The BERT Hash Embeddings models are a strong alternative to MUVERA fixed-dimensional encoding with ColBERT models. MUVERA encodes the multi-vector output of ColBERT into a single dense vector. While this is a great step, the main issue with MUVERA is that it tends to need wide vectors to be effective (5K - 10K dimensions).
The following new models are released as part of this effort. All models have an Apache 2.0 license.
| Model | Description |
|---|---|
| bert-hash-femto-embeddings | 244K-parameter, 50-dimensional embeddings model |
| bert-hash-pico-embeddings | 448K-parameter, 80-dimensional embeddings model |
| bert-hash-nano-embeddings | 970K-parameter, 128-dimensional embeddings model |
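The models are designed to work like any other embeddings model. A minimal usage sketch is shown below; the Hugging Face Hub id is an assumption (NeuML organization) and may differ from the published name.

```python
from sentence_transformers import SentenceTransformer

# Load one of the released models (hub id assumed, adjust as needed)
model = SentenceTransformer("NeuML/bert-hash-nano-embeddings")

sentences = [
    "The quick brown fox jumps over the lazy dog",
    "A fast auburn fox leaps above a sleepy canine",
    "The stock market closed higher today"
]

# Encode into fixed-dimensional vectors (128 dimensions for the nano model)
embeddings = model.encode(sentences)
print(embeddings.shape)

# Pairwise cosine similarities
print(model.similarity(embeddings, embeddings))
```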
Training Process
The key to building micromodels is distillation. Knowledge distillation, as defined on Wikipedia, is "the process of transferring knowledge from a large model to a smaller one". The paper Well-Read Students Learn Better: On the Importance of Pre-training Compact Models established that pre-training compact models from scratch and then distilling knowledge from a larger teacher model for downstream tasks generates the best results for small models.
The training dataset is a subset of this embedding training collection. The training workflow was a two-step distillation process, as follows (illustrative sketches of both steps appear below).
- Distill embeddings from the larger mixedbread-ai/mxbai-embed-xsmall-v1 model using this model distillation script from Sentence Transformers.
- Build a distilled dataset of teacher scores using the mixedbread-ai/mxbai-rerank-xsmall-v1 cross-encoder for a random sample of the training dataset mentioned above.
- Further fine-tune the model on the distilled dataset using KLDivLoss.
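As a rough sketch of the first step, the Sentence Transformers model distillation approach trains the student to reproduce the teacher's embeddings with an MSE objective. The snippet below is illustrative only; the student path and training sentences are placeholders, and the actual script also handles details such as reducing the teacher output size when it doesn't match the student's dimensions.

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MSELoss

# Teacher used for the first distillation step
teacher = SentenceTransformer("mixedbread-ai/mxbai-embed-xsmall-v1")

# Student model (placeholder path)
student = SentenceTransformer("path/to/bert-hash-student")

# Subset of the embedding training collection (placeholder data)
sentences = ["first training sentence", "second training sentence"]

# Teacher embeddings become regression targets for the student
# Note: MSELoss expects matching dimensions - in practice the teacher output
# is projected down (e.g. via PCA) when the student vectors are smaller
labels = teacher.encode(sentences)

dataset = Dataset.from_dict({"sentence": sentences, "label": labels.tolist()})

trainer = SentenceTransformerTrainer(
    model=student,
    train_dataset=dataset,
    loss=MSELoss(student)
)
trainer.train()
```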
Many different models were tried for both the first and second steps. The following embedding models were evaluated as teachers for the first step.
- sentence-transformers/all-MiniLM-L6-v2
- mixedbread-ai/mxbai-embed-large-v1
- ibm-granite/granite-embedding-small-english-r2
- ibm-granite/granite-embedding-english-r2
- embeddinggemma-300m
- Qwen/Qwen3-Embedding-0.6B
- BAAI/bge-base-en-v1.5
- MongoDB/mdbr-leaf-mt
- MongoDB/mdbr-leaf-ir
- intfloat/e5-small
- intfloat/multilingual-e5-small
- nomic-ai/nomic-embed-text-v1.5
- sentence-transformers/all-mpnet-base-v2
The two models that worked best were mxbai-embed-xsmall-v1 and all-MiniLM-L6-v2. The leading theory is that student models this small don't have enough capacity to learn intricate details, so simpler teachers distill better.
The same was found for the KLDivLoss distillation step. The following cross-encoders were tried.
- cross-encoder/ms-marco-MiniLM-L6-v2
- mixedbread-ai/mxbai-rerank-large-v1
- tomaarsen/Qwen3-Reranker-0.6B-seq-cls
- ibm-granite/granite-embedding-reranker-english-r2
- dleemiller/CrossGemma-sts-300m
- dleemiller/ModernCE-large-sts
Once again it seemed that simpler models distill better.
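The second step can be illustrated with plain PyTorch. A cross-encoder teacher scores each query against a group of candidate passages, and the student is trained so that its similarity distribution over the group matches the teacher's score distribution via KL divergence. This is a conceptual sketch, not the exact training code; the example data is a placeholder and a real run would compute the similarities in a differentiable forward pass.

```python
import torch
import torch.nn.functional as F

from sentence_transformers import CrossEncoder, SentenceTransformer

# Teacher cross-encoder used to build the distilled score dataset
teacher = CrossEncoder("mixedbread-ai/mxbai-rerank-xsmall-v1")

# Student embeddings model (placeholder path)
student = SentenceTransformer("path/to/bert-hash-student")

# Each group is a query with candidate passages (placeholder data)
groups = [("what is machine learning", ["ml is a field of ai", "the sky is blue", "ml builds models from data"])]

for query, passages in groups:
    # Teacher relevance scores for each (query, passage) pair
    scores = torch.tensor(teacher.predict([(query, passage) for passage in passages]))

    # Student similarities between the query and each passage
    embeddings = student.encode([query] + passages, convert_to_tensor=True)
    similarities = F.cosine_similarity(embeddings[:1], embeddings[1:])

    # KL divergence between the student and teacher distributions over the group
    loss = F.kl_div(
        F.log_softmax(similarities, dim=-1),
        F.softmax(scores, dim=-1),
        reduction="batchmean"
    )

    # In the actual training loop, the similarities come from a differentiable
    # forward pass so this loss can be backpropagated into the student
```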
Evaluation Results
The following tables show results for a subset of BEIR, scored with the txtai benchmarks script. The evaluation compares against the ColBERT MUVERA series of models. Scores reported are ndcg@10 and are grouped into the following three categories.
BERT Hash Embeddings vs MUVERA
| Model | Parameters | NFCorpus | SciDocs | SciFact | Average |
|---|---|---|---|---|---|
| BERT Hash Femto Embeddings | 0.2M | 0.1402 | 0.0443 | 0.2830 | 0.1558 |
| BERT Hash Pico Embeddings | 0.4M | 0.2075 | 0.0812 | 0.3912 | 0.2266 |
| BERT Hash Nano Embeddings | 0.9M | 0.2562 | 0.1179 | 0.5032 | 0.2924 |
| ColBERT MUVERA Femto | 0.2M | 0.1851 | 0.0411 | 0.3518 | 0.1927 |
| ColBERT MUVERA Pico | 0.4M | 0.1926 | 0.0564 | 0.4424 | 0.2305 |
| ColBERT MUVERA Nano | 0.9M | 0.2355 | 0.0807 | 0.4904 | 0.2689 |
BERT Hash Embeddings vs MUVERA with maxsim re-ranking of the top 100 results, per the MUVERA paper
| Model | Parameters | NFCorpus | SciDocs | SciFact | Average |
|---|---|---|---|---|---|
| BERT Hash Femto Embeddings | 0.2M | 0.2242 | 0.0801 | 0.4719 | 0.2587 |
| BERT Hash Pico Embeddings | 0.4M | 0.2702 | 0.1104 | 0.5965 | 0.3257 |
| BERT Hash Nano Embeddings | 0.9M | 0.3101 | 0.1347 | 0.6327 | 0.3592 |
| ColBERT MUVERA Femto | 0.2M | 0.2316 | 0.0858 | 0.4641 | 0.2605 |
| ColBERT MUVERA Pico | 0.4M | 0.2821 | 0.1004 | 0.6090 | 0.3305 |
| ColBERT MUVERA Nano | 0.9M | 0.2996 | 0.1201 | 0.6249 | 0.3482 |
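For reference, the maxsim re-ranking used in the table above is standard ColBERT late interaction scoring: each candidate returned by the single-vector search is re-scored by summing, for every query token embedding, its maximum similarity against the document's token embeddings. A minimal sketch of that scoring function is below (the token embedding inputs are assumptions).

```python
import numpy as np

def maxsim(query_embeddings, document_embeddings):
    """
    ColBERT-style late interaction score.

    query_embeddings: (query tokens x dim) array
    document_embeddings: (document tokens x dim) array
    """

    # Pairwise similarities between every query token and document token
    similarities = query_embeddings @ document_embeddings.T

    # Best-matching document token for each query token, summed
    return similarities.max(axis=1).sum()

# Re-rank the top 100 candidates from the single-vector search
# reranked = sorted(top100, key=lambda x: maxsim(query, docs[x]), reverse=True)
```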
Comparison to other models
| Model | Parameters | NFCorpus | SciDocs | SciFact | Average |
|---|---|---|---|---|---|
| ColBERT MUVERA Femto (full multi-vector maxsim) | 0.2M | 0.2513 | 0.0870 | 0.4710 | 0.2698 |
| ColBERT MUVERA Pico (full multi-vector maxsim) | 0.4M | 0.3005 | 0.1117 | 0.6452 | 0.3525 |
| ColBERT MUVERA Nano (full multi-vector maxsim) | 0.9M | 0.3180 | 0.1262 | 0.6576 | 0.3673 |
| all-MiniLM-L6-v2 | 22.7M | 0.3089 | 0.2164 | 0.6527 | 0.3927 |
| mxbai-embed-xsmall-v1 | 24.1M | 0.3186 | 0.2155 | 0.6598 | 0.3980 |
In analyzing the results, bert-hash-nano-embeddings is better across the board vs MUVERA with colbert-muvera-nano. With maxsim re-ranking, it keeps 98% of the performance of full multi-vector maxsim vs 95% for MUVERA. Comparing output widths, a standard MUVERA encoding is 10240 dimensions vs 128 here: storing 10K float32 vectors takes roughly 400 MB vs 5 MB.
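The storage comparison works out as follows, assuming uncompressed float32 vectors and 10K stored vectors.

```python
# Uncompressed float32 storage for 10,000 vectors
vectors, bytes_per_float = 10_000, 4

muvera = vectors * 10240 * bytes_per_float / 1e6   # ~410 MB
nano = vectors * 128 * bytes_per_float / 1e6       # ~5 MB

print(f"MUVERA (10240d): {muvera:.0f} MB vs Nano Embeddings (128d): {nano:.1f} MB")
```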
bert-hash-pico-embeddings and bert-hash-femto-embeddings are also competitive but aren't quite as impressive proportionally as bert-hash-nano-embeddings.
For a 970K parameter model, the scores are really good. When paired with re-ranking from a 970K ColBERT model, the scores are even better: competitive with the common small models shown above at only ~4% of the parameter count.
Wrapping Up
This article introduced the BERT Hash Embeddings series of models. These models all pack quite a punch for being under 1 million parameters. bert-hash-nano-embeddings is particularly impressive and has a lot of potential for edge and low-resource compute environments. When paired with a Nano ColBERT re-ranker, the combination is even more powerful. Both models can be exported using frameworks like LiteRT and ExecuTorch.
Use cases include on-device semantic search, similarity comparisons, LLM chunking and Retrieval Augmented Generation (RAG). The advantage is that data never needs to leave the device while still having solid performance.
One big takeaway from this effort is that complex billion-parameter models can't simply be distilled straight down into much smaller networks. Another is that the distillation process works better sequentially: using each distilled model to train the next smaller model worked better than distilling directly from the original larger model.
A future task would be to investigate whether, for example, embeddinggemma-300m could be distilled to 100M, then 50M, then 10M, then 1M parameters.
If you're interested in building custom models like this for your data or domain area, feel free to reach out!
NeuML is the company behind txtai and we provide AI consulting services around our stack. Schedule a meeting or send a message to learn more.
We're also building an easy and secure way to run hosted txtai applications with txtai.cloud.