Distilling Tiny Embeddings
This article introduces a new series of BERT Hash Embeddings models, building on the previously released BERT Hash model series. These models generate fixed-dimensional vectors that can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering and more.
The BERT Hash Embeddings models are a strong alternative to MUVERA fixed-dimensional encoding with ColBERT models. MUVERA encodes the multi-vector output of ColBERT into a single dense vector. While this is a great step, the main issue with MUVERA is that it tends to need wide vectors to be effective (5K - 10K dimensions).
The following new models are released as part of this effort. All models have an Apache 2.0 license.
| Model | Description |
|---|---|
| bert-hash-femto-embeddings | 244K-parameter, 50-dimensional embeddings model |
| bert-hash-pico-embeddings | 448K-parameter, 80-dimensional embeddings model |
| bert-hash-nano-embeddings | 970K-parameter, 128-dimensional embeddings model |
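The models are designed to work like any other embeddings model. A minimal usage sketch is shown below; the Hugging Face Hub id is an assumption (NeuML organization) and may differ from the published name.

```python
from sentence_transformers import SentenceTransformer

# Load one of the released models (hub id assumed, adjust as needed)
model = SentenceTransformer("NeuML/bert-hash-nano-embeddings")

sentences = [
    "The quick brown fox jumps over the lazy dog",
    "A fast auburn fox leaps above a sleepy canine",
    "The stock market closed higher today"
]

# Encode into fixed-dimensional vectors (128 dimensions for the nano model)
embeddings = model.encode(sentences)
print(embeddings.shape)

# Pairwise cosine similarities
print(model.similarity(embeddings, embeddings))
```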
Training Process
The key to building micromodels is distillation. Knowledge distillation, as defined on Wikipedia, is "the process of transferring knowledge from a large model to a smaller one". The paper Well-Read Students Learn Better: On the Importance of Pre-training Compact Models established that pre-training compact models from scratch and then distilling knowledge from a larger teacher model for downstream tasks generates the best results for small models.
The training dataset is a subset of this embedding training collection. The training workflow was a two-step distillation process, as follows (illustrative sketches of both steps appear below).
- Distill embeddings from the larger mixedbread-ai/mxbai-embed-xsmall-v1 model using this model distillation script from Sentence Transformers.
- Build a distilled dataset of teacher scores using the mixedbread-ai/mxbai-rerank-xsmall-v1 cross-encoder for a random sample of the training dataset mentioned above.
- Further fine-tune the model on the distilled dataset using KLDivLoss.
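As a rough sketch of the first step, the Sentence Transformers model distillation approach trains the student to reproduce the teacher's embeddings with an MSE objective. The snippet below is illustrative only; the student path and training sentences are placeholders, and the actual script also handles details such as reducing the teacher output size when it doesn't match the student's dimensions.

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MSELoss

# Teacher used for the first distillation step
teacher = SentenceTransformer("mixedbread-ai/mxbai-embed-xsmall-v1")

# Student model (placeholder path)
student = SentenceTransformer("path/to/bert-hash-student")

# Subset of the embedding training collection (placeholder data)
sentences = ["first training sentence", "second training sentence"]

# Teacher embeddings become regression targets for the student
# Note: MSELoss expects matching dimensions - in practice the teacher output
# is projected down (e.g. via PCA) when the student vectors are smaller
labels = teacher.encode(sentences)

dataset = Dataset.from_dict({"sentence": sentences, "label": labels.tolist()})

trainer = SentenceTransformerTrainer(
    model=student,
    train_dataset=dataset,
    loss=MSELoss(student)
)
trainer.train()
```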
Many different models were tried for both the first and second steps. The following embedding models were evaluated as teachers for the first step.
- sentence-transformers/all-MiniLM-L6-v2
- mixedbread-ai/mxbai-embed-large-v1
- ibm-granite/granite-embedding-small-english-r2
- ibm-granite/granite-embedding-english-r2
- embeddinggemma-300m
- Qwen/Qwen3-Embedding-0.6B
- BAAI/bge-base-en-v1.5
- MongoDB/mdbr-leaf-mt
- MongoDB/mdbr-leaf-ir
- intfloat/e5-small
- intfloat/multilingual-e5-small
- nomic-ai/nomic-embed-text-v1.5
- sentence-transformers/all-mpnet-base-v2
The two models that worked best were mxbai-embed-xsmall-v1 and all-MiniLM-L6-v2. The leading theory is that student models this small don't have enough capacity to learn intricate details, so simpler teachers distill better.
The same was found for the KLDivLoss distillation step. The following cross-encoders were tried.
- cross-encoder/ms-marco-MiniLM-L6-v2
- mixedbread-ai/mxbai-rerank-large-v1
- tomaarsen/Qwen3-Reranker-0.6B-seq-cls
- ibm-granite/granite-embedding-reranker-english-r2
- dleemiller/CrossGemma-sts-300m
- dleemiller/ModernCE-large-sts
Once again it seemed that simpler models distill better.
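The second step can be illustrated with plain PyTorch. A cross-encoder teacher scores each query against a group of candidate passages, and the student is trained so that its similarity distribution over the group matches the teacher's score distribution via KL divergence. This is a conceptual sketch, not the exact training code; the example data is a placeholder and a real run would compute the similarities in a differentiable forward pass.

```python
import torch
import torch.nn.functional as F

from sentence_transformers import CrossEncoder, SentenceTransformer

# Teacher cross-encoder used to build the distilled score dataset
teacher = CrossEncoder("mixedbread-ai/mxbai-rerank-xsmall-v1")

# Student embeddings model (placeholder path)
student = SentenceTransformer("path/to/bert-hash-student")

# Each group is a query with candidate passages (placeholder data)
groups = [("what is machine learning", ["ml is a field of ai", "the sky is blue", "ml builds models from data"])]

for query, passages in groups:
    # Teacher relevance scores for each (query, passage) pair
    scores = torch.tensor(teacher.predict([(query, passage) for passage in passages]))

    # Student similarities between the query and each passage
    embeddings = student.encode([query] + passages, convert_to_tensor=True)
    similarities = F.cosine_similarity(embeddings[:1], embeddings[1:])

    # KL divergence between the student and teacher distributions over the group
    loss = F.kl_div(
        F.log_softmax(similarities, dim=-1),
        F.softmax(scores, dim=-1),
        reduction="batchmean"
    )

    # In the actual training loop, the similarities come from a differentiable
    # forward pass so this loss can be backpropagated into the student
```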
Evaluation Results
The following tables show results for a subset of BEIR, scored with the txtai benchmarks script. The evaluation compares against the ColBERT MUVERA series of models. Scores reported are ndcg@10 and are grouped into the following three categories.
BERT Hash Embeddings vs MUVERA
| Model | Parameters | NFCorpus | SciDocs | SciFact | Average |
|---|---|---|---|---|---|
| BERT Hash Femto Embeddings | 0.2M | 0.1402 | 0.0443 | 0.2830 | 0.1558 |
| BERT Hash Pico Embeddings | 0.4M | 0.2075 | 0.0812 | 0.3912 | 0.2266 |
| BERT Hash Nano Embeddings | 0.9M | 0.2562 | 0.1179 | 0.5032 | 0.2924 |
| ColBERT MUVERA Femto | 0.2M | 0.1851 | 0.0411 | 0.3518 | 0.1927 |
| ColBERT MUVERA Pico | 0.4M | 0.1926 | 0.0564 | 0.4424 | 0.2305 |
| ColBERT MUVERA Nano | 0.9M | 0.2355 | 0.0807 | 0.4904 | 0.2689 |
BERT Hash Embeddings vs MUVERA with maxsim re-ranking of the top 100 results, per the MUVERA paper
| Model | Parameters | NFCorpus | SciDocs | SciFact | Average |
|---|---|---|---|---|---|
| BERT Hash Femto Embeddings | 0.2M | 0.2242 | 0.0801 | 0.4719 | 0.2587 |
| BERT Hash Pico Embeddings | 0.4M | 0.2702 | 0.1104 | 0.5965 | 0.3257 |
| BERT Hash Nano Embeddings | 0.9M | 0.3101 | 0.1347 | 0.6327 | 0.3592 |
| ColBERT MUVERA Femto | 0.2M | 0.2316 | 0.0858 | 0.4641 | 0.2605 |
| ColBERT MUVERA Pico | 0.4M | 0.2821 | 0.1004 | 0.6090 | 0.3305 |
| ColBERT MUVERA Nano | 0.9M | 0.2996 | 0.1201 | 0.6249 | 0.3482 |
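For reference, the maxsim re-ranking used in the table above is standard ColBERT late interaction scoring: each candidate returned by the single-vector search is re-scored by summing, for every query token embedding, its maximum similarity against the document's token embeddings. A minimal sketch of that scoring function is below (the token embedding inputs are assumptions).

```python
import numpy as np

def maxsim(query_embeddings, document_embeddings):
    """
    ColBERT-style late interaction score.

    query_embeddings: (query tokens x dim) array
    document_embeddings: (document tokens x dim) array
    """

    # Pairwise similarities between every query token and document token
    similarities = query_embeddings @ document_embeddings.T

    # Best-matching document token for each query token, summed
    return similarities.max(axis=1).sum()

# Re-rank the top 100 candidates from the single-vector search
# reranked = sorted(top100, key=lambda x: maxsim(query, docs[x]), reverse=True)
```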
Comparison to other models
| Model | Parameters | NFCorpus | SciDocs | SciFact | Average |
|---|---|---|---|---|---|
| ColBERT MUVERA Femto (full multi-vector maxsim) | 0.2M | 0.2513 | 0.0870 | 0.4710 | 0.2698 |
| ColBERT MUVERA Pico (full multi-vector maxsim) | 0.4M | 0.3005 | 0.1117 | 0.6452 | 0.3525 |
| ColBERT MUVERA Nano (full multi-vector maxsim) | 0.9M | 0.3180 | 0.1262 | 0.6576 | 0.3673 |
| all-MiniLM-L6-v2 | 22.7M | 0.3089 | 0.2164 | 0.6527 | 0.3927 |
| mxbai-embed-xsmall-v1 | 24.1M | 0.3186 | 0.2155 | 0.6598 | 0.3980 |
In analyzing the results, bert-hash-nano-embeddings is better across the board vs MUVERA with colbert-muvera-nano. With maxsim re-ranking, it keeps 98% of the performance of full multi-vector maxsim vs 95% for MUVERA. Comparing output widths, a standard MUVERA encoding is 10240 dimensions vs 128 here: storing 10K float32 vectors takes roughly 400 MB vs 5 MB.
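The storage comparison works out as follows, assuming uncompressed float32 vectors and 10K stored vectors.

```python
# Uncompressed float32 storage for 10,000 vectors
vectors, bytes_per_float = 10_000, 4

muvera = vectors * 10240 * bytes_per_float / 1e6   # ~410 MB
nano = vectors * 128 * bytes_per_float / 1e6       # ~5 MB

print(f"MUVERA (10240d): {muvera:.0f} MB vs Nano Embeddings (128d): {nano:.1f} MB")
```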
bert-hash-pico-embeddings and bert-hash-femto-embeddings are also competitive but aren't quite as impressive proportionally as bert-hash-nano-embeddings.
For a 970K parameter model, the scores are really good. When paired with re-ranking from a 970K ColBERT model, the scores are even better: competitive with the common small models shown above at only ~4% of the parameter count.
Wrapping Up
This article introduced the BERT Hash Embeddings series of models. These models all pack quite a punch for being under 1 million parameters. bert-hash-nano-embeddings is particularly impressive and has a lot of potential for edge and low-resource compute environments. When paired with a Nano ColBERT re-ranker, the combination is even more powerful. Both models can be exported using frameworks like LiteRT and ExecuTorch.
Use cases include on-device semantic search, similarity comparisons, LLM chunking and Retrieval Augmented Generation (RAG). The advantage is that data never needs to leave the device while still having solid performance.
One big takeaway from this effort is that complex billion-parameter models can't simply be distilled straight down into much smaller networks. Another is that the distillation process works better sequentially: using each distilled model to train the next smaller model worked better than distilling directly from the original larger model.
A future task would be to investigate whether, for example, embeddinggemma-300m could be distilled to 100M, then 50M, then 10M, then 1M parameters.
If you're interested in building custom models like this for your data or domain area, feel free to reach out!
NeuML is the company behind txtai and we provide AI consulting services around our stack. Schedule a meeting or send a message to learn more.
We're also building an easy and secure way to run hosted txtai applications with txtai.cloud.