---
language:
- en
- az
license: cc-by-4.0
tags:
- sentence-embeddings
- sentence-similarity
- text-embeddings
- bilingual
- azerbaijani
- english
- all-minilm-l6-v2
- bge-small-en-v1.5
- distillation
pipeline_tag: sentence-similarity
model-index:
- name: Lroc/az-en-MiniLM-L6-v2-30M
  results:
  - task:
      type: Semantic Textual Similarity
      name: Semantic Textual Similarity (Azerbaijani)
    dataset:
      name: Azerbaijani STS Benchmarks (Average)
      type: LocalDoc/Azerbaijani-STS-Average
    metrics:
    - type: Pearson Correlation
      value: 0.7266
      name: Average Pearson
      verified: false
---

# Bilingual Azerbaijani-English Sentence Embedding Model (az-en-MiniLM-L6-v2)

This is a sentence-transformers model that maps sentences and paragraphs in **Azerbaijani (az)** and **English (en)** to a 384-dimensional dense vector space.
It is designed for semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering in these two languages.

The model is based on `sentence-transformers/all-MiniLM-L6-v2` and was fine-tuned using knowledge distillation from the `BAAI/bge-small-en-v1.5` English embedding model.
It uses a custom bilingual (Azerbaijani-English) SentencePiece Unigram tokenizer with a vocabulary of ~50k tokens, trained from scratch.


## Model Details

*   **Base Architecture:** `sentence-transformers/all-MiniLM-L6-v2` (6 layers, 384 hidden dimension, 12 attention heads)
*   **Parameters:** ~30.2 Million (after vocabulary expansion)
*   **Tokenizer:** Custom bilingual (AZ-EN) SentencePiece Unigram, vocab size ~50k. Available at [LocalDoc/az-en-unigram-tokenizer-50k](https://huggingface.co/LocalDoc/az-en-unigram-tokenizer-50k). The tokenizer training code is available at [vrashad/azerbaijani_tokenizer](https://github.com/vrashad/azerbaijani_tokenizer).
*   **Output Dimension:** 384
*   **Max Sequence Length:** 512 tokens
*   **Training:** Fine-tuned for 3 epochs on a parallel corpus of ~4.14 million Azerbaijani-English sentence pairs using MSELoss for knowledge distillation from `BAAI/bge-small-en-v1.5`.
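
The parameter count above can be checked with a little arithmetic. The sketch below counts the weights of a BERT-style encoder and is not taken from the training code. It assumes the standard `all-MiniLM-L6-v2` configuration (intermediate size 1536, 512 positions, 2 token types, a pooler layer) and a vocabulary of exactly 50,000 tokens, whereas this card only states ~50k. The helper name `bert_param_count` is hypothetical.

```python
def bert_param_count(vocab_size, hidden=384, layers=6, intermediate=1536,
                     max_positions=512, type_vocab=2, with_pooler=True):
    """Count weights and biases of a BERT-style encoder (assumed layout)."""
    layer_norm = 2 * hidden  # scale + bias
    # Word, position and token-type embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, followed by a LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (up and down projection), followed by a LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    pooler = hidden * hidden + hidden if with_pooler else 0
    return embeddings + layers * (attention + feed_forward) + pooler


original = bert_param_count(30_522)  # vocabulary size of all-MiniLM-L6-v2
expanded = bert_param_count(50_000)  # assumption: the card only says ~50k
print(f"original: {original:,}  expanded: {expanded:,}")
```

Under these assumptions, almost all of the growth comes from the larger word-embedding matrix (vocabulary size times 384), while the six transformer layers are unchanged.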

## Performance on Azerbaijani STS Benchmarks

This model performs well on the Azerbaijani Semantic Textual Similarity (STS) benchmarks from [LocalDoc-Azerbaijan/STS-Benchmark](https://github.com/LocalDoc-Azerbaijan/STS-Benchmark), achieving results competitive with, and in some cases better than, larger multilingual models.
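
As a rough illustration of what an STS Pearson score measures, the sketch below correlates cosine similarities of sentence-pair embeddings with gold similarity ratings. It is not the benchmark's evaluation script: the toy 2-dimensional vectors, the gold scores, and the helper names `cosine` and `pearson` are all made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Toy data: (embedding of sentence 1, embedding of sentence 2) per pair,
# plus a human similarity rating for each pair (hypothetical values).
pairs = [([1.0, 0.0], [1.0, 0.1]),
         ([1.0, 0.0], [0.5, 0.5]),
         ([1.0, 0.0], [0.0, 1.0])]
gold = [4.8, 2.5, 0.2]

predicted = [cosine(a, b) for a, b in pairs]
score = pearson(predicted, gold)
print(f"Pearson correlation: {score:.4f}")
```

With real data, the embeddings would come from `model.encode` and the gold ratings from the benchmark dataset.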

The following results were obtained after **3 epochs** of training:

| Dataset | Pearson Correlation |
| :-------------------------------------- | :------------------: |
| LocalDoc/Azerbaijani-STSBenchmark | 0.7595 |
| LocalDoc/Azerbaijani-biosses-sts | 0.7410 |
| LocalDoc/Azerbaijani-sickr-sts | 0.7432 |
| LocalDoc/Azerbaijani-sts12-sts | 0.7644 |
| LocalDoc/Azerbaijani-sts13-sts | 0.6336 |
| LocalDoc/Azerbaijani-sts15-sts | 0.7597 |
| LocalDoc/Azerbaijani-sts16-sts | 0.6848 |
| **Average Pearson** | **0.7266** |
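
The average in the last row is the unweighted mean of the seven per-dataset scores, which can be reproduced as follows (a small sketch using only the values from the table above):

```python
from statistics import mean

# Per-dataset Pearson correlations copied from the table above.
scores = {
    "LocalDoc/Azerbaijani-STSBenchmark": 0.7595,
    "LocalDoc/Azerbaijani-biosses-sts": 0.7410,
    "LocalDoc/Azerbaijani-sickr-sts": 0.7432,
    "LocalDoc/Azerbaijani-sts12-sts": 0.7644,
    "LocalDoc/Azerbaijani-sts13-sts": 0.6336,
    "LocalDoc/Azerbaijani-sts15-sts": 0.7597,
    "LocalDoc/Azerbaijani-sts16-sts": 0.6848,
}

average = mean(scores.values())
print(f"Average Pearson: {average:.4f}")  # Average Pearson: 0.7266
```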

**Comparison with other models (average Pearson, assumed to be measured on the same Azerbaijani STS benchmarks):**

*   LocalDoc/TEmA-small: `0.7959`
*   Cohere/embed-multilingual-v3.0: `0.7823`
*   BAAI/bge-m3: `0.7577`
*   intfloat/multilingual-e5-large-instruct: `0.7377`
*   Cohere/embed-multilingual-v2.0: `0.7318`
*   OpenAI/text-embedding-3-large: `0.7288`
*   intfloat/multilingual-e5-large: `0.7280`
*   **LocalDoc/az-en-MiniLM-L6-v2: `0.7266`**
*   sentence-transformers/LaBSE: `0.7250`
*   intfloat/multilingual-e5-small: `0.7242`
*   Cohere/embed-multilingual-light-v3.0: `0.7142`
*   intfloat/multilingual-e5-base: `0.6960`


## How to Use

First, install the `sentence-transformers` library:
```bash
pip install -U sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer

model_id = "LocalDoc/az-en-MiniLM-L6-v2"

try:
    model = SentenceTransformer(model_id)
    print(f"Model {model_id} loaded successfully!")
except Exception as e:
    # Loading relies on the custom tokenizer 'LocalDoc/az-en-unigram-tokenizer-50k'.
    # If it fails, check that the tokenizer is accessible and that its
    # dependencies (protobuf, sentencepiece_model_pb2.py) are available.
    # This should not be needed if the tokenizer is correctly packaged on the Hub:
    #   pip install protobuf
    #   wget -P ./az_en_tokenizer_hf/ https://raw.githubusercontent.com/google/sentencepiece/master/python/src/sentencepiece/sentencepiece_model_pb2.py
    print(f"Failed to load model: {e}")
    raise  # the rest of the example needs `model`, so stop here


# Example Azerbaijani sentences
sentences_az = [
    "Azərbaycanın paytaxtı Bakı şəhəridir.",
    "Bu gün hava çox istidir."
]

# Example English sentences
sentences_en = [
    "The capital of Azerbaijan is the city of Baku.",
    "The weather is very hot today.",
    "I enjoy reading books."
]

print("\nEncoding Azerbaijani sentences...")
embeddings_az = model.encode(sentences_az)
for sent, emb in zip(sentences_az, embeddings_az):
    print(f"Sentence: {sent}")
    print(f"Embedding shape: {emb.shape}, first 3 dims: {emb[:3]}\n")

print("Encoding English sentences...")
embeddings_en = model.encode(sentences_en)
for sent, emb in zip(sentences_en, embeddings_en):
    print(f"Sentence: {sent}")
    print(f"Embedding shape: {emb.shape}, first 3 dims: {emb[:3]}\n")
```

### Example of calculating similarity

```python
from sentence_transformers.util import cos_sim

# Reuses sentences_az/sentences_en and embeddings_az/embeddings_en from the previous example.
similarity_matrix = cos_sim(embeddings_az[0], embeddings_en[0])
print(f"Similarity between '{sentences_az[0]}' and '{sentences_en[0]}': {similarity_matrix.item():.4f}")

similarity_matrix_diff = cos_sim(embeddings_az[0], embeddings_en[2])
print(f"Similarity between '{sentences_az[0]}' and '{sentences_en[2]}': {similarity_matrix_diff.item():.4f}")
```
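
For semantic search, the pairwise comparison above extends to ranking a set of candidates against a query. The sketch below is a minimal, dependency-free illustration. The 3-dimensional vectors are toy stand-ins for the 384-dimensional output of `model.encode`, and `rank_candidates` is a hypothetical helper, not part of `sentence-transformers`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query_emb, candidate_embs):
    """Return (index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine(query_emb, emb)) for i, emb in enumerate(candidate_embs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors; in practice use e.g. model.encode(sentences_az)[0] as the
# query and model.encode(sentences_en) as the candidates.
query = [0.9, 0.1, 0.0]
candidates = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]

ranking = rank_candidates(query, candidates)
for index, score in ranking:
    print(f"candidate {index}: {score:.4f}")
```

Because Azerbaijani and English sentences share one vector space, the query and the candidates can be in different languages.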

## Training

This model was fine-tuned from `sentence-transformers/all-MiniLM-L6-v2` using a **knowledge distillation** setup.

- **Teacher Model:** [`BAAI/bge-small-en-v1.5`](https://huggingface.co/BAAI/bge-small-en-v1.5) (used to generate target embeddings for English sentences).
- **Student Model:** Initialized from `sentence-transformers/all-MiniLM-L6-v2`.
- **Tokenizer:** A custom bilingual (Azerbaijani-English) [SentencePiece Unigram tokenizer](https://huggingface.co/LocalDoc/az-en-unigram-tokenizer-50k) (`LocalDoc/az-en-unigram-tokenizer-50k`) was used.  
  The student model's token embedding layer was resized to match the new vocabulary size (~50k).
- **Training Data:** A parallel corpus of approximately **4.14 million Azerbaijani-English sentence pairs**.
- **Loss Function:** `MSELoss`. The student model was trained so that its embeddings for both the Azerbaijani and the English sentence of each pair are close to the teacher model's embedding of the corresponding **English** sentence.
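
The distillation objective described above can be sketched as follows. This is an illustration of the idea rather than the actual training code: the toy 2-dimensional embeddings are made up, and summing the English and Azerbaijani terms with equal weight is an assumption (the card only states that `MSELoss` was used). The helper names `mse` and `distillation_loss` are hypothetical.

```python
def mse(pred, target):
    """Mean squared error over all elements of two equally shaped batches."""
    total = sum((p - t) ** 2
                for pred_row, target_row in zip(pred, target)
                for p, t in zip(pred_row, target_row))
    count = sum(len(row) for row in target)
    return total / count


def distillation_loss(student_en, student_az, teacher_en):
    """Pull both student embeddings of a pair towards the teacher's English embedding.

    Assumption: the two terms are summed with equal weight.
    """
    return mse(student_en, teacher_en) + mse(student_az, teacher_en)


# Toy batch of two sentence pairs with 2-dimensional embeddings.
teacher_en = [[0.2, 0.4], [0.6, 0.8]]  # teacher(English sentence)
student_en = [[0.2, 0.4], [0.6, 0.8]]  # student(English sentence), already matching
student_az = [[0.0, 0.4], [0.6, 1.0]]  # student(Azerbaijani translation)

loss = distillation_loss(student_en, student_az, teacher_en)
print(f"loss: {loss:.4f}")
```

Minimizing this loss moves a sentence and its translation towards the same point in the teacher's vector space, which is what makes cross-lingual similarity comparisons meaningful.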

### Training Hyperparameters

- **Epochs:** 3  
- **Batch Size:** 64  
- **Max Sequence Length:** 512  
- **Learning Rate:** 3e-4  
- **Warmup Ratio:** 0.15
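
To give a sense of scale, the hyperparameters above translate into the following approximate step counts. This is a back-of-the-envelope sketch: it assumes each sentence pair is one training example, takes the corpus size as exactly 4,140,000 (the card says ~4.14 million), and rounds the warmup steps up, none of which is confirmed by the training code.

```python
import math

num_pairs = 4_140_000  # assumption: the card states ~4.14 million pairs
batch_size = 64
epochs = 3
warmup_ratio = 0.15

steps_per_epoch = math.ceil(num_pairs / batch_size)
total_steps = steps_per_epoch * epochs
# Rounding up is an assumption about how the ratio is converted to steps.
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"steps per epoch: {steps_per_epoch:,}")
print(f"total steps:     {total_steps:,}")
print(f"warmup steps:    {warmup_steps:,}")
```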


## CC BY 4.0 License — What It Allows

This model is released under the **Creative Commons Attribution 4.0 International (CC BY 4.0)** license.

You are free to use, modify, and distribute the model, including for commercial purposes, as long as you give proper credit to the original creator.

For more information, please refer to the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/deed.en).


## Contact

For more information, questions, or issues, please contact LocalDoc at [v.resad.89@gmail.com](mailto:v.resad.89@gmail.com).