---
language:
- en
- az
license: cc-by-4.0
tags:
- sentence-embeddings
- sentence-similarity
- text-embeddings
- bilingual
- azerbaijani
- english
- all-minilm-l6-v2
- bge-small-en-v1.5
- distillation
pipeline_tag: sentence-similarity
model-index:
- name: Lroc/az-en-MiniLM-L6-v2-30M
  results:
  - task:
      type: Semantic Textual Similarity
      name: Semantic Textual Similarity (Azerbaijani)
    dataset:
      name: Azerbaijani STS Benchmarks (Average)
      type: LocalDoc/Azerbaijani-STS-Average
    metrics:
    - type: Pearson Correlation
      value: 0.7266
      name: Average Pearson
      verified: false
---

# Bilingual Azerbaijani-English Sentence Embedding Model (az-en-MiniLM-L6-v2)

This is a sentence-transformer model that maps sentences & paragraphs in **Azerbaijani (az)** and **English (en)** to a 384-dimensional dense vector space. It is designed for tasks like semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering for these two languages.

The model is based on `sentence-transformers/all-MiniLM-L6-v2` and was fine-tuned using knowledge distillation from the high-performance `BAAI/bge-small-en-v1.5` English embedding model. A custom bilingual (Azerbaijani-English) SentencePiece Unigram tokenizer with a vocabulary of ~50k was trained from scratch and is used by this model.

## Model Details

* **Base Architecture:** `sentence-transformers/all-MiniLM-L6-v2` (6 layers, 384 hidden dimension, 12 attention heads)
* **Parameters:** ~30.2 Million (after vocabulary expansion)
* **Tokenizer:** Custom bilingual (AZ-EN) SentencePiece Unigram, vocab size ~50k. Available at [LocalDoc/az-en-unigram-tokenizer-50k](https://huggingface.co/LocalDoc/az-en-unigram-tokenizer-50k).
  The training code is available in this repository: https://github.com/vrashad/azerbaijani_tokenizer
* **Output Dimension:** 384
* **Max Sequence Length:** 512 tokens
* **Training:** Fine-tuned for 3 epochs on a parallel corpus of ~4.14 million Azerbaijani-English sentence pairs using MSELoss for knowledge distillation from `BAAI/bge-small-en-v1.5`.

## Performance on Azerbaijani STS Benchmarks

This model performs well on the Azerbaijani Semantic Textual Similarity (STS) tasks from [LocalDoc-Azerbaijan/STS-Benchmark](https://github.com/LocalDoc-Azerbaijan/STS-Benchmark), achieving results competitive with, and in some cases surpassing, larger multilingual models.

The following results were obtained after **3 epochs** of training:

| Dataset                                 | Pearson Correlation  |
| :-------------------------------------- | :------------------: |
| LocalDoc/Azerbaijani-STSBenchmark       |        0.7595        |
| LocalDoc/Azerbaijani-biosses-sts        |        0.7410        |
| LocalDoc/Azerbaijani-sickr-sts          |        0.7432        |
| LocalDoc/Azerbaijani-sts12-sts          |        0.7644        |
| LocalDoc/Azerbaijani-sts13-sts          |        0.6336        |
| LocalDoc/Azerbaijani-sts15-sts          |        0.7597        |
| LocalDoc/Azerbaijani-sts16-sts          |        0.6848        |
| **Average Pearson**                     |      **0.7266**      |

**Comparison with other models (Average Pearson, assumed to be measured on the same Azerbaijani STS benchmarks):**

* LocalDoc/TEmA-small: `0.7959`
* Cohere/embed-multilingual-v3.0: `0.7823`
* BAAI/bge-m3: `0.7577`
* intfloat/multilingual-e5-large-instruct: `0.7377`
* Cohere/embed-multilingual-v2.0: `0.7318`
* OpenAI/text-embedding-3-large: `0.7288`
* intfloat/multilingual-e5-large: `0.7280`
* **LocalDoc/az-en-MiniLM-L6-v2: `0.7266`**
* sentence-transformers/LaBSE: `0.7250`
* intfloat/multilingual-e5-small: `0.7242`
* Cohere/embed-multilingual-light-v3.0: `0.7142`
* intfloat/multilingual-e5-base: `0.6960`

## How to Use

First, install the `sentence-transformers` library:

```bash
pip install -U sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer

model_id = "LocalDoc/az-en-MiniLM-L6-v2"

try:
    model = SentenceTransformer(model_id)
    print(f"Model {model_id} loaded successfully!")
except Exception as e:
    print(
        "Failed to load model. Ensure the tokenizer "
        "'LocalDoc/az-en-unigram-tokenizer-50k' is accessible and its dependencies "
        "(protobuf, sentencepiece_model_pb2.py) are met if loading fails."
    )
    print(f"Error: {e}")
    # You might need to ensure the tokenizer can be loaded.
    # If the tokenizer requires it (it shouldn't if it's correctly packaged
    # on the Hub by your tokenizer repo):
    # !pip install protobuf
    # !wget -P ./az_en_tokenizer_hf/ https://raw.githubusercontent.com/google/sentencepiece/master/python/src/sentencepiece/sentencepiece_model_pb2.py
    # model = SentenceTransformer(model_id)
    raise  # The rest of the example needs a loaded model, so stop here.

# Example Azerbaijani sentences
sentences_az = [
    "Azərbaycanın paytaxtı Bakı şəhəridir.",
    "Bu gün hava çox istidir."
]

# Example English sentences
sentences_en = [
    "The capital of Azerbaijan is the city of Baku.",
    "The weather is very hot today.",
    "I enjoy reading books."
]

print("\nEncoding Azerbaijani sentences...")
embeddings_az = model.encode(sentences_az)
for sent, emb in zip(sentences_az, embeddings_az):
    print(f"Sentence: {sent}")
    print(f"Embedding shape: {emb.shape}, first 3 dims: {emb[:3]}\n")

print("Encoding English sentences...")
embeddings_en = model.encode(sentences_en)
for sent, emb in zip(sentences_en, embeddings_en):
    print(f"Sentence: {sent}")
    print(f"Embedding shape: {emb.shape}, first 3 dims: {emb[:3]}\n")
```

### Example of calculating similarity

```python
from sentence_transformers.util import cos_sim

# Reuses sentences_az, sentences_en, embeddings_az and embeddings_en from the block above.

# A translation pair: the score is expected to be high.
similarity_matrix = cos_sim(embeddings_az[0], embeddings_en[0])
print(f"Similarity between '{sentences_az[0]}' and '{sentences_en[0]}': {similarity_matrix.item():.4f}")

# An unrelated pair: the score is expected to be lower.
similarity_matrix_diff = cos_sim(embeddings_az[0], embeddings_en[2])
print(f"Similarity between '{sentences_az[0]}' and '{sentences_en[2]}': {similarity_matrix_diff.item():.4f}")
```

## Training

This model was fine-tuned from `sentence-transformers/all-MiniLM-L6-v2` using a **knowledge distillation** setup.

- **Teacher Model:** [`BAAI/bge-small-en-v1.5`](https://huggingface.co/BAAI/bge-small-en-v1.5) (used to generate target embeddings for English sentences).
- **Student Model:** Initialized from `sentence-transformers/all-MiniLM-L6-v2`.
- **Tokenizer:** A custom bilingual (Azerbaijani-English) [SentencePiece Unigram tokenizer](https://huggingface.co/LocalDoc/az-en-unigram-tokenizer-50k) (`LocalDoc/az-en-unigram-tokenizer-50k`) was used. The student model's token embedding layer was resized to match the new vocabulary size (~50k).
- **Training Data:** A parallel corpus of approximately **4.14 million Azerbaijani-English sentence pairs**.
- **Loss Function:** `MSELoss`. The student model was trained to produce embeddings for both Azerbaijani and English sentences that are close to the teacher model's embeddings for the corresponding **English** sentences.
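
The distillation objective described above can be illustrated with a small, dependency-free sketch. It is not the actual training code: the function names (`mse`, `distillation_loss`), the choice to sum the English and Azerbaijani terms, and the toy 2-dimensional vectors are all assumptions made for illustration. The real model works with 384-dimensional embeddings.

```python
from typing import Sequence

Vector = Sequence[float]


def mse(pred: Sequence[Vector], target: Sequence[Vector]) -> float:
    """Mean squared error over every element of two equally shaped batches."""
    total, count = 0.0, 0
    for p_vec, t_vec in zip(pred, target):
        for p, t in zip(p_vec, t_vec):
            total += (p - t) ** 2
            count += 1
    return total / count


def distillation_loss(
    student_en: Sequence[Vector],
    student_az: Sequence[Vector],
    teacher_en: Sequence[Vector],
) -> float:
    """Pull both student embeddings toward the teacher's English embedding.

    Summing the two terms is an assumption of this sketch; the model card
    only states that MSELoss was used against the teacher's English output.
    """
    return mse(student_en, teacher_en) + mse(student_az, teacher_en)


# Toy batch of two sentence pairs (hypothetical values, 2 dims instead of 384).
teacher_en = [[1.0, 0.0], [0.0, 1.0]]  # teacher embeddings of the English side
student_en = [[1.0, 0.0], [0.0, 1.0]]  # student already matches on English
student_az = [[0.0, 0.0], [0.0, 1.0]]  # student is off on one Azerbaijani sentence

loss = distillation_loss(student_en, student_az, teacher_en)
print(f"Distillation loss: {loss:.4f}")
```

Because both the English and the Azerbaijani side are regressed onto the same English target, a sentence and its translation end up near each other in the shared 384-dimensional space, which is what makes cross-lingual similarity scores meaningful.
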

### Training Hyperparameters

- **Epochs:** 3
- **Batch Size:** 64
- **Max Sequence Length:** 512
- **Learning Rate:** 3e-4
- **Warmup Ratio:** 0.15

## CC BY 4.0 License: What It Allows

Under the **Creative Commons Attribution 4.0 International (CC BY 4.0)** license, you are free to use, modify, and distribute the model, even for commercial purposes, as long as you give proper credit to the original creator.

For more information, please refer to the CC BY 4.0 license.

## Contact

For more information, questions, or issues, please contact LocalDoc at [v.resad.89@gmail.com].