---
tags:
- sentence-transformers
- sentence-similarity
- dataset_size:40000
- loss:MSELoss
- multilingual
base_model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
widget:
- source_sentence: Who is filming along?
sentences:
- Wién filmt mat?
- >-
Weider huet den Tatarescu drop higewisen, datt Rumänien durch seng
krichsbedélegong op de 6eite vun den allie'erten 110.000 mann verluer hätt.
- Brambilla 130.08.03 St.
- source_sentence: 'Four potential scenarios could still play out: Jean Asselborn.'
sentences:
- >-
Dann ass nach eng Antenne hei um Kierchbierg virgesi Richtung RTL Gebai, do
gëtt jo een ganz neie Wunnquartier gebaut.
- >-
D'bedélegong un de wählen wir ganz stärk gewiéscht a munche ge'genden wor re
eso'gucr me' we' 90 prozent.
- Jean Asselborn gesäit 4 Méiglechkeeten, wéi et kéint virugoen.
- source_sentence: >-
Non-profit organisation Passerell, which provides legal council to refugees
in Luxembourg, announced that it has to make four employees redundant in
August due to a lack of funding.
sentences:
- Oetringen nach Remich....8.20» 215»
- >-
D'ASBL Passerell, déi sech ëm d'Berodung vu Refugiéeën a Saache Rechtsfroe
këmmert, wäert am August mussen hir véier fix Salariéen entloossen.
- D'Regierung huet allerdéngs "just" 180.041 Doudeger verzeechent.
- source_sentence: This regulation was temporarily lifted during the Covid pandemic.
sentences:
- Six Jours vu New-York si fir d’équipe Girgetti — Debacco
- Dës Reegelung gouf wärend der Covid-Pandemie ausgesat.
- ING-Marathon ouni gréisser Tëschefäll ofgelaf - 18 Leit hospitaliséiert.
- source_sentence: The cross-border workers should also receive more wages.
sentences:
- D'grenzarbechetr missten och me' lo'n kre'en.
- >-
De Néckel: Firun! Dât ass jo ailes, wèll 't get dach neischt un der Bréck
gemâcht!
- >-
D'Grande-Duchesse Josephine Charlotte an hir Ministeren hunn d'Land
verlooss, et war den Optakt vun der Zäit am Exil.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
model-index:
- name: >-
SentenceTransformer based on
sentence-transformers/paraphrase-multilingual-mpnet-base-v2
results:
- task:
type: contemporary-lb
name: Contemporary-lb
dataset:
name: Contemporary-lb
type: contemporary-lb
metrics:
- type: accuracy
value: 0.594
name: SIB-200(LB) accuracy
- type: accuracy
value: 0.805
name: ParaLUX accuracy
- task:
type: bitext-mining
name: LBHistoricalBitextMining
dataset:
name: LBHistoricalBitextMining
type: lb-en
metrics:
- type: accuracy
value: 0.8932
name: LB<->FR accuracy
- type: accuracy
value: 0.8955
name: LB<->EN accuracy
- type: mean_accuracy
value: 0.9144
name: LB<->DE accuracy
license: agpl-3.0
datasets:
- impresso-project/HistLuxAlign
- fredxlpy/LuxAlign
language:
- lb
---
# Luxembourgish adaptation of sentence-transformers/paraphrase-multilingual-mpnet-base-v2
This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned from [sentence-transformers/paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2), further adapted to support Historical and Contemporary Luxembourgish. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for (cross-lingual) semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
## Model Details
This model is specialised for cross-lingual semantic search to and from Historical/Contemporary Luxembourgish, which makes it particularly useful for libraries and archives that want to perform semantic search and longitudinal studies within their collections.
It is a [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) model that was further adapted by Michail et al. (2025).
## Limitations
This model only supports inputs of up to 128 subtokens; longer inputs are truncated.
We also release a model that performs better (by 7.5pp) on Historical Bitext Mining and natively supports long contexts (8192 subtokens). For most use cases, we recommend [histlux-gte-multilingual-base](https://huggingface.co/impresso-project/histlux-gte-multilingual-base).
However, the model presented here performs substantially better (by 18pp) on the adversarial paraphrase discrimination task ParaLUX.
### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [sentence-transformers/paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2)
- **Maximum Sequence Length:** 128 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **Training Dataset:**
- LB-EN (Historical, Modern)
### Model Sources
- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
## Usage
### Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("impresso-project/histlux-paraphrase-multilingual-mpnet-base-v2")
# Run inference
sentences = [
'The cross-border workers should also receive more wages.',
"D'grenzarbechetr missten och me' lo'n kre'en.",
"De Néckel: Firun! Dât ass jo ailes, wèll 't get dach neischt un der Bréck gemâcht!",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
```
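`model.similarity` defaults to cosine similarity between the embedding rows. As a minimal sketch of that computation, here is the equivalent NumPy calculation on placeholder vectors (not real model output):

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities between the row vectors of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / norms  # each row now has unit length
    return normalized @ normalized.T

# Placeholder embeddings (3 "sentences", 4 dimensions) for illustration only
emb = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
sim = cosine_similarity_matrix(emb)
print(sim.shape)                    # (3, 3)
print(round(float(sim[0, 1]), 3))   # 0.707
```

The diagonal is always 1.0 (each vector is maximally similar to itself), and off-diagonal entries lie in [-1, 1].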
## Evaluation
### Metrics
(see the introducing paper for details)

Historical Bitext Mining (accuracy):

| Direction | Accuracy |
|:----------|---------:|
| LB → FR   | 88.6     |
| FR → LB   | 90.0     |
| LB → EN   | 88.7     |
| EN → LB   | 90.4     |
| LB → DE   | 91.1     |
| DE → LB   | 91.8     |

Contemporary Luxembourgish (accuracy):

| Task        | Accuracy |
|:------------|---------:|
| ParaLUX     | 80.5     |
| SIB-200 (LB)| 59.4     |
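Bitext mining accuracy here means: for each source-language sentence, retrieve the nearest target-language embedding by cosine similarity and count how often the aligned translation is ranked first. A minimal NumPy sketch of that scoring on toy data (the function name and data are illustrative, not the evaluation code used in the paper):

```python
import numpy as np

def bitext_mining_accuracy(src: np.ndarray, tgt: np.ndarray) -> float:
    """Fraction of source rows whose nearest target row (by cosine
    similarity) has the same index, i.e. is the aligned translation."""
    src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = src_n @ tgt_n.T                     # (n_src, n_tgt) similarity matrix
    predictions = sims.argmax(axis=1)          # best-matching target per source
    return float((predictions == np.arange(len(src))).mean())

# Toy aligned "embeddings": each target is a slightly noisy copy of its source
rng = np.random.default_rng(0)
src = rng.normal(size=(10, 8))
tgt = src + 0.01 * rng.normal(size=(10, 8))
print(bitext_mining_accuracy(src, tgt))  # 1.0
```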
## Training Details
### Training Dataset
#### LB-EN (Historical, Modern)
* Dataset: lb-en (mixed)
* Size: 40,000 training samples
* Columns: english, luxembourgish, and label (teacher's en embeddings)
* Approximate statistics based on the first 1000 samples:
  |         | english | luxembourgish | label |
  |:--------|:--------|:--------------|:------|
  | type    | string  | string        | list  |
* Samples:
  | english | luxembourgish | label |
  |:--------|:--------------|:------|
  | A lesson for the next year | Eng le’er fir dat anert joer | [0.08891881257295609, 0.20895496010780334, -0.10672671347856522, -0.03302554786205292, 0.049002278596162796, ...] |
  | On Easter, the Maquisards' northern section organizes their big spring ball in Willy Pintsch's hall at the station. | Op O'schteren organisieren d'Maquisard'eiii section Nord, hire gro'sse fre'joersbal am sali Willy Pintsch op der gare. | [-0.08668982982635498, -0.06969941407442093, -0.0036096556577831507, 0.1605304628610611, -0.041704729199409485, ...] |
  | The happiness, the peace is long gone now, | V ergângen ass nu läng dat gléck, de' fréd, | [0.07229219377040863, 0.3288629353046417, -0.012548360042273998, 0.06720984727144241, -0.02617395855486393, ...] |
* Loss: [MSELoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#mseloss)
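Under this objective, the student model is trained so that its embeddings of both the English sentence and its Luxembourgish counterpart match the teacher's English embedding (multilingual knowledge distillation; Reimers & Gurevych, 2020). A minimal NumPy sketch of the loss computation on placeholder vectors:

```python
import numpy as np

def distillation_mse(student_emb: np.ndarray, teacher_emb: np.ndarray) -> float:
    """Mean squared error between student and teacher embedding batches."""
    return float(np.mean((student_emb - teacher_emb) ** 2))

# Placeholder batch: 2 sentences, 4-dimensional embeddings
teacher = np.array([[0.1, 0.2, 0.3, 0.4],
                    [0.0, 0.1, 0.0, 0.1]])
student = teacher + 0.1  # student is off by a constant 0.1 in every dimension
print(distillation_mse(student, teacher))  # ≈ 0.01
```

Minimising this loss pulls Luxembourgish sentence embeddings toward the teacher's English embedding space, which is what enables cross-lingual retrieval.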
### Training Hyperparameters
#### Non-Default Hyperparameters
- `learning_rate`: 2e-05
- `num_train_epochs`: 5
- `warmup_ratio`: 0.1
- `bf16`: True

All other hyperparameters were left at their defaults.
### Framework Versions
- Python: 3.11.11
- Sentence Transformers: 3.4.1
- Transformers: 4.49.0
- PyTorch: 2.6.0
- Accelerate: 1.4.0
- Datasets: 3.3.2
- Tokenizers: 0.21.0
## Citation
### BibTeX
#### Adapting Multilingual Embedding Models to Historical Luxembourgish (introducing paper)
```bibtex
@misc{michail2025adaptingmultilingualembeddingmodels,
title={Adapting Multilingual Embedding Models to Historical Luxembourgish},
author={Andrianos Michail and Corina Julia Raclé and Juri Opitz and Simon Clematide},
year={2025},
eprint={2502.07938},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.07938},
}
```
#### Multilingual Knowledge Distillation
```bibtex
@inproceedings{reimers-2020-multilingual-sentence-bert,
title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2020",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/2004.09813",
}
```
## About Impresso
### Impresso project
[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027) by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585) and the Luxembourg National Research Fund under grant No. 17498891.
### Copyright
Copyright (C) 2025 The Impresso team.
### License
This program is provided as open source under the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE) v3 or later.