Commit 0fc52a4 by Hailay · verified · 1 Parent(s): c949c55

Create README.md

Files changed (1): README.md +55 -0
README.md ADDED
@@ -0,0 +1,55 @@
+ ---
+ language: ti
+ datasets:
+ - NLLB
+ library_name: transformers
+ tags:
+ - tigrinya
+ - masked-language-modeling
+ - xlmr
+ - low-resource
+ - multilingual
+ model_name: XLM-Roberta fine-tuned on Tigrinya (MLM)
+ license: apache-2.0
+ ---
+
+ # XLM-Roberta Fine-Tuned on Tigrinya (MLM)
+
+ This model is a fine-tuned version of [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) for the **Tigrinya language** (ትግርኛ), trained with the **Masked Language Modeling (MLM)** objective. It uses a custom BPE tokenizer built for Tigrinya; embeddings for the new vocabulary are initialized with a FastText-informed scheme.
+
+ ## 🔧 Details
+
+ - **Base model**: `xlm-roberta-base`
+ - **Language**: Tigrinya
+ - **Tokenizer**: Custom BPE tokenizer (non-morpheme-aware)
+ - **Adaptation**: Embedding initialization using weighted averages of pretrained XLM-R embeddings, guided by Tigrinya FastText word vectors
+ - **Training dataset**: Tigrinya side of the [NLLB (No Language Left Behind)](https://github.com/facebookresearch/flores) parallel corpus
+ - **Objective**: Masked Language Modeling (MLM)
+
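The adaptation step listed under Details can be illustrated with a small, library-free sketch. The model card does not spell out the exact weighting scheme, so everything below is an assumption made for illustration: the helper names (`cosine`, `init_embedding`), the use of cosine similarity in FastText space, the softmax over the top-k neighbours, the `temperature` parameter, and the toy vectors. The idea shown is that a new Tigrinya token receives a weighted average of the pretrained XLM-R embeddings of tokens present in both vocabularies, with weights derived from FastText similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def init_embedding(new_token, fasttext_vecs, xlmr_embs, top_k=2, temperature=0.1):
    """Hypothetical helper (not the author's code): build an embedding for
    `new_token` as a weighted average of the XLM-R embeddings of its nearest
    FastText neighbours.

    fasttext_vecs: token -> FastText vector (new token and shared tokens)
    xlmr_embs:     token -> pretrained XLM-R embedding (shared tokens only)
    """
    query = fasttext_vecs[new_token]
    # Rank tokens that exist in both spaces by FastText similarity to the new token.
    scored = sorted(
        ((cosine(query, fasttext_vecs[t]), t) for t in xlmr_embs if t in fasttext_vecs),
        reverse=True,
    )[:top_k]
    # A softmax over the similarities yields weights that sum to 1.
    exps = [math.exp(s / temperature) for s, _ in scored]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average of the neighbours' pretrained embeddings.
    dim = len(next(iter(xlmr_embs.values())))
    emb = [0.0] * dim
    for w, (_, tok) in zip(weights, scored):
        for i, x in enumerate(xlmr_embs[tok]):
            emb[i] += w * x
    return emb, dict(zip((t for _, t in scored), weights))


# Toy data, invented for illustration only.
fasttext_vecs = {
    "new": [1.0, 0.0],
    "a": [0.9, 0.1],
    "b": [0.0, 1.0],
    "c": [0.8, 0.3],
}
xlmr_embs = {"a": [1.0, 2.0, 3.0], "b": [-1.0, 0.0, 1.0], "c": [0.0, 1.0, 0.0]}

embedding, weights = init_embedding("new", fasttext_vecs, xlmr_embs)
print(weights)
print(embedding)
```

In this toy setup the two tokens closest to `new` in FastText space contribute to the result and the dissimilar one is ignored. A real pipeline would apply the same step to every token in the new tokenizer's vocabulary and write the results into the model's input embedding matrix.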
+ ## 🧪 Usage
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+ # Load the tokenizer and the MLM-fine-tuned model from the Hugging Face Hub
+ tokenizer = AutoTokenizer.from_pretrained("Hailay/xlmr-tigriyna-mlm")
+ model = AutoModelForMaskedLM.from_pretrained("Hailay/xlmr-tigriyna-mlm")
+
+ text = "ትግራይ ብምትሕብባ ንህዝቢ ግብሪ ቀጺሉ።"
+ inputs = tokenizer(text, return_tensors="pt")
+ outputs = model(**inputs)  # outputs.logits: (batch, sequence_length, vocab_size)
+ ```
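The `outputs.logits` tensor from a masked-LM forward pass holds one score per vocabulary entry at every input position, and predictions for a masked position are read off by ranking that position's scores. A minimal, library-free sketch of the ranking step follows; the function name `top_k_predictions`, the toy vocabulary, and the scores are made up for illustration and are not taken from the model.

```python
import math


def top_k_predictions(scores, vocab, k=3):
    """Hypothetical helper: turn raw scores for one position into the k most
    probable tokens, as (token, probability) pairs sorted by probability."""
    # Numerically stable softmax: subtract the max score before exponentiating.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy vocabulary and scores for a single masked position (illustrative values).
vocab = ["tok_a", "tok_b", "tok_c", "tok_d"]
scores = [2.0, 0.5, 3.0, -1.0]

predictions = top_k_predictions(scores, vocab, k=2)
for token, prob in predictions:
    print(f"{token}\t{prob:.3f}")
```

With the real model, the equivalent step in PyTorch would be to find the position where `inputs["input_ids"]` equals `tokenizer.mask_token_id`, take `outputs.logits[0, position]`, and call `.softmax(-1).topk(k)`. This assumes the input text contains `tokenizer.mask_token`, which the example sentence above does not.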
+
+ ## 📌 Intended Use
+
+ - Pretraining for Tigrinya NLP tasks
+ - Fine-tuning on classification, NER, QA, and other downstream tasks in Tigrinya
+ - Research in low-resource Semitic and morphologically rich languages
+
+ ## 📖 Citation
+
+     @misc{hailay2025tigrinya,
+       title={Tigrinya MLM with XLM-R and FastText-Informed Embedding Initialization},
+       author={Hailay Kidu},
+       year={2025},
+       url={https://huggingface.co/Hailay/xlmr-tigriyna-mlm}
+     }
+
+ ## 🏷️ License
+
+ Apache License 2.0