Readme polished
README.md
inference: false
tags:
- BERT
- HPLT
- encoder
license: apache-2.0
datasets:
- HPLT/hplt_monolingual_v1_2
---

# HPLT Bert for English

This is one of the encoder-only monolingual language models trained as a first release by the [HPLT project](https://hplt-project.org/).
It is a so-called masked language model. In particular, we used a modification of the classic BERT model named [LTG-BERT](https://aclanthology.org/2023.findings-eacl.146/).

A monolingual LTG-BERT model is trained for every major language in the [HPLT 1.2 data release](https://hplt-project.org/datasets/v1.2) (*75* models total).

All the HPLT encoder-only models use the same hyper-parameters, roughly following the BERT-base setup:
- hidden size: 768
- vocabulary size: 32768
|
Every model uses its own tokenizer trained on language-specific HPLT data.
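
The shared hyper-parameters and the language-specific vocabulary can be checked directly against a published checkpoint. The snippet below is a minimal sketch rather than part of the official model card: the Hub repository id `HPLT/hplt_bert_base_en` and the use of `trust_remote_code=True` for the custom LTG-BERT configuration class are assumptions.

```python
from transformers import AutoConfig, AutoTokenizer

# Assumed Hub repository id for the English model.
MODEL_ID = "HPLT/hplt_bert_base_en"

# The language-specific tokenizer described above.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# trust_remote_code=True is assumed to be needed for the custom LTG-BERT config class.
config = AutoConfig.from_pretrained(MODEL_ID, trust_remote_code=True)

print(tokenizer.vocab_size)  # expected to match the listed vocabulary size of 32768
print(config.hidden_size)    # expected to match the listed hidden size of 768
```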

See sizes of the training corpora, evaluation results and more in our [language model training report](https://hplt-project.org/HPLT_D4_1___First_language_models_trained.pdf).
|
The [training code](https://github.com/hplt-project/HPLT-WP4) is available on GitHub.

The [training statistics of all 75 runs](https://api.wandb.ai/links/ltg/kduj7mjn) are logged on Weights & Biases.

## Example usage
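
Below is a minimal masked language modeling sketch. It assumes the model is published on the Hugging Face Hub as `HPLT/hplt_bert_base_en`, that the custom LTG-BERT modeling code is loaded with `trust_remote_code=True`, and that the tokenizer defines a `[MASK]` token; adjust these details if the repository differs.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_ID = "HPLT/hplt_bert_base_en"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# The custom LTG-BERT architecture is assumed to require trust_remote_code=True.
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID, trust_remote_code=True)

# Encode a sentence with one masked position.
text = "The capital of England is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
mask_id = tokenizer.convert_tokens_to_ids("[MASK]")

with torch.no_grad():
    logits = model(**inputs).logits

# Fill the [MASK] position with the highest-scoring token and decode the result.
predicted_ids = torch.where(
    inputs.input_ids == mask_id, logits.argmax(dim=-1), inputs.input_ids
)
print(tokenizer.decode(predicted_ids[0], skip_special_tokens=True))
```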