Readme polished
README.md
inference: false
tags:
- BERT
- HPLT
- encoder
license: apache-2.0
datasets:
- HPLT/hplt_monolingual_v1_2
---

# HPLT Bert for English

This is one of the encoder-only monolingual language models trained as a first release by the [HPLT project](https://hplt-project.org/).
It is a so-called masked language model. In particular, we used a modification of the classic BERT model named [LTG-BERT](https://aclanthology.org/2023.findings-eacl.146/).

A monolingual LTG-BERT model is trained for every major language in the [HPLT 1.2 data release](https://hplt-project.org/datasets/v1.2) (*75* models total).

All the HPLT encoder-only models use the same hyper-parameters, roughly following the BERT-base setup:
- hidden size: 768
- vocabulary size: 32768
|
Every model uses its own tokenizer trained on language-specific HPLT data.
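
The shared hyper-parameters and the language-specific vocabulary can be checked directly against a published checkpoint. The snippet below is a minimal sketch rather than part of the official model card: the Hub repository id `HPLT/hplt_bert_base_en` and the use of `trust_remote_code=True` for the custom LTG-BERT configuration class are assumptions.

```python
from transformers import AutoConfig, AutoTokenizer

# Assumed Hub repository id for the English model.
MODEL_ID = "HPLT/hplt_bert_base_en"

# The language-specific tokenizer described above.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# trust_remote_code=True is assumed to be needed for the custom LTG-BERT config class.
config = AutoConfig.from_pretrained(MODEL_ID, trust_remote_code=True)

print(tokenizer.vocab_size)  # expected to match the listed vocabulary size of 32768
print(config.hidden_size)    # expected to match the listed hidden size of 768
```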

See sizes of the training corpora, evaluation results and more in our [language model training report](https://hplt-project.org/HPLT_D4_1___First_language_models_trained.pdf).
|
The [training code](https://github.com/hplt-project/HPLT-WP4) is available on GitHub.

The [training statistics of all 75 runs](https://api.wandb.ai/links/ltg/kduj7mjn) are logged on Weights & Biases.

## Example usage
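
Below is a minimal masked language modeling sketch. It assumes the model is published on the Hugging Face Hub as `HPLT/hplt_bert_base_en`, that the custom LTG-BERT modeling code is loaded with `trust_remote_code=True`, and that the tokenizer defines a `[MASK]` token; adjust these details if the repository differs.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_ID = "HPLT/hplt_bert_base_en"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# The custom LTG-BERT architecture is assumed to require trust_remote_code=True.
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID, trust_remote_code=True)

# Encode a sentence with one masked position.
text = "The capital of England is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
mask_id = tokenizer.convert_tokens_to_ids("[MASK]")

with torch.no_grad():
    logits = model(**inputs).logits

# Fill the [MASK] position with the highest-scoring token and decode the result.
predicted_ids = torch.where(
    inputs.input_ids == mask_id, logits.argmax(dim=-1), inputs.input_ids
)
print(tokenizer.decode(predicted_ids[0], skip_special_tokens=True))
```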