Use the model card from the authors instead

README.md (changed)

The previous model card, removed by this commit:

# CamemBERT: a Tasty French Language Model

## Table of Contents

- [Model Details](#model-details)
- [Uses](#uses)
- [Risks, Limitations and Biases](#risks-limitations-and-biases)
- [Training](#training)
- [Evaluation](#evaluation)
- [Citation Information](#citation-information)
- [How to Get Started With the Model](#how-to-get-started-with-the-model)

## Model Details

- **Model Description:** CamemBERT is a state-of-the-art language model for French based on the RoBERTa model. It is now available on Hugging Face in 6 different versions, with varying numbers of parameters, amounts of pretraining data, and pretraining data source domains.
- **Developed by:** Louis Martin\*, Benjamin Muller\*, Pedro Javier Ortiz Suárez\*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
- **Model Type:** Fill-Mask
- **Language(s):** French
- **License:** MIT
- **Parent Model:** See the [RoBERTa base model](https://huggingface.co/roberta-base) for more information.
- **Resources for more information:**
  - [Research Paper](https://arxiv.org/abs/1911.03894)
  - [Camembert Website](https://camembert-model.fr/)

## Uses

## Risks, Limitations and Biases

**CONTENT WARNING: Readers should be aware this section contains content that is disturbing, offensive, and can propagate historical and current stereotypes.**

Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).

This model was pretrained on a subcorpus of the OSCAR multilingual corpus. Some of the limitations and risks associated with the OSCAR dataset, which are detailed further in the [OSCAR dataset card](https://huggingface.co/datasets/oscar), include the following:

> The quality of some OSCAR sub-corpora might be lower than expected, specifically for the lowest-resource languages.

> Constructed from Common Crawl, personal and sensitive information might be present.

## Training

#### Training Data

OSCAR, or Open Super-large Crawled Aggregated coRpus, is a multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant architecture.
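
To get a feel for what this data looks like, the French portion of OSCAR can be sampled with the `datasets` library. This is a minimal sketch, assuming the original OSCAR release on the Hugging Face Hub (dataset name `oscar`, config `unshuffled_deduplicated_fr`); it is not part of the original training pipeline:

```python
from datasets import load_dataset

# Stream the French split of OSCAR instead of downloading the full corpus.
oscar_fr = load_dataset("oscar", "unshuffled_deduplicated_fr", split="train", streaming=True)

# Peek at the first few crawled documents.
for i, document in enumerate(oscar_fr):
    print(document["text"][:80])
    if i == 2:
        break
```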

#### Training Procedure

| Model                                    | #params | Arch. | Training data                          |
|------------------------------------------|---------|-------|----------------------------------------|
| `camembert-base`                         | 110M    | Base  | OSCAR (138 GB of text)                 |
| `camembert/camembert-large`              | 335M    | Large | CCNet (135 GB of text)                 |
| `camembert/camembert-base-ccnet`         | 110M    | Base  | CCNet (135 GB of text)                 |
| `camembert/camembert-base-wikipedia-4gb` | 110M    | Base  | Subsample of Wikipedia (4 GB of text)  |
| `camembert/camembert-base-oscar-4gb`     | 110M    | Base  | Subsample of OSCAR (4 GB of text)      |
| `camembert/camembert-base-ccnet-4gb`     | 110M    | Base  | Subsample of CCNet (4 GB of text)      |

## Evaluation

The model developers evaluated CamemBERT on four downstream tasks for French: part-of-speech (POS) tagging, dependency parsing, named entity recognition (NER), and natural language inference (NLI).
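
As an illustration of what such a downstream setup can look like (a sketch under the assumption of a recent `transformers` version, not the authors' evaluation code), NLI can be framed as 3-way sequence-pair classification with `CamembertForSequenceClassification`:

```python
import torch
from transformers import CamembertForSequenceClassification, CamembertTokenizer

# 3 NLI labels: entailment / neutral / contradiction. The classification head
# here is randomly initialised; it would be fine-tuned on an NLI dataset.
model = CamembertForSequenceClassification.from_pretrained("camembert-base", num_labels=3)
tokenizer = CamembertTokenizer.from_pretrained("camembert-base")

# Encode a premise/hypothesis pair as one sequence pair.
inputs = tokenizer("Le camembert est un fromage.", "Le camembert se mange.", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape [1, 3]
```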

## Citation Information

```bibtex
@inproceedings{martin2020camembert,
  title={CamemBERT: a Tasty French Language Model},
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year={2020}
}
```

## How to Get Started With the Model

##### Load CamemBERT and its sub-word tokenizer:

```python
from transformers import CamembertModel, CamembertTokenizer

# You can replace "camembert-base" with any other model from the table, e.g. "camembert/camembert-large".
tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
camembert = CamembertModel.from_pretrained("camembert-base")

camembert.eval()  # disable dropout (or leave in train mode to finetune)
```

##### Filling masks using pipeline

```python
from transformers import pipeline

camembert_fill_mask = pipeline("fill-mask", model="camembert-base", tokenizer="camembert-base")
results = camembert_fill_mask("Le camembert est <mask> :)")
# results
# [{'sequence': '<s> Le camembert est ... :)</s>', 'score': ..., 'token': ...},
#  ...]
```

##### Extract contextual embedding features from Camembert output

```python
import torch

# Tokenize in sub-words with SentencePiece
tokenized_sentence = tokenizer.tokenize("J'aime le camembert !")

# 1-hot encode and add special starting and end tokens
encoded_sentence = tokenizer.encode(tokenized_sentence)
# [5, ...]
# NB: Can be done in one step: tokenizer.encode("J'aime le camembert !")

# Feed tokens to Camembert as a torch tensor (batch dim 1)
encoded_sentence = torch.tensor(encoded_sentence).unsqueeze(0)
embeddings, _ = camembert(encoded_sentence)
# embeddings.detach()
# embeddings.size torch.Size([1, 10, 768])
#tensor([[[ ...
#         [ 0. ...
#         [-0. ...
#         ...,
```

##### Extract contextual embedding features from all Camembert layers

```python
from transformers import CamembertConfig
# (Need to reload the model with new config)
config = CamembertConfig.from_pretrained("camembert-base", output_hidden_states=True)
camembert = CamembertModel.from_pretrained("camembert-base", config=config)

embeddings, _, all_layer_embeddings = camembert(encoded_sentence)
# all_layer_embeddings: list of len(all_layer_embeddings) == 13 (input embedding layer + 12 self attention layers)
all_layer_embeddings[5]
# layer 5 contextual embedding: size torch.Size([1, 10, 768])
#tensor([[[-0. ...
#         [ ...
#         [ ...
#         ...,
```

The authors' model card, added in its place:

# CamemBERT: a Tasty French Language Model

## Introduction

[CamemBERT](https://arxiv.org/abs/1911.03894) is a state-of-the-art language model for French based on the RoBERTa model.

It is now available on Hugging Face in 6 different versions, with varying numbers of parameters, amounts of pretraining data, and pretraining data source domains.

For further information or requests, please visit the [Camembert Website](https://camembert-model.fr/).

## Pre-trained models

| Model                                    | #params | Arch. | Training data                          |
|------------------------------------------|---------|-------|----------------------------------------|
| `camembert-base`                         | 110M    | Base  | OSCAR (138 GB of text)                 |
| `camembert/camembert-large`              | 335M    | Large | CCNet (135 GB of text)                 |
| `camembert/camembert-base-ccnet`         | 110M    | Base  | CCNet (135 GB of text)                 |
| `camembert/camembert-base-wikipedia-4gb` | 110M    | Base  | Subsample of Wikipedia (4 GB of text)  |
| `camembert/camembert-base-oscar-4gb`     | 110M    | Base  | Subsample of OSCAR (4 GB of text)      |
| `camembert/camembert-base-ccnet-4gb`     | 110M    | Base  | Subsample of CCNet (4 GB of text)      |
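
Any identifier from the table can also be loaded through the generic auto classes; a minimal sketch (assuming a `transformers` version that ships `AutoTokenizer`/`AutoModel`):

```python
from transformers import AutoModel, AutoTokenizer

# Pick any model identifier from the table above.
model_name = "camembert/camembert-base-ccnet"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
```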

## How to use CamemBERT with HuggingFace

##### Load CamemBERT and its sub-word tokenizer:

```python
from transformers import CamembertModel, CamembertTokenizer

# You can replace "camembert/camembert-base-wikipedia-4gb" with any other model from the table, e.g. "camembert/camembert-large".
tokenizer = CamembertTokenizer.from_pretrained("camembert/camembert-base-wikipedia-4gb")
camembert = CamembertModel.from_pretrained("camembert/camembert-base-wikipedia-4gb")

camembert.eval()  # disable dropout (or leave in train mode to finetune)
```

##### Filling masks using pipeline

```python
from transformers import pipeline

camembert_fill_mask = pipeline("fill-mask", model="camembert/camembert-base-wikipedia-4gb", tokenizer="camembert/camembert-base-wikipedia-4gb")
results = camembert_fill_mask("Le camembert est un fromage de <mask>!")
# results
# [{'sequence': '<s> Le camembert est un fromage de chèvre!</s>', 'score': 0.4937814474105835, 'token': 19370},
#  {'sequence': '<s> Le camembert est un fromage de brebis!</s>', 'score': 0.06255942583084106, 'token': 30616},
#  {'sequence': '<s> Le camembert est un fromage de montagne!</s>', 'score': 0.04340197145938873, 'token': 2364},
#  {'sequence': '<s> Le camembert est un fromage de Noël!</s>', 'score': 0.02823255956172943, 'token': 3236},
#  {'sequence': '<s> Le camembert est un fromage de vache!</s>', 'score': 0.021357402205467224, 'token': 12329}]
```
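
For readers curious about what the pipeline abstracts away, the same top-5 prediction can be done by hand with the masked-LM head. This is a rough sketch assuming a recent `transformers` version (where model outputs expose `.logits`), not code from the original card:

```python
import torch
from transformers import CamembertForMaskedLM, CamembertTokenizer

tokenizer = CamembertTokenizer.from_pretrained("camembert/camembert-base-wikipedia-4gb")
model = CamembertForMaskedLM.from_pretrained("camembert/camembert-base-wikipedia-4gb")
model.eval()

inputs = tokenizer("Le camembert est un fromage de <mask>!", return_tensors="pt")
mask_position = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits

# Softmax over the vocabulary at the masked position, then keep the 5 best tokens.
probabilities = logits[0, mask_position].softmax(dim=-1)
top5 = probabilities.topk(5)
for score, token_id in zip(top5.values.tolist(), top5.indices.tolist()):
    print(tokenizer.convert_ids_to_tokens(token_id), round(score, 4))
```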

##### Extract contextual embedding features from Camembert output

```python
import torch

# Tokenize in sub-words with SentencePiece
tokenized_sentence = tokenizer.tokenize("J'aime le camembert !")

# 1-hot encode and add special starting and end tokens
encoded_sentence = tokenizer.encode(tokenized_sentence)
# [5, 221, 10, 10600, 14, 8952, 10540, 75, 1114, 6]
# NB: Can be done in one step: tokenizer.encode("J'aime le camembert !")

# Feed tokens to Camembert as a torch tensor (batch dim 1)
encoded_sentence = torch.tensor(encoded_sentence).unsqueeze(0)
embeddings, _ = camembert(encoded_sentence)
# embeddings.detach()
# embeddings.size torch.Size([1, 10, 768])
#tensor([[[-0.0928,  0.0506, -0.0094,  ..., -0.2388,  0.1177, -0.1302],
#         [ 0.0662,  0.1030, -0.2355,  ..., -0.4224, -0.0574, -0.2802],
#         [-0.0729,  0.0547,  0.0192,  ..., -0.1743,  0.0998, -0.2677],
#         ...,
```
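
A common next step, not covered by the original card: collapse the per-token embeddings into one fixed-size sentence vector. Mean pooling over the token axis is a simple heuristic for that:

```python
# embeddings has shape [1, 10, 768]: (batch, tokens, hidden size).
# Averaging over the token axis gives a single 768-dimensional sentence vector.
sentence_vector = embeddings.mean(dim=1).squeeze(0)  # torch.Size([768])
```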

##### Extract contextual embedding features from all Camembert layers

```python
from transformers import CamembertConfig
# (Need to reload the model with new config)
config = CamembertConfig.from_pretrained("camembert/camembert-base-wikipedia-4gb", output_hidden_states=True)
camembert = CamembertModel.from_pretrained("camembert/camembert-base-wikipedia-4gb", config=config)

embeddings, _, all_layer_embeddings = camembert(encoded_sentence)
# all_layer_embeddings: list of len(all_layer_embeddings) == 13 (input embedding layer + 12 self attention layers)
all_layer_embeddings[5]
# layer 5 contextual embedding: size torch.Size([1, 10, 768])
#tensor([[[-0.0059, -0.0227,  0.0065,  ..., -0.0770,  0.0369,  0.0095],
#         [ 0.2838, -0.1531, -0.3642,  ..., -0.0027, -0.8502, -0.7914],
#         [-0.0073, -0.0338, -0.0011,  ...,  0.0533, -0.0250, -0.0061],
#         ...,
```
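
Once all 13 layers are available, a classic BERT-style feature-extraction recipe is to combine the last four hidden layers. This sketch is an illustration, not a recommendation from the authors:

```python
import torch

# all_layer_embeddings is the list of 13 tensors produced above.
last_four = all_layer_embeddings[-4:]

summed = torch.stack(last_four).sum(dim=0)    # torch.Size([1, 10, 768])
concatenated = torch.cat(last_four, dim=-1)   # torch.Size([1, 10, 3072])
```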

## Authors

CamemBERT was trained and evaluated by Louis Martin\*, Benjamin Muller\*, Pedro Javier Ortiz Suárez\*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.

## Citation

If you use our work, please cite:

```bibtex
@inproceedings{martin2020camembert,
  title={CamemBERT: a Tasty French Language Model},
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year={2020}
}
```