Use the model card from the authors instead

README.md (changed)

The previous model card, removed by this commit:

# CamemBERT: a Tasty French Language Model

## Table of Contents

- [Model Details](#model-details)
- [Uses](#uses)
- [Risks, Limitations and Biases](#risks-limitations-and-biases)
- [Training](#training)
- [Evaluation](#evaluation)
- [Citation Information](#citation-information)
- [How to Get Started With the Model](#how-to-get-started-with-the-model)

## Model Details

- **Model Description:** CamemBERT is a state-of-the-art language model for French based on the RoBERTa model. It is now available on Hugging Face in 6 different versions, with varying numbers of parameters, amounts of pretraining data, and pretraining data source domains.
- **Developed by:** Louis Martin\*, Benjamin Muller\*, Pedro Javier Ortiz Suárez\*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
- **Model Type:** Fill-Mask
- **Language(s):** French
- **License:** MIT
- **Parent Model:** See the [RoBERTa base model](https://huggingface.co/roberta-base) for more information.
- **Resources for more information:**
  - [Research Paper](https://arxiv.org/abs/1911.03894)
  - [Camembert Website](https://camembert-model.fr/)

## Uses

## Risks, Limitations and Biases

**CONTENT WARNING: Readers should be aware this section contains content that is disturbing, offensive, and can propagate historical and current stereotypes.**

Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).

This model was pretrained on a subcorpus of the OSCAR multilingual corpus. Some of the limitations and risks associated with the OSCAR dataset, which are detailed further in the [OSCAR dataset card](https://huggingface.co/datasets/oscar), include the following:

> The quality of some OSCAR sub-corpora might be lower than expected, specifically for the lowest-resource languages.

> Constructed from Common Crawl, personal and sensitive information might be present.

## Training

#### Training Data

OSCAR, or Open Super-large Crawled Aggregated coRpus, is a multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant architecture.
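
To get a feel for what this data looks like, the French portion of OSCAR can be sampled with the `datasets` library. This is a minimal sketch, assuming the original OSCAR release on the Hugging Face Hub (dataset name `oscar`, config `unshuffled_deduplicated_fr`); it is not part of the original training pipeline:

```python
from datasets import load_dataset

# Stream the French split of OSCAR instead of downloading the full corpus.
oscar_fr = load_dataset("oscar", "unshuffled_deduplicated_fr", split="train", streaming=True)

# Peek at the first few crawled documents.
for i, document in enumerate(oscar_fr):
    print(document["text"][:80])
    if i == 2:
        break
```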

#### Training Procedure

| Model                                    | #params | Arch. | Training data                          |
|------------------------------------------|---------|-------|----------------------------------------|
| `camembert-base`                         | 110M    | Base  | OSCAR (138 GB of text)                 |
| `camembert/camembert-large`              | 335M    | Large | CCNet (135 GB of text)                 |
| `camembert/camembert-base-ccnet`         | 110M    | Base  | CCNet (135 GB of text)                 |
| `camembert/camembert-base-wikipedia-4gb` | 110M    | Base  | Subsample of Wikipedia (4 GB of text)  |
| `camembert/camembert-base-oscar-4gb`     | 110M    | Base  | Subsample of OSCAR (4 GB of text)      |
| `camembert/camembert-base-ccnet-4gb`     | 110M    | Base  | Subsample of CCNet (4 GB of text)      |

## Evaluation

The model developers evaluated CamemBERT on four downstream tasks for French: part-of-speech (POS) tagging, dependency parsing, named entity recognition (NER), and natural language inference (NLI).
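
As an illustration of what such a downstream setup can look like (a sketch under the assumption of a recent `transformers` version, not the authors' evaluation code), NLI can be framed as 3-way sequence-pair classification with `CamembertForSequenceClassification`:

```python
import torch
from transformers import CamembertForSequenceClassification, CamembertTokenizer

# 3 NLI labels: entailment / neutral / contradiction. The classification head
# here is randomly initialised; it would be fine-tuned on an NLI dataset.
model = CamembertForSequenceClassification.from_pretrained("camembert-base", num_labels=3)
tokenizer = CamembertTokenizer.from_pretrained("camembert-base")

# Encode a premise/hypothesis pair as one sequence pair.
inputs = tokenizer("Le camembert est un fromage.", "Le camembert se mange.", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape [1, 3]
```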

## Citation Information

```bibtex
@inproceedings{martin2020camembert,
  title={CamemBERT: a Tasty French Language Model},
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year={2020}
}
```

## How to Get Started With the Model

##### Load CamemBERT and its sub-word tokenizer:

```python
from transformers import CamembertModel, CamembertTokenizer

# You can replace "camembert-base" with any other model from the table, e.g. "camembert/camembert-large".
tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
camembert = CamembertModel.from_pretrained("camembert-base")

camembert.eval()  # disable dropout (or leave in train mode to finetune)
```

##### Filling masks using pipeline

```python
from transformers import pipeline

camembert_fill_mask = pipeline("fill-mask", model="camembert-base", tokenizer="camembert-base")
results = camembert_fill_mask("Le camembert est <mask> :)")
# results
# [{'sequence': '<s> Le camembert est ... :)</s>', 'score': ..., 'token': ...},
#  ...]
```

##### Extract contextual embedding features from Camembert output

```python
import torch

# Tokenize in sub-words with SentencePiece
tokenized_sentence = tokenizer.tokenize("J'aime le camembert !")

# 1-hot encode and add special starting and end tokens
encoded_sentence = tokenizer.encode(tokenized_sentence)
# [5, ...]
# NB: Can be done in one step: tokenizer.encode("J'aime le camembert !")

# Feed tokens to Camembert as a torch tensor (batch dim 1)
encoded_sentence = torch.tensor(encoded_sentence).unsqueeze(0)
embeddings, _ = camembert(encoded_sentence)
# embeddings.detach()
# embeddings.size torch.Size([1, 10, 768])
#tensor([[[ ...
#         [ 0. ...
#         [-0. ...
#         ...,
```

##### Extract contextual embedding features from all Camembert layers

```python
from transformers import CamembertConfig
# (Need to reload the model with new config)
config = CamembertConfig.from_pretrained("camembert-base", output_hidden_states=True)
camembert = CamembertModel.from_pretrained("camembert-base", config=config)

embeddings, _, all_layer_embeddings = camembert(encoded_sentence)
# all_layer_embeddings: list of len(all_layer_embeddings) == 13 (input embedding layer + 12 self attention layers)
all_layer_embeddings[5]
# layer 5 contextual embedding: size torch.Size([1, 10, 768])
#tensor([[[-0. ...
#         [ ...
#         [ ...
#         ...,
```

The authors' model card, added in its place:

# CamemBERT: a Tasty French Language Model

## Introduction

[CamemBERT](https://arxiv.org/abs/1911.03894) is a state-of-the-art language model for French based on the RoBERTa model.

It is now available on Hugging Face in 6 different versions, with varying numbers of parameters, amounts of pretraining data, and pretraining data source domains.

For further information or requests, please visit the [Camembert Website](https://camembert-model.fr/).

## Pre-trained models

| Model                                    | #params | Arch. | Training data                          |
|------------------------------------------|---------|-------|----------------------------------------|
| `camembert-base`                         | 110M    | Base  | OSCAR (138 GB of text)                 |
| `camembert/camembert-large`              | 335M    | Large | CCNet (135 GB of text)                 |
| `camembert/camembert-base-ccnet`         | 110M    | Base  | CCNet (135 GB of text)                 |
| `camembert/camembert-base-wikipedia-4gb` | 110M    | Base  | Subsample of Wikipedia (4 GB of text)  |
| `camembert/camembert-base-oscar-4gb`     | 110M    | Base  | Subsample of OSCAR (4 GB of text)      |
| `camembert/camembert-base-ccnet-4gb`     | 110M    | Base  | Subsample of CCNet (4 GB of text)      |
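
Any identifier from the table can also be loaded through the generic auto classes; a minimal sketch (assuming a `transformers` version that ships `AutoTokenizer`/`AutoModel`):

```python
from transformers import AutoModel, AutoTokenizer

# Pick any model identifier from the table above.
model_name = "camembert/camembert-base-ccnet"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
```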

## How to use CamemBERT with HuggingFace

##### Load CamemBERT and its sub-word tokenizer:

```python
from transformers import CamembertModel, CamembertTokenizer

# You can replace "camembert/camembert-base-wikipedia-4gb" with any other model from the table, e.g. "camembert/camembert-large".
tokenizer = CamembertTokenizer.from_pretrained("camembert/camembert-base-wikipedia-4gb")
camembert = CamembertModel.from_pretrained("camembert/camembert-base-wikipedia-4gb")

camembert.eval()  # disable dropout (or leave in train mode to finetune)
```

##### Filling masks using pipeline

```python
from transformers import pipeline

camembert_fill_mask = pipeline("fill-mask", model="camembert/camembert-base-wikipedia-4gb", tokenizer="camembert/camembert-base-wikipedia-4gb")
results = camembert_fill_mask("Le camembert est un fromage de <mask>!")
# results
# [{'sequence': '<s> Le camembert est un fromage de chèvre!</s>', 'score': 0.4937814474105835, 'token': 19370},
#  {'sequence': '<s> Le camembert est un fromage de brebis!</s>', 'score': 0.06255942583084106, 'token': 30616},
#  {'sequence': '<s> Le camembert est un fromage de montagne!</s>', 'score': 0.04340197145938873, 'token': 2364},
#  {'sequence': '<s> Le camembert est un fromage de Noël!</s>', 'score': 0.02823255956172943, 'token': 3236},
#  {'sequence': '<s> Le camembert est un fromage de vache!</s>', 'score': 0.021357402205467224, 'token': 12329}]
```
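
For readers curious about what the pipeline abstracts away, the same top-5 prediction can be done by hand with the masked-LM head. This is a rough sketch assuming a recent `transformers` version (where model outputs expose `.logits`), not code from the original card:

```python
import torch
from transformers import CamembertForMaskedLM, CamembertTokenizer

tokenizer = CamembertTokenizer.from_pretrained("camembert/camembert-base-wikipedia-4gb")
model = CamembertForMaskedLM.from_pretrained("camembert/camembert-base-wikipedia-4gb")
model.eval()

inputs = tokenizer("Le camembert est un fromage de <mask>!", return_tensors="pt")
mask_position = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits

# Softmax over the vocabulary at the masked position, then keep the 5 best tokens.
probabilities = logits[0, mask_position].softmax(dim=-1)
top5 = probabilities.topk(5)
for score, token_id in zip(top5.values.tolist(), top5.indices.tolist()):
    print(tokenizer.convert_ids_to_tokens(token_id), round(score, 4))
```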

##### Extract contextual embedding features from Camembert output

```python
import torch

# Tokenize in sub-words with SentencePiece
tokenized_sentence = tokenizer.tokenize("J'aime le camembert !")

# 1-hot encode and add special starting and end tokens
encoded_sentence = tokenizer.encode(tokenized_sentence)
# [5, 221, 10, 10600, 14, 8952, 10540, 75, 1114, 6]
# NB: Can be done in one step: tokenizer.encode("J'aime le camembert !")

# Feed tokens to Camembert as a torch tensor (batch dim 1)
encoded_sentence = torch.tensor(encoded_sentence).unsqueeze(0)
embeddings, _ = camembert(encoded_sentence)
# embeddings.detach()
# embeddings.size torch.Size([1, 10, 768])
#tensor([[[-0.0928,  0.0506, -0.0094,  ..., -0.2388,  0.1177, -0.1302],
#         [ 0.0662,  0.1030, -0.2355,  ..., -0.4224, -0.0574, -0.2802],
#         [-0.0729,  0.0547,  0.0192,  ..., -0.1743,  0.0998, -0.2677],
#         ...,
```
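
A common next step, not covered by the original card: collapse the per-token embeddings into one fixed-size sentence vector. Mean pooling over the token axis is a simple heuristic for that:

```python
# embeddings has shape [1, 10, 768]: (batch, tokens, hidden size).
# Averaging over the token axis gives a single 768-dimensional sentence vector.
sentence_vector = embeddings.mean(dim=1).squeeze(0)  # torch.Size([768])
```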

##### Extract contextual embedding features from all Camembert layers

```python
from transformers import CamembertConfig
# (Need to reload the model with new config)
config = CamembertConfig.from_pretrained("camembert/camembert-base-wikipedia-4gb", output_hidden_states=True)
camembert = CamembertModel.from_pretrained("camembert/camembert-base-wikipedia-4gb", config=config)

embeddings, _, all_layer_embeddings = camembert(encoded_sentence)
# all_layer_embeddings: list of len(all_layer_embeddings) == 13 (input embedding layer + 12 self attention layers)
all_layer_embeddings[5]
# layer 5 contextual embedding: size torch.Size([1, 10, 768])
#tensor([[[-0.0059, -0.0227,  0.0065,  ..., -0.0770,  0.0369,  0.0095],
#         [ 0.2838, -0.1531, -0.3642,  ..., -0.0027, -0.8502, -0.7914],
#         [-0.0073, -0.0338, -0.0011,  ...,  0.0533, -0.0250, -0.0061],
#         ...,
```
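
Once all 13 layers are available, a classic BERT-style feature-extraction recipe is to combine the last four hidden layers. This sketch is an illustration, not a recommendation from the authors:

```python
import torch

# all_layer_embeddings is the list of 13 tensors produced above.
last_four = all_layer_embeddings[-4:]

summed = torch.stack(last_four).sum(dim=0)    # torch.Size([1, 10, 768])
concatenated = torch.cat(last_four, dim=-1)   # torch.Size([1, 10, 3072])
```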

## Authors

CamemBERT was trained and evaluated by Louis Martin\*, Benjamin Muller\*, Pedro Javier Ortiz Suárez\*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.

## Citation

If you use our work, please cite:

```bibtex
@inproceedings{martin2020camembert,
  title={CamemBERT: a Tasty French Language Model},
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year={2020}
}
```