CamemBERT 2.0: A Smarter French Language Model Aged to Perfection
Abstract
New versions of CamemBERT, CamemBERTav2 and CamemBERTv2, address temporal concept drift using DeBERTaV3 and RoBERTa architectures, respectively, outperforming their predecessors across various NLP tasks.
French language models, such as CamemBERT, have been widely adopted across industries for natural language processing (NLP) tasks, with CamemBERT alone seeing over 4 million downloads per month. However, these models face challenges due to temporal concept drift, where outdated training data leads to a decline in performance, especially when encountering new topics and terminology. This issue emphasizes the need for updated models that reflect current linguistic trends. In this paper, we introduce two new versions of the CamemBERT base model, CamemBERTav2 and CamemBERTv2, designed to address these challenges. CamemBERTav2 is based on the DeBERTaV3 architecture and makes use of the Replaced Token Detection (RTD) objective for better contextual understanding, while CamemBERTv2 is built on RoBERTa, which uses the Masked Language Modeling (MLM) objective. Both models are trained on a significantly larger and more recent dataset with longer context length and an updated tokenizer that enhances tokenization performance for French. We evaluate the performance of these models on both general-domain NLP tasks and domain-specific applications, such as medical field tasks, demonstrating their versatility and effectiveness across a range of use cases. Our results show that these updated models vastly outperform their predecessors, making them valuable tools for modern NLP systems. All our new models, as well as intermediate checkpoints, are made openly available on Hugging Face.
Community
Really cool new CamemBERT(a) models and very interesting comparisons between RoBERTa and DeBERTa architecture.
E.g. on PoS tagging they are on par, but DeBERTa is generally slower to fine-tune, so one would prefer CamemBERT here, but for NER the DeBERTa model goes off and heavily outperforms everything :)
Hey @wissamantoun, so great to see new improvements on the CamemBERT* family!!
Btw, did you use the same codebase as for training CamemBERTa (and the two-phase approach with 128 + 512 sequence length)?
Yes, it's the same codebase, although I added more features and fixes to it. I'm currently working on code cleanup and will try to push all models and fine-tunes to Hugging Face asap.
This time I also opted for a two-phase training, but with 512 then 1024. I could have done a three-phase one, but I decided to go with two-phase just for simplicity.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Tucano: Advancing Neural Text Generation for Portuguese (2024)
- Towards Linguistically-Aware and Language-Independent Tokenization for Large Language Models (LLMs) (2024)
- Exploring transfer learning for Deep NLP systems on rarely annotated languages (2024)
- KyrgyzNLP: Challenges, Progress, and Future (2024)
- LLM for Everyone: Representing the Underrepresented in Large Language Models (2024)
Hi @wissamantoun !
First of all, congratulations for these new models 🤩
I wanted to apply them to other datasets than those in the paper (I'm careful with academic datasets that may contain duplicate data or leaks).
I observed results similar to what is visible in your paper, namely a CamemBERTv2 model giving similar results to a CamemBERTv1 (overlapping confidence intervals), and a CamemBERTav2 performing better and then appearing as the new state of the art.
QA results are available here, and NER results here (PER, LOC, ORG) & here (PER, LOC, ORG, MISC). I've also tested binary classification on a cleaned-up version of allocine, which I haven't put online as I don't really see the point of offering such a model as open-source (+0.15 point gained).
In the case of QA and binary classification, CamemBERTav2 does better than CamemBERT-large.
I'm writing here because I have three questions following my reading of the paper on which I'd like to have your insight/opinion.
- In terms of the number of tokens seen, if my calculations based on Table 6 are correct, we are at 1,287,651,328,000 tokens for CamemBERTv2 (273K steps x sequence of 512 tokens x batch size of 8,192 + 17K steps x sequence of 1024 tokens x batch size of 8,192).
 CamemBERTv1 saw 419,430,400,000 (100,000 steps x sequence of 512 tokens x batch size of 8,192).
 CamemBERTv2 has therefore seen 3.07 times more data than CamemBERTv1, yet it performs no better.
 Do you have any idea why? Is the data of insufficient quality? Is this an extreme confirmation of Section 6, “Impact of corpus origin and size”, of the CamemBERTv1 paper?
- CamemBERTav2 saw 524,288,000,000 tokens (91K steps x sequence of 512 tokens x batch size of 8,192 + 17K steps x sequence of 1024 tokens x batch size of 8,192), i.e. 1.25 times more tokens than CamemBERTv1. 
 This model performs better. One possible conclusion is that the DeBERTa architecture is superior to RoBERTa since, according to the previous point, the amount of data would not be the determining factor. Would you have a checkpoint of CamemBERTav2 at a number of tokens equivalent to CamemBERTv1, so that the performance of these two models can be compared fairly?
- With what computing resources were the two models pretrained? I suspect it was run on Jean Zay, judging from the “Acknowledgements” section, but there's no mention of the type of GPUs, their number, or the duration of the training. I would like to be able to estimate the CO2 emissions.
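The token counts quoted in the first two points can be reproduced with a few lines of Python. This is only a sketch: the helper name `tokens_seen` is illustrative, and the calculation assumes every sequence is completely filled with real tokens (no padding), so it is an upper bound on the tokens actually trained on.

```python
# Tokens seen = steps x sequence length x batch size, summed over training phases.
# Assumption: every sequence is fully packed (no padding), so this is an upper bound.

def tokens_seen(phases):
    """phases: iterable of (steps, seq_len, batch_size) tuples, one per training phase."""
    return sum(steps * seq_len * batch_size for steps, seq_len, batch_size in phases)

BATCH_SIZE = 8_192

camembert_v1 = tokens_seen([(100_000, 512, BATCH_SIZE)])
camembert_v2 = tokens_seen([(273_000, 512, BATCH_SIZE), (17_000, 1024, BATCH_SIZE)])
camemberta_v2 = tokens_seen([(91_000, 512, BATCH_SIZE), (17_000, 1024, BATCH_SIZE)])

ratio_v2 = camembert_v2 / camembert_v1
ratio_av2 = camemberta_v2 / camembert_v1

print(f"CamemBERTv1:  {camembert_v1:,} tokens")
print(f"CamemBERTv2:  {camembert_v2:,} tokens ({ratio_v2:.2f}x CamemBERTv1)")
print(f"CamemBERTav2: {camemberta_v2:,} tokens ({ratio_av2:.2f}x CamemBERTv1)")
```

The step counts, sequence lengths, and batch size are the ones cited from Table 6 in the comment; only the helper function is new.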
Thanks in advance for your answers, and congratulations and thanks again for your work 🙏
Hey,
Thanks for putting your finetunes online, I'm happy to see that the models are performing well.
To answer your questions:
1- The number of tokens that I put is based on token statistics from the pretraining dataset, which ignore the padding tokens that are present when you assume you have 512 tokens per example. The dataset that we used had approx. 275B non-special tokens, which I counted after doing tokenization for training with 512 seq. length. I think I forgot to add to the paper the statistics of phase 2 of the training with the longer 1024 seq. length; I'll add this in the final version soon.
Going back to the question: yes, I think the BERT MLM loss, architecture and model size have hit a wall with data size. The three epochs were motivated by our use of a 40% masking rate, so we needed at least 3 epochs to kind of make sure that all tokens were masked and trained on. But with ELECTRA-style pre-training the model trains on all the tokens at once and is hence way more sample efficient.
2- Yes, I'm finishing the uploads of all pre-training checkpoints and their conversion into TF and PT models; they should be up over the weekend.
3- We trained the models using 16 H100s. Phase 1 for CamemBERTav2 took ~3.5 days, and Phase 2 also took ~3.5 days to do 1 epoch. CamemBERTv2 was much faster, and did three epochs in approx. the same time. This is due to some inefficiency with DeBERTa in BF16 on TF2. You can check the duration in the TensorBoard logs by looking at the wall time.
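For the CO2 question, the figures in the third answer are enough for a rough GPU-hour count. The sketch below is an illustration under stated assumptions: the function names are made up for this example, the durations are the approximate ones quoted above, and the power draw and datacenter overhead (PUE) are left as parameters because the thread does not give them.

```python
# Rough compute estimate from the figures quoted in the answer above.

def gpu_hours(num_gpus: int, days: float) -> float:
    """GPU-hours for a run that keeps `num_gpus` busy for `days` days."""
    return num_gpus * days * 24.0

def energy_kwh(hours: float, avg_gpu_power_kw: float, pue: float = 1.0) -> float:
    """Energy for `hours` GPU-hours.

    avg_gpu_power_kw and pue are NOT given in the thread; they must come from
    measurements or from the computing centre.
    """
    return hours * avg_gpu_power_kw * pue

NUM_GPUS = 16                          # "16 H100s"
CAMEMBERTA_V2_PHASE_DAYS = [3.5, 3.5]  # ~3.5 days for phase 1, ~3.5 days for phase 2

camemberta_v2_gpu_hours = sum(gpu_hours(NUM_GPUS, d) for d in CAMEMBERTA_V2_PHASE_DAYS)
print(f"CamemBERTav2: ~{camemberta_v2_gpu_hours:,.0f} H100 GPU-hours")
```

Multiplying the resulting energy by the carbon intensity of the local electricity grid would then give an emissions estimate; the wall times in the TensorBoard logs should give more precise durations than the rounded values used here.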
I hadn't paid attention to the tensorboard logs, I'll take a good look at them.
Thanks for the answers 🤗
Hi all,
not related to the great new v2 models, but I want to add a comment on this:
"Going back to the question yes, I think the BERT MLM loss, architecture and model size have hit a wall with data size."
I could also see this behavior with BERT in my experiments on larger corpora. E.g. I also trained a BERT variant of my "Journaux-LM", which is pretrained on public-domain French newspapers with a total dataset size of ~400GB. The BERT model performs worse (by 1% on different NER benchmarks) than an ELECTRA-like model, even though the BERT model has seen around 50% more subtokens during training. Pretraining data and even the tokenizer were the same for both the BERT and the ELECTRA-like model.
And if I were to pretrain a DeBERTa model, I would expect further performance gains.
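The sample-efficiency argument made in this thread (MLM only learns from masked positions, while ELECTRA-style RTD gets a training signal at every position) can be put into rough numbers. This is a simplified sketch, not the authors' analysis: it assumes masks are redrawn independently for every epoch, and the function names are illustrative.

```python
# Simplified comparison of the training signal per sequence under MLM and RTD.
# Assumption: each token is masked independently with probability `mask_rate`,
# and the mask is redrawn for every epoch.

def fraction_masked_at_least_once(mask_rate: float, epochs: int) -> float:
    """Expected share of tokens that are masked in at least one epoch."""
    return 1.0 - (1.0 - mask_rate) ** epochs

def supervised_positions(seq_len: int, mask_rate=None) -> float:
    """Expected positions contributing to the loss in one sequence.

    mask_rate=None models RTD, where every position receives a label.
    """
    return float(seq_len) if mask_rate is None else seq_len * mask_rate

coverage = fraction_masked_at_least_once(mask_rate=0.40, epochs=3)
mlm_positions = supervised_positions(512, mask_rate=0.40)
rtd_positions = supervised_positions(512)

print(f"Tokens masked at least once after 3 epochs: {coverage:.1%}")
print(f"MLM positions per 512-token sequence: {mlm_positions:.1f}")
print(f"RTD positions per 512-token sequence: {rtd_positions:.1f}")
```

Under these assumptions, roughly 78% of tokens are masked at least once over three epochs at a 40% masking rate, and an RTD model receives a signal from 2.5 times as many positions per sequence as an MLM model.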
Here are all the checkpoints for the CamemBERTv2 and CamemBERTav2 models: