Model Card for eliasjacob/ModernBERT-base-portuguese

This model and its tokenizer were pretrained entirely on Portuguese text. The training data was the cleaned Portuguese subset of the HPLT V3 dataset. I followed almost the same training recipe as the original paper, but trained for longer: 2_000_000_000_000 tokens (2T) during pretraining, 300_000_000_000 tokens (300B) during context extension, and 70_000_000_000 tokens (70B) during the learning rate decay phase. I don't have time right now to write more about the training process; contact me at elias.jacob at ufrn.br if you need details before I publish something more thorough.
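As a quick sanity check on the token budget described above, the three phases can be tallied in a few lines. This is only an illustrative sketch: the phase labels used as dictionary keys are names chosen here, not identifiers from the training code.

```python
# Token counts per training phase, as stated in this card.
# The dictionary keys are illustrative labels, not names from the training code.
phases = {
    "pretraining": 2_000_000_000_000,
    "context_extension": 300_000_000_000,
    "lr_decay": 70_000_000_000,
}

total_tokens = sum(phases.values())

# Report each phase in billions of tokens and as a share of the total.
for name, tokens in phases.items():
    print(f"{name}: {tokens / 1e9:,.0f}B tokens ({tokens / total_tokens:.1%})")
print(f"total: {total_tokens / 1e12:.2f}T tokens")
```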

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM


model_id = "eliasjacob/ModernBERT-base-portuguese"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

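device = torch.device("cpu")  # change to "cuda" if a GPU is available
model.to(device)
model.eval()  # inference mode: disables dropout

text = "O código penal brasileiro estabelece, em seu artigo [MASK], o crime de homicídio"
# Keep the inputs on the same device as the model
inputs = tokenizer(text, return_tensors="pt").to(device)
with torch.no_grad():  # gradients are not needed for inference
    outputs = model(**inputs)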

# To get predictions for the mask:
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(axis=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Output:
# Predicted token: 121
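
The example above keeps only the single highest-scoring token. If you want to see several candidates with their probabilities, a minimal sketch along the following lines may help. The helper names `top_k_mask_predictions` and `demo` are hypothetical and not part of this model or of transformers. The helper assumes it receives the logits for one sequence, and `demo` downloads the model when called.

```python
import torch


def top_k_mask_predictions(logits, input_ids, mask_token_id, k=5):
    """Return (token_id, probability) pairs for the first mask position.

    Hypothetical helper for illustration.
    logits: tensor of shape (seq_len, vocab_size) for a single sequence.
    input_ids: 1-D tensor with the token ids of the same sequence.
    """
    mask_positions = (input_ids == mask_token_id).nonzero(as_tuple=True)[0]
    if mask_positions.numel() == 0:
        raise ValueError("input contains no mask token")
    masked_index = mask_positions[0].item()
    # Turn the vocabulary logits at the mask position into probabilities.
    probs = torch.softmax(logits[masked_index], dim=-1)
    # torch.topk returns values sorted from largest to smallest by default.
    top = torch.topk(probs, k=min(k, probs.shape[-1]))
    return list(zip(top.indices.tolist(), top.values.tolist()))


def demo(model_id="eliasjacob/ModernBERT-base-portuguese", k=5):
    """Print the top-k candidates for the example sentence (downloads the model)."""
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForMaskedLM.from_pretrained(model_id).eval()
    text = "O código penal brasileiro estabelece, em seu artigo [MASK], o crime de homicídio"
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    predictions = top_k_mask_predictions(
        logits, inputs["input_ids"][0], tokenizer.mask_token_id, k
    )
    for token_id, prob in predictions:
        print(f"{tokenizer.decode([token_id])!r}: {prob:.3f}")
```

Calling `demo()` prints the candidates for the sentence used above. The helper itself works on any logits tensor, so it can be reused with batched outputs by indexing one sequence at a time.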

Model size: 0.1B params. Weights are stored as Safetensors with tensor type F32.