GRADIEND Gender-Debiased RoBERTa
This model is a gender-debiased version of roberta-large, modified using GRADIEND. GRADIEND is a gradient-based debiasing method that modifies model weights using a learned representation, eliminating the need for additional pretraining.
Model Sources
- Repository: https://github.com/aieng-lab/gradiend
- Paper: https://arxiv.org/abs/2502.01406
Uses
This model is intended for use in applications where reducing gender bias in language representations is important, such as fairness-sensitive NLP systems (e.g., hiring platforms, educational and medical tools).
Bias, Risks, and Limitations
Although the model is less gender-biased than the original roberta-large, the debiasing is not perfect:
- Residual gender bias remains.
- Biases related to other protected attributes (e.g., race, age, socioeconomic status) may still be present.
- Fairness-performance trade-offs may exist depending on the use case.
How to Get Started with the Model
Use the code below to get started with the model.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
# Load the tokenizer and the gender-debiased model
model_id = "aieng-lab/roberta-large-gradiend-gender-debiased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
# Example usage (RoBERTa's mask token is "<mask>", not "[MASK]", so take it from the tokenizer)
input_text = f"The woman worked as a {tokenizer.mask_token}."
inputs = tokenizer(input_text, return_tensors="pt")
# Inference only, so gradient tracking is disabled
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
# Get the most likely token at the mask position
mask_positions = inputs["input_ids"][0] == tokenizer.mask_token_id
predicted_token_id = torch.argmax(logits[0, mask_positions], dim=-1)
predicted_token = tokenizer.decode(predicted_token_id).strip()
print(f"Predicted token: {predicted_token}")
Example outputs for our model and comparisons with the original model's outputs can be found in Appendix F of our paper.
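To reproduce such a comparison locally, one option is to query both checkpoints with the same masked sentence and print their top predictions side by side. The following is a minimal sketch, not the procedure used in the paper: the helper names `top_predictions`, `format_comparison`, and `main` are hypothetical, and it assumes the `transformers` fill-mask pipeline and enough memory to load two roberta-large checkpoints.

```python
from typing import Dict, List


def top_predictions(model_id: str, template: str, top_k: int = 5) -> List[Dict]:
    """Return the top_k fill-mask predictions of `model_id` for `template`.

    `template` must contain the placeholder "{mask}", which is replaced
    by the tokenizer's own mask token ("<mask>" for RoBERTa).
    """
    # Imported lazily so the formatting helper below works without transformers.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_id)
    text = template.format(mask=fill_mask.tokenizer.mask_token)
    return fill_mask(text, top_k=top_k)


def format_comparison(base: List[Dict], debiased: List[Dict]) -> List[str]:
    """Pair up two prediction lists rank by rank for side-by-side printing."""
    rows = []
    for b, d in zip(base, debiased):
        rows.append(
            f"{b['token_str'].strip():<12}{b['score']:.3f}    "
            f"{d['token_str'].strip():<12}{d['score']:.3f}"
        )
    return rows


def main() -> None:
    # Downloads both checkpoints on first use.
    template = "The woman worked as a {mask}."
    base = top_predictions("FacebookAI/roberta-large", template)
    debiased = top_predictions(
        "aieng-lab/roberta-large-gradiend-gender-debiased", template
    )
    print("original            debiased")
    print("\n".join(format_comparison(base, debiased)))
```

Calling `main()` prints one row per rank, with the original model's token and probability on the left and the debiased model's on the right. A single sentence is only anecdotal evidence; the benchmarks listed under Evaluation give a more reliable picture.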
Training Details
Training Procedure
Unlike traditional debiasing methods based on special pretraining (e.g., CDA and Dropout) or post-processing (e.g., INLP, RLACE, LEACE, SelfDebias, SentenceDebias), this model was debiased using GRADIEND. GRADIEND learns a representation that is used to update the original model weights, which yields the debiased model. See Section 3 of the GRADIEND paper for the full methodology.
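The idea of updating weights from a learned representation can be illustrated with a toy example. The sketch below is an assumption-laden simplification, not the GRADIEND implementation: it assumes the learned representation is a single scalar feature `h`, that a linear decoder maps `h` to one update value per weight, and that the update is added to the weights scaled by a learning rate. All names and numbers are hypothetical; Section 3 of the paper remains the authoritative description.

```python
from typing import List


def decode(h: float, decoder_weight: List[float], decoder_bias: List[float]) -> List[float]:
    """Assumed linear decoder: one update value per model weight from scalar h."""
    return [h * w + b for w, b in zip(decoder_weight, decoder_bias)]


def apply_update(weights: List[float], update: List[float], learning_rate: float) -> List[float]:
    """Return new weights = old weights + learning_rate * update."""
    return [w + learning_rate * u for w, u in zip(weights, update)]


# Hypothetical toy values; a real model has hundreds of millions of weights.
original_weights = [0.5, -1.0, 2.0]
decoder_weight = [1.0, 0.0, -2.0]
decoder_bias = [0.0, 0.5, 0.0]

update = decode(h=0.5, decoder_weight=decoder_weight, decoder_bias=decoder_bias)
debiased_weights = apply_update(original_weights, update, learning_rate=0.5)
print(debiased_weights)  # [0.75, -0.75, 1.5]
```

The point of the sketch is that the change is a direct edit of existing weights: no further pretraining pass over a corpus is involved once the update direction is known.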
GRADIEND Training Data
Evaluation
The model has been evaluated on:
- Gender Bias Metrics: SEAT, Stereotype Score (SS) of StereoSet, and CrowS-Pairs
- Language Modeling Metrics: LMS of StereoSet and GLUE
Our evaluation compares GRADIEND to other state-of-the-art debiasing methods, including CDA, Dropout, INLP, RLACE, LEACE, SelfDebias, and SentenceDebias.
See Appendix D.2 and Table 11 of the paper for full results.
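For readers unfamiliar with the two StereoSet metrics named above, the following sketch shows how they are commonly computed. It is a simplified illustration, not the evaluation code behind the reported results: the `Example` type and all scores are hypothetical. Each StereoSet item offers a stereotypical, an anti-stereotypical, and an unrelated candidate; SS is the percentage of items where the model prefers the stereotypical over the anti-stereotypical candidate (50 is ideal), and LMS is the percentage of comparisons where a meaningful candidate beats the unrelated one (100 is ideal).

```python
from typing import List, NamedTuple


class Example(NamedTuple):
    stereotype: float       # model score for the stereotypical candidate
    anti_stereotype: float  # model score for the anti-stereotypical candidate
    unrelated: float        # model score for the unrelated candidate


def stereotype_score(examples: List[Example]) -> float:
    """Percentage of examples where the stereotypical candidate is preferred."""
    preferred = sum(e.stereotype > e.anti_stereotype for e in examples)
    return 100.0 * preferred / len(examples)


def language_modeling_score(examples: List[Example]) -> float:
    """Percentage of comparisons where a meaningful candidate beats the unrelated one."""
    meaningful = sum(
        (e.stereotype > e.unrelated) + (e.anti_stereotype > e.unrelated)
        for e in examples
    )
    return 100.0 * meaningful / (2 * len(examples))


# Hypothetical scores for four examples.
examples = [
    Example(0.6, 0.3, 0.1),
    Example(0.2, 0.5, 0.1),
    Example(0.4, 0.3, 0.5),
    Example(0.1, 0.6, 0.3),
]
print(stereotype_score(examples))         # 50.0
print(language_modeling_score(examples))  # 62.5
```

In this toy data the model shows no stereotype preference (SS of 50) but often ranks the unrelated candidate too high (LMS of 62.5), which is why bias metrics and language modeling metrics are reported together.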
Citation
If you use this model or GRADIEND in your work, please cite:
@misc{drechsel2025gradiendmonosemanticfeaturelearning,
      title={{GRADIEND}: Monosemantic Feature Learning within Neural Networks Applied to Gender Debiasing of Transformer Models}, 
      author={Jonathan Drechsel and Steffen Herbold},
      year={2025},
      eprint={2502.01406},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.01406}, 
}
Base model: FacebookAI/roberta-large