---
license: mit
language:
- en
metrics:
- f1
- accuracy
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
library_name: transformers
tags:
- hate-speech-detection
- explainability
- attention
- interpretable-ml
- AAAI2026
---

# SRA-BERT for Hate Speech Detection

A BERT model fine-tuned with **Supervised Rational Attention (SRA)** for explainable hate speech detection.

## Abstract

The opaque nature of deep learning models presents significant challenges for the ethical deployment of hate speech detection systems. To address this limitation, we introduce Supervised Rational Attention (SRA), a framework that explicitly aligns model attention with human rationales, improving both interpretability and fairness in hate speech classification. SRA integrates a supervised attention mechanism into transformer-based classifiers, optimizing a joint objective that combines the standard classification loss with an alignment loss term that minimizes the discrepancy between attention weights and human-annotated rationales. Empirically, SRA achieves **2.4× better explainability** than current baselines and produces token-level explanations that are more faithful and human-aligned.

📄 **Paper:** [*Aligning Attention with Human Rationales for Self-Explaining Hate Speech Detection*](https://arxiv.org/abs/2511.07065)  
🎯 **Accepted:** AAAI-26 AI Alignment Track  
🔗 **Demo:** [Live demo on Hugging Face Spaces](https://huggingface.co/spaces/bragee/sra-hate-speech-demo)

## Model Description

This model is a `bert-base-uncased` classifier fine-tuned on the [HateXplain](https://github.com/hate-alert/HateXplain) dataset with **Supervised Rational Attention (SRA)**, a training method that aligns the model's attention weights with human-annotated rationales.

### Key Innovation

Standard BERT attention weights do not reliably explain predictions. SRA supervises a specific attention head (Layer 8, Head 7) to attend to the same tokens that human annotators identified as evidence for their labeling decisions.

**Result:** Attention weights that actually explain the model's decisions, achieving **2.4× better alignment** with human rationales while maintaining classification performance.

## Labels

| Label | ID | Description |
|-------|-----|-------------|
| Normal | 0 | Non-hateful, non-offensive content |
| Offensive | 1 | Offensive but not hate speech |
| Hate Speech | 2 | Hate speech targeting protected groups |

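If the label mapping above was written into the uploaded model config (not verified here), it can also be read programmatically rather than hard-coded:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("bragee/sra-hate-speech-bert")
# Expected to mirror the table above, e.g. {0: "Normal", 1: "Offensive", 2: "Hate Speech"};
# if the config was not updated at training time, generic "LABEL_0"-style names may appear instead.
print(config.id2label)
```
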
## Usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model_name = "bragee/sra-hate-speech-bert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "I love spending time with my friends"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=1)
    pred = torch.argmax(probs, dim=1).item()

labels = ["Normal", "Offensive", "Hate Speech"]
print(f"Prediction: {labels[pred]} ({probs[0][pred]:.1%})")
```
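
Since the card declares `pipeline_tag: text-classification`, the high-level `pipeline` API should also work. A minimal sketch (the label names it prints depend on whether `id2label` is set in the uploaded config):

```python
from transformers import pipeline

# top_k=None returns scores for all three classes instead of only the top one.
classifier = pipeline("text-classification", model="bragee/sra-hate-speech-bert", top_k=None)
print(classifier("I love spending time with my friends"))
```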

### Extracting Attention-Based Explanations

The key feature of this model is interpretable attention. Extract attention from the supervised head:

```python
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# Extract attention from Layer 8, Head 7 (the supervised head)
attention = outputs.attentions[8][:, 7, :, :]  # (batch, seq, seq)
attention_weights = attention.mean(dim=1)      # average over query positions

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, weight in zip(tokens, attention_weights[0]):
    if token not in ["[CLS]", "[SEP]", "[PAD]"]:
        print(f"{token}: {weight:.3f}")
```
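
To surface the strongest evidence tokens rather than printing every weight, the same tensors can simply be sorted. A small convenience sketch continuing from the snippet above (not part of the original card):

```python
# Rank non-special tokens by the supervised head's attention, highest first.
special = {"[CLS]", "[SEP]", "[PAD]"}
ranked = sorted(
    (
        (token, weight.item())
        for token, weight in zip(tokens, attention_weights[0])
        if token not in special
    ),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked[:5])  # top-5 tokens the model attended to
```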

## Training Details

### Training Data

- **Dataset:** [HateXplain](https://github.com/hate-alert/HateXplain)
- **Size:** ~20,000 posts from Twitter and Gab
- **Annotations:** 3-class labels + token-level rationales from multiple annotators

### Training Procedure

- **Base model:** `bert-base-uncased`
- **Epochs:** 5
- **Batch size:** 16
- **Learning rate:** 2e-5
- **Max sequence length:** 128

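For orientation, these hyperparameters map onto a standard `transformers` setup roughly as follows. This is an illustrative sketch only: the released training code is not reproduced here, and the actual run additionally applies the SRA attention-alignment loss described in the next section.

```python
from transformers import TrainingArguments

# Hypothetical mirror of the hyperparameters listed above (output_dir is a placeholder);
# max sequence length 128 is applied at tokenization time, as in the Usage example.
training_args = TrainingArguments(
    output_dir="sra-bert-hatexplain",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)
```
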
### SRA Configuration

- **Supervised attention head:** Layer 8, Head 7
- **Attention loss weight (α):** 10.0
- **Loss function:** Cross-entropy + α × MSE(attention, rationale)

The MSE loss is only computed for offensive/hate speech examples where human rationales exist.
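
A minimal sketch of this objective (illustrative only; the function name, variable names, and the exact form of the rationale targets are assumptions, not taken from the released training code):

```python
import torch
import torch.nn.functional as F

def sra_loss(logits, labels, attentions, rationale_mask, has_rationale,
             alpha=10.0, layer=8, head=7):
    """Cross-entropy plus alpha * MSE between the supervised head's attention and rationales.

    attentions: tuple of per-layer tensors (batch, heads, seq, seq) from output_attentions=True
    rationale_mask: (batch, seq) human rationale targets, e.g. normalized token-level annotations
    has_rationale: (batch,) bool, True for offensive/hate examples that carry rationales
    """
    ce = F.cross_entropy(logits, labels)

    # Attention of the supervised head, averaged over query positions -> (batch, seq)
    attn = attentions[layer][:, head].mean(dim=1)

    if has_rationale.any():
        mse = F.mse_loss(attn[has_rationale], rationale_mask[has_rationale])
    else:
        # No rationale-bearing examples in the batch: alignment term contributes nothing.
        mse = torch.tensor(0.0, device=logits.device)

    return ce + alpha * mse
```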

## Evaluation

Evaluated on the HateXplain test set:

| Metric | Score |
|--------|-------|
| Macro F1 | 0.68 |
| Accuracy | 0.70 |
| Attention-Rationale Alignment | 2.4× baseline |
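
The classification metrics above can be reproduced with standard tooling once predictions on the HateXplain test split have been collected. A sketch with placeholder label ids (the actual data loading and inference loop are omitted):

```python
from sklearn.metrics import accuracy_score, f1_score

# Placeholder predictions; in practice, fill these with gold labels and model
# predictions (0=Normal, 1=Offensive, 2=Hate Speech) over the test split.
y_true = [0, 1, 2, 2, 0]
y_pred = [0, 1, 2, 1, 0]

print("Macro F1:", f1_score(y_true, y_pred, average="macro"))
print("Accuracy:", accuracy_score(y_true, y_pred))
```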

## Intended Use

- **Primary use:** Research on explainable AI and hate speech detection
- **Demo/educational:** Understanding how attention-based explanations work
- **Content moderation research:** Studying interpretable classifiers

### Limitations

- Trained on English Twitter/Gab data; may not generalize to other platforms or languages
- Attention explanations are post-hoc interpretations, not guaranteed causal explanations
- Should not be used as the sole arbiter for content moderation decisions

### Ethical Considerations

This model is intended for research purposes. Hate speech detection systems can produce false positives that disproportionately affect marginalized groups. Human review should always be part of any content moderation pipeline.

## Citation

```bibtex
@article{eilertsen2025sra,
  title={Aligning Attention with Human Rationales for Self-Explaining Hate Speech Detection},
  author={Eilertsen, Brage and Bjørgfinsdóttir, Røskva and Vargas, Francielle and Ramezani-Kebrya, Ali},
  journal={arXiv preprint arXiv:2511.07065},
  year={2025},
  note={Accepted at AAAI-26}
}
```

## Model Card Contact

For questions about this model, please open an issue on the [model repository](https://huggingface.co/bragee/sra-hate-speech-bert).