Commit e127b6a (verified) by KevynKrancenblum · 1 parent: b22be15

Update README.md

  1. README.md +261 -31
README.md CHANGED
@@ -3,50 +3,269 @@ base_model: dicta-il/dictalm2.0-instruct
3
  library_name: peft
4
  model_name: offensive_v5_dpo
5
  tags:
6
- - base_model:adapter:dicta-il/dictalm2.0-instruct
7
  - dpo
8
  - lora
9
  - transformers
10
  - trl
11
- licence: license
12
- pipeline_tag: text-generation
13
  ---
14
 
15
- # Model Card for offensive_v5_dpo
16
 
17
- This model is a fine-tuned version of [dicta-il/dictalm2.0-instruct](https://huggingface.co/dicta-il/dictalm2.0-instruct).
18
- It has been trained using [TRL](https://github.com/huggingface/trl).
19
 
20
- ## Quick start
21
 
22
  ```python
23
- from transformers import pipeline
24
 
25
- question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
26
- generator = pipeline("text-generation", model="None", device="cuda")
27
- output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
28
- print(output["generated_text"])
29
  ```
30
 
31
- ## Training procedure
32
 
33
- [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/kevynkrancenblum-sami-shamoon/huggingface/runs/ep1pizjj)
34
 
35
 
36
- This model was trained with DPO, a method introduced in [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://huggingface.co/papers/2305.18290).
37
 
38
- ### Framework versions
39
 
40
- - PEFT 0.17.0
41
  - TRL: 0.21.0
42
  - Transformers: 4.55.2
43
- - Pytorch: 2.6.0+cu124
44
  - Datasets: 4.0.0
45
  - Tokenizers: 0.21.4
46
 
47
- ## Citations
48
 
49
- Cite DPO as:
50
 
51
  ```bibtex
52
  @inproceedings{rafailov2023direct,
@@ -54,20 +273,31 @@ Cite DPO as:
54
  author = {Rafael Rafailov and Archit Sharma and Eric Mitchell and Christopher D. Manning and Stefano Ermon and Chelsea Finn},
55
  year = 2023,
56
  booktitle = {Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023},
57
- url = {http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html},
58
- editor = {Alice Oh and Tristan Naumann and Amir Globerson and Kate Saenko and Moritz Hardt and Sergey Levine},
59
  }
60
  ```
61
 
62
- Cite TRL as:
63
-
64
  ```bibtex
65
  @misc{vonwerra2022trl,
66
- title = {{TRL: Transformer Reinforcement Learning}},
67
- author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
68
- year = 2020,
69
- journal = {GitHub repository},
70
- publisher = {GitHub},
71
- howpublished = {\url{https://github.com/huggingface/trl}}
72
  }
73
- ```
 
 
 
 
 
 
 
 
 
 
 
 
 
3
  library_name: peft
4
  model_name: offensive_v5_dpo
5
  tags:
 
6
  - dpo
7
  - lora
8
  - transformers
9
  - trl
10
+ - hebrew
11
+ - offensive-language-detection
12
+ - content-moderation
13
+ - explainable-ai
14
+ - reasoning
15
+ license: mit
16
+ language:
17
+ - he
18
+ pipeline_tag: text-classification
19
  ---
20
 
21
+ # Hebrew Offensive Language Detection with Reasoning (offensive_v5_dpo)
22
 
23
+ This model is a fine-tuned version of [dicta-il/dictalm2.0-instruct](https://huggingface.co/dicta-il/dictalm2.0-instruct) specialized for **detecting offensive language in Hebrew text** while providing **explainable rationales** in Hebrew.
 
24
 
25
+ **Model Repository:** [KevynKrancenblum/hebrew-offensive-detection](https://huggingface.co/KevynKrancenblum/hebrew-offensive-detection)
26
+
27
+ ## What Does This Model Do?
28
+
29
+ This model performs **binary classification** of Hebrew text to determine whether it contains offensive language, and it **explains its reasoning** in Hebrew.
30
+
31
+ ### Key Capabilities
32
+
33
+ 1. **Offensive Language Detection**: Classifies Hebrew text as offensive (label: 1) or non-offensive (label: 0)
34
+ 2. **Explainable Predictions**: Generates Hebrew rationales explaining why text is classified as offensive or not
35
+ 3. **Cultural Awareness**: Fine-tuned on Hebrew-specific offensive patterns including:
36
+ - Cultural insults and slurs (קללות)
37
+ - Political and ethnic hate speech (הסתה)
38
+ - Threats and aggressive language (איומים)
39
+ - Context-dependent offensiveness in Israeli discourse
40
+
41
+ ### Performance Metrics
42
+
43
+ | Dataset | Accuracy | Precision | Recall | F1-Score |
44
+ |---------|----------|-----------|--------|----------|
45
+ | OlaH-5000 (test) | **0.85** | **0.85** | **0.85** | **0.85** |
46
+ | HeDetox (cross-domain) | **0.91** | **0.92** | **0.91** | **0.91** |
47
+
48
+ **Comparison with baselines:**
49
+ - AlephBERT (fine-tuned): 0.84 F1 (no explanations)
50
+ - heBERT (fine-tuned): 0.85 F1 (no explanations)
51
+ - GPT-5 (zero-shot): 0.77 F1 (lacks Hebrew cultural grounding)
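
For readers who want to reproduce this kind of table on their own data, the four metrics can be computed from binary labels as in the minimal sketch below. The label lists are made-up toy values, not samples from OlaH-5000 or HeDetox, and `binary_metrics` is a hypothetical helper name. The card does not say whether its precision, recall, and F1 are averaged over both classes, so this sketch reports them for the offensive class only.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy data (assumption): 1 = offensive, 0 = non-offensive
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```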
52
+
53
+ ## Quick Start
54
+
55
+ ### Installation
56
+
57
+ ```bash
58
+ pip install transformers torch peft bitsandbytes accelerate
59
+ ```
60
+
61
+ ### Basic Usage
62
 
63
  ```python
64
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
65
+ import torch
66
+
67
+ # Load model and tokenizer
68
+ model_name = "KevynKrancenblum/hebrew-offensive-detection"
69
+
70
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
71
+ model = AutoModelForCausalLM.from_pretrained(
72
+     model_name,
73
+     quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16),  # 4-bit quantization for efficiency
74
+     device_map="auto"
75
+ )
76
+
77
+ # Prepare system prompt in Hebrew
78
+ SYSTEM_PROMPT = """אתה מומחה לזיהוי תוכן פוגעני בעברית. נתח את הטקסט הבא והסבר את הנימוק שלך.
79
+ בהתבסס על הנימוק, תן תווית: 1 לפוגעני או 0 ללא פוגעני."""
80
+
81
+ # Classification function
82
+ def classify_hebrew_text(text: str) -> dict:
83
+     prompt = f"{SYSTEM_PROMPT}\n\nטקסט: \"{text}\""
84
 
85
+     messages = [{"role": "user", "content": prompt}]
86
+     input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
87
+     # The chat template already adds special tokens, so do not add them a second time
88
+     inputs = tokenizer(input_text, return_tensors="pt", add_special_tokens=False).to(model.device)
89
+
90
+     outputs = model.generate(
91
+         **inputs,
92
+         max_new_tokens=256,
93
+         temperature=0.2,
94
+         do_sample=True,
95
+         pad_token_id=tokenizer.eos_token_id
96
+     )
97
+     # Decode only the newly generated tokens; the echoed prompt itself contains a label line
98
+     response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
99
+
100
+     # Parse response
101
+     lines = response.split('\n')
102
+     label = None
103
+     reason = None
104
+
105
+     for line in lines:
106
+         if 'תווית:' in line or 'label:' in line.lower():
107
+             # Extract label (0 or 1)
108
+             if '1' in line:
109
+                 label = 1
110
+             elif '0' in line:
111
+                 label = 0
112
+         elif len(line.strip()) > 10 and label is None:
113
+             # The rationale is the longer text that precedes the label line
114
+             reason = line.strip()
115
+
116
+     return {
117
+         "label": label,  # 1 = offensive, 0 = non-offensive, None if no label was found
118
+         "reason": reason,  # Hebrew explanation
119
+         "full_response": response
120
+     }
121
+
122
+ # Example usage
123
+ text = "יא מטומטם, לך תמות"
124
+ result = classify_hebrew_text(text)
125
+
126
+ print(f"Label: {result['label']}")
127
+ print(f"Reason: {result['reason']}")
128
  ```
129
 
130
+ ### Example Output
131
+
132
+ **Input:** "יא מטומטם, לך תמות"
133
+
134
+ **Output:**
135
+ ```
136
+ Label: 1 (Offensive)
137
+ Reason: הטקסט מכיל קללה ("מטומטם") ואיום ("לך תמות"), שניהם ביטויים פוגעניים המטרתם להשפיל ולאיים.
138
+ ```
139
+
140
+ **Translation:** "The text contains an insult ('idiot') and a threat ('go die'), both offensive expressions intended to humiliate and threaten."
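
The card does not formally specify the generated output format, so parsing is best kept separate from inference and tested on its own. The sketch below is one possible parser, assuming the model writes its rationale first and then a line containing `תווית` (or `label`) followed by `0` or `1`. `parse_response` and `LABEL_PATTERN` are hypothetical names, and the sample string is hand-written rather than real model output. It should be applied to the generated continuation only, because the prompt itself contains a label instruction.

```python
import re

# Matches "תווית: 1", "label: 0", "Label 1", ... and captures the digit.
LABEL_PATTERN = re.compile(r"(?:תווית|label)\s*:?\s*([01])\b", re.IGNORECASE)


def parse_response(response: str) -> dict:
    """Split a generated answer into a binary label and a rationale string."""
    label = None
    rationale_lines = []
    for line in response.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        match = LABEL_PATTERN.search(stripped)
        if match:
            label = int(match.group(1))       # last label line wins
        else:
            rationale_lines.append(stripped)  # everything else is rationale
    reason = " ".join(rationale_lines) or None
    return {"label": label, "reason": reason}


# Hand-written sample in the assumed "rationale, then label" layout
sample = "הטקסט מכיל קללה ואיום.\nתווית: 1"
parsed = parse_response(sample)
print(parsed["label"], parsed["reason"])
```

Returning `None` for a missing label lets callers route unparseable generations to manual review instead of silently defaulting to a class.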
141
+
142
+ ## Training Methodology
143
+
144
+ ### Three-Stage Alignment Pipeline
145
+
146
+ This model was developed through a **three-stage training process** that combines teacher-student learning with preference optimization:
147
 
148
+ #### Stage 1: Teacher-Generated Reasoning Supervision
149
+ - **Teacher Model:** GPT-5 (gpt-5-preview)
150
+ - **Task:** Generate high-quality Hebrew rationales explaining offensive/non-offensive classifications
151
+ - **Dataset:** ~8,000 annotated samples from OlaH-5000
152
+ - **Output:** Structured reasoning corpus in Hebrew
153
 
154
+ #### Stage 2: Supervised Fine-Tuning (SFT)
155
+ - **Base Model:** DictaLM-2.0-Instruct (7B parameters, Mistral architecture)
156
+ - **Method:** Parameter-Efficient Fine-Tuning (PEFT) using QLoRA
157
+ - **Training Details:**
158
+ - LoRA adapters: rank=256, alpha=512
159
+ - 4-bit quantization (bitsandbytes)
160
+ - Chain-of-thought supervision (model learns to generate rationale → label)
161
+ - Training time: ~12 hours on RTX 4080 SUPER (16GB VRAM)
162
+ - **Results:** 74% F1 (improved neutrality handling)
163
 
164
+ #### Stage 3: Direct Preference Optimization (DPO)
165
+ - **Method:** Iterative DPO alignment without a separate reward model
166
+ - **Preference Pairs:**
167
+ - **Chosen:** GPT-5 teacher rationale (correct label + explanation)
168
+ - **Rejected:** GPT-5-mini rationale (incorrect label + plausible but wrong explanation)
169
+ - **Three Iterations:**
170
+ - Round 1: 80% F1 (balanced precision-recall)
171
+ - Round 2: 82% F1 (refined calibration)
172
+ - **Round 3 (this model): 85% F1** (optimal performance, stable explanations)
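
The chosen/rejected setup described above maps naturally onto the `prompt` / `chosen` / `rejected` record layout that TRL's `DPOTrainer` accepts. The sketch below shows how one such record might be assembled. The helper name `build_preference_pair`, the "rationale, then `תווית:` line" completion format, and the placeholder texts are all assumptions for illustration and are not taken from the actual training code.

```python
def build_preference_pair(prompt, teacher_rationale, gold_label,
                          weak_rationale, weak_label):
    """Return one DPO record, or None when the weaker answer is not wrong.

    A pair is only informative if the rejected completion carries a label
    that disagrees with the gold label.
    """
    if weak_label == gold_label:
        return None
    return {
        "prompt": prompt,
        "chosen": f"{teacher_rationale}\nתווית: {gold_label}",
        "rejected": f"{weak_rationale}\nתווית: {weak_label}",
    }


# Placeholder texts (assumption), kept in English for readability
pair = build_preference_pair(
    prompt="Classify the following text: <text>",
    teacher_rationale="The text contains a direct insult and a threat.",
    gold_label=1,
    weak_rationale="The text is casual banter between friends.",
    weak_label=0,
)
print(pair)
```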
173
 
174
+ ### Why DPO?
175
 
176
+ Direct Preference Optimization was chosen over traditional RLHF/PPO because:
177
+ - ✅ No separate reward model required
178
+ - ✅ Computationally efficient (trainable on consumer GPUs)
179
+ - ✅ Single-stage optimization
180
+ - ✅ Reported to match or exceed PPO-based RLHF on the tasks evaluated in the DPO paper
181
+ - ✅ More stable training dynamics
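
To make the "no separate reward model" point concrete, the sketch below evaluates the DPO objective from Rafailov et al. (2023) for a single preference pair: the loss is `-log(sigmoid(beta * margin))`, where the margin compares how much more the policy favors the chosen answer than the reference model does. The log-probabilities are invented numbers, and `beta=0.1` is only an illustrative value, since the card does not state the one used in training.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # Implicit rewards: scaled log-ratio of policy to frozen reference model
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) == softplus(-margin), written in a stable form
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: margin is 0 and the loss is log(2)
print(dpo_loss(-7.0, -7.0, -7.0, -7.0))

# Policy has shifted probability toward the chosen answer: loss drops
print(dpo_loss(-5.0, -10.0, -7.0, -7.0))
```

Only the policy and a frozen reference copy are needed to evaluate this loss, which is why the method fits on a single consumer GPU when combined with LoRA adapters.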
182
+
183
+ ### Training Configuration
184
+
185
+ **Hardware:**
186
+ - Single NVIDIA RTX 4080 SUPER (16GB VRAM)
187
+ - Total training time: ~32 hours (all stages)
188
+
189
+ **Hyperparameters:**
190
+ - Epochs: 50 (SFT), 3 (DPO iterations)
191
+ - Batch size: 2 per device, gradient accumulation: 16 (effective batch = 32)
192
+ - Learning rate: 2×10⁻⁵ (linear warmup)
193
+ - Max sequence length: 512 tokens
194
+ - Precision: bfloat16
195
+ - Optimizer: AdamW
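
The batch-size and learning-rate entries above can be checked with a few lines of arithmetic. In this sketch the effective batch size follows directly from the listed values, while the warmup length (`warmup_steps=100`) and the constant rate after warmup are assumptions, since the card only says "linear warmup". `lr_at_step` is a hypothetical helper.

```python
per_device_batch_size = 2
gradient_accumulation_steps = 16
num_devices = 1  # single GPU, as stated in the hardware section

# Samples contributing to each optimizer update
effective_batch_size = (per_device_batch_size
                        * gradient_accumulation_steps
                        * num_devices)
print(effective_batch_size)


def lr_at_step(step, peak_lr=2e-5, warmup_steps=100):
    """Linear warmup from 0 to peak_lr, then constant (assumed schedule)."""
    return peak_lr * min(1.0, step / warmup_steps)


for step in (0, 50, 100, 500):
    print(step, lr_at_step(step))
```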
196
+
197
+ **Memory Optimization:**
198
+ - QLoRA reduces weight memory from ~14GB (FP16; ~28GB in FP32) to <7GB (4-bit)
199
+ - Gradient checkpointing enabled
200
+ - LoRA adapters: ~67M trainable parameters (~0.96% of base model)
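
The adapter size and memory figures can be sanity-checked with back-of-the-envelope arithmetic. A LoRA adapter on a `d_in x d_out` linear layer adds `rank * (d_in + d_out)` parameters. The sketch below shows one configuration that reproduces the ~67M figure: rank 256 on a single 4096x4096 projection in each of 32 layers (Mistral-7B dimensions). Which modules were actually targeted is not stated in the card, so that choice is an assumption, as are the helper names and the round 7B parameter count. The memory estimate covers weights only and ignores activations, gradients, and optimizer state.

```python
def lora_trainable_params(rank, d_in, d_out, num_layers, modules_per_layer=1):
    """Parameters added by LoRA: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out) * modules_per_layer * num_layers


def weight_memory_gb(num_params, bits_per_param):
    """Weights-only memory in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


BASE_PARAMS = 7e9  # round figure for the 7B base model

adapter_params = lora_trainable_params(rank=256, d_in=4096, d_out=4096,
                                       num_layers=32)
print(adapter_params)                      # 67108864
print(f"{adapter_params / BASE_PARAMS:.2%} of base parameters")

for bits in (32, 16, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(BASE_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights take about 28 GB at 32 bits, 14 GB at 16 bits, and 3.5 GB at 4 bits, which leaves headroom on a 16 GB card for the adapters, activations, and optimizer state.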
201
+
202
+ ## Use Cases
203
+
204
+ This model is designed for:
205
+
206
+ 1. **Content Moderation**: Automated detection of offensive content in Hebrew social media, forums, and comment sections
207
+ 2. **Educational Tools**: Teaching about offensive language patterns with explainable feedback
208
+ 3. **Research**: Studying Hebrew offensive language and cultural hate speech patterns
209
+ 4. **Compliance**: Helping platforms enforce community guidelines in Hebrew
210
+
211
+ ## Datasets Used
212
+
213
+ - **OlaH-5000**: Primary training dataset for Hebrew offensive language
214
+ - **HeDetox**: Cross-domain evaluation dataset for Hebrew text detoxification
215
+
216
+ ## Limitations
217
+
218
+ - **Slang and Youth Language**: May struggle with emerging slang, metaphorical insults, or internet-specific Hebrew
219
+ - **Spelling Variations**: Performance degrades with unconventional spellings or corrupted text
220
+ - **Domain Specificity**: Optimized for social media text (Twitter/Facebook style)
221
+ - **Cultural Subjectivity**: Inherits biases from training data annotations
222
+ - **Context Length**: Limited to 512 tokens (may miss context in very long texts)
223
+
224
+ ## Ethical Considerations
225
+
226
+ ⚠️ **Important:** This model reflects cultural and contextual interpretations of offensiveness in Israeli Hebrew discourse. Classifications should be:
227
+ - Used as **decision support**, not sole determinant
228
+ - Combined with **human review** for sensitive moderation decisions
229
+ - Regularly evaluated for **bias and fairness**
230
+ - Contextualized to specific use cases and communities
231
+
232
+ ## Training Procedure
233
+
234
+ This model was trained with **Direct Preference Optimization (DPO)**, a method introduced in [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://huggingface.co/papers/2305.18290).
235
+
236
+ [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/kevynkrancenblum-sami-shamoon/huggingface/runs/ep1pizjj)
237
+
238
+ ### Framework Versions
239
+
240
+ - PEFT: 0.17.0
241
  - TRL: 0.21.0
242
  - Transformers: 4.55.2
243
+ - PyTorch: 2.6.0+cu124
244
  - Datasets: 4.0.0
245
  - Tokenizers: 0.21.4
246
+ - bitsandbytes: (4-bit quantization)
247
+
248
+ ## Repository and Resources
249
+
250
+ - **GitHub Repository:** [KevynKrancenblum/hebrew-offensive-detection](https://github.com/KevynKrancenblum/hebrew-offensive-detection)
251
+ - **Interactive Demo:** Streamlit web interface included in repository
252
+ - **Documentation:** Comprehensive README with usage examples
253
 
254
+ ## Citation
255
 
256
+ If you use this model in your research, please cite:
257
+
258
+ ```bibtex
259
+ @mastersthesis{krancenblum2025hebrew,
260
+ title={Developing Reasoning-Augmented Language Models for Hebrew Offensive Language Detection},
261
+ author={Krancenblum, Kevyn},
262
+ year={2025},
263
+ school={Sami Shamoon College of Engineering},
264
+ note={Model: https://huggingface.co/KevynKrancenblum/hebrew-offensive-detection}
265
+ }
266
+ ```
267
+
268
+ ### Cite DPO Method
269
 
270
  ```bibtex
271
  @inproceedings{rafailov2023direct,
 
273
  author = {Rafael Rafailov and Archit Sharma and Eric Mitchell and Christopher D. Manning and Stefano Ermon and Chelsea Finn},
274
  year = 2023,
275
  booktitle = {Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023},
276
+ url = {http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html}
 
277
  }
278
  ```
279
 
280
+ ### Cite TRL Framework
281
+
282
  ```bibtex
283
  @misc{vonwerra2022trl,
284
+ title = {{TRL: Transformer Reinforcement Learning}},
285
+ author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
286
+ year = 2020,
287
+ journal = {GitHub repository},
288
+ publisher = {GitHub},
289
+ howpublished = {\url{https://github.com/huggingface/trl}}
290
  }
291
+ ```
292
+
293
+ ## License
294
+
295
+ MIT License - See LICENSE file for details
296
+
297
+ ## Acknowledgments
298
+
299
+ - **Dicta Research Center** for DictaLM-2.0-Instruct base model
300
+ - **OpenAI** for GPT-5 teacher supervision
301
+ - **Hugging Face** for model hosting and transformers library
302
+ - **OlaH-5000** and **HeDetox** dataset creators
303
+ - **TRL Team** for Direct Preference Optimization implementation