KevynKrancenblum
/

hebrew-offensive-detection

@@ -3,50 +3,269 @@ base_model: dicta-il/dictalm2.0-instruct
 library_name: peft
 model_name: offensive_v5_dpo
 tags:
-- base_model:adapter:dicta-il/dictalm2.0-instruct
 - dpo
 - lora
 - transformers
 - trl
-licence: license
-pipeline_tag: text-generation
 ---
-# Model Card for offensive_v5_dpo
-This model is a fine-tuned version of [dicta-il/dictalm2.0-instruct](https://huggingface.co/dicta-il/dictalm2.0-instruct).
-It has been trained using [TRL](https://github.com/huggingface/trl).
-## Quick start
 ```python
-from transformers import pipeline
-question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
-generator = pipeline("text-generation", model="None", device="cuda")
-output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
-print(output["generated_text"])
 ```
-## Training procedure
-[<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/kevynkrancenblum-sami-shamoon/huggingface/runs/ep1pizjj)
-This model was trained with DPO, a method introduced in [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://huggingface.co/papers/2305.18290).
-### Framework versions
-- PEFT 0.17.0
 - TRL: 0.21.0
 - Transformers: 4.55.2
-- Pytorch: 2.6.0+cu124
 - Datasets: 4.0.0
 - Tokenizers: 0.21.4
-## Citations
-Cite DPO as:
 ```bibtex
 @inproceedings{rafailov2023direct,
@@ -54,20 +273,31 @@ Cite DPO as:
     author       = {Rafael Rafailov and Archit Sharma and Eric Mitchell and Christopher D. Manning and Stefano Ermon and Chelsea Finn},
     year         = 2023,
     booktitle    = {Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023},
-    url          = {http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html},
-    editor       = {Alice Oh and Tristan Naumann and Amir Globerson and Kate Saenko and Moritz Hardt and Sergey Levine},
 }
 ```
-Cite TRL as:
 ```bibtex
 @misc{vonwerra2022trl,
-	title        = {{TRL: Transformer Reinforcement Learning}},
-	author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
-	year         = 2020,
-	journal      = {GitHub repository},
-	publisher    = {GitHub},
-	howpublished = {\url{https://github.com/huggingface/trl}}
 }
-```

 library_name: peft
 model_name: offensive_v5_dpo
 tags:
 - dpo
 - lora
 - transformers
 - trl
+- hebrew
+- offensive-language-detection
+- content-moderation
+- explainable-ai
+- reasoning
+license: mit
+language:
+- he
+pipeline_tag: text-classification
 ---
+# Hebrew Offensive Language Detection with Reasoning (offensive_v5_dpo)
+This model is a fine-tuned version of [dicta-il/dictalm2.0-instruct](https://huggingface.co/dicta-il/dictalm2.0-instruct) specialized for **detecting offensive language in Hebrew text** while providing **explainable rationales** in Hebrew.
+**Model Repository:** [KevynKrancenblum/hebrew-offensive-detection](https://huggingface.co/KevynKrancenblum/hebrew-offensive-detection)
+## What Does This Model Do?
+This model performs **binary classification** of Hebrew text, deciding whether it contains offensive language, and **explains its reasoning** in Hebrew.
+### Key Capabilities
+1. **Offensive Language Detection**: Classifies Hebrew text as offensive (label: 1) or non-offensive (label: 0)
+2. **Explainable Predictions**: Generates Hebrew rationales explaining why text is classified as offensive or not
+3. **Cultural Awareness**: Fine-tuned on Hebrew-specific offensive patterns including:
+   - Cultural insults and slurs (קללות)
+   - Political and ethnic hate speech (הסתה)
+   - Threats and aggressive language (איומים)
+   - Context-dependent offensiveness in Israeli discourse
+### Performance Metrics
+| Dataset | Accuracy | Precision | Recall | F1-Score |
+|---------|----------|-----------|--------|----------|
+| OlaH-5000 (test) | **0.85** | **0.85** | **0.85** | **0.85** |
+| HeDetox (cross-domain) | **0.91** | **0.92** | **0.91** | **0.91** |
+**Comparison with baselines:**
+- AlephBERT (fine-tuned): 0.84 F1 (no explanations)
+- heBERT (fine-tuned): 0.85 F1 (no explanations)
+- GPT-5 (zero-shot): 0.77 F1 (lacks Hebrew cultural grounding)
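The scores above follow the usual binary-classification definitions. The sketch below computes them for the offensive class (label 1) on a toy example. The labels are made up for illustration, and the card does not say whether its reported scores are macro- or weighted-averaged, so this shows only the per-class formulas.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the offensive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical gold labels and predictions, not taken from OlaH-5000 or HeDetox
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
```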
+## Quick Start
+### Installation
+```bash
+pip install transformers torch peft bitsandbytes accelerate
+```
+### Basic Usage
 ```python
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+import torch
+# Load model and tokenizer
+model_name = "KevynKrancenblum/hebrew-offensive-detection"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    # Use 4-bit quantization for efficiency (requires bitsandbytes)
+    quantization_config=BitsAndBytesConfig(
+        load_in_4bit=True,
+        bnb_4bit_compute_dtype=torch.bfloat16,
+    ),
+    device_map="auto"
+)
+# Prepare system prompt in Hebrew
+SYSTEM_PROMPT = """אתה מומחה לזיהוי תוכן פוגעני בעברית. נתח את הטקסט הבא והסבר את הנימוק שלך.
+בהתבסס על הנימוק, תן תווית: 1 לפוגעני או 0 ללא פוגעני."""
+# Classification function
+def classify_hebrew_text(text: str) -> dict:
+    prompt = f"{SYSTEM_PROMPT}\n\nטקסט: \"{text}\""
+    messages = [{"role": "user", "content": prompt}]
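+    # add_generation_prompt appends the assistant turn marker if the template defines one
+    input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)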
+    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
+    outputs = model.generate(
+        **inputs,
+        max_new_tokens=256,
+        temperature=0.2,
+        do_sample=True,
+        pad_token_id=tokenizer.eos_token_id
+    )
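+    # Decode only the newly generated tokens: the prompt itself contains "תווית:"
+    # and would otherwise be picked up by the parser below
+    prompt_len = inputs["input_ids"].shape[1]
+    response = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)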
+    # Parse response
+    lines = response.split('\n')
+    label = None
+    reason = None
+    for line in lines:
+        if 'תווית:' in line or 'label:' in line.lower():
+            # Extract the first 0/1 digit on the label line
+            digits = [ch for ch in line if ch in '01']
+            if digits:
+                label = int(digits[0])
+        elif len(line.strip()) > 10 and label is None:
+            # The rationale precedes the label (rationale → label), so keep
+            # the last long line seen before the label appears
+            reason = line.strip()
+    return {
+        "label": label,  # 1 = offensive, 0 = non-offensive
+        "reason": reason,  # Hebrew explanation
+        "full_response": response
+    }
+# Example usage
+text = "יא מטומטם, לך תמות"
+result = classify_hebrew_text(text)
+print(f"Label: {result['label']}")
+print(f"Reason: {result['reason']}")
 ```
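The parsing step can be exercised without loading the model. The sketch below assumes the response contains the rationale first, followed by a marker such as `תווית:` or `label:` and then `0` or `1`. That format is inferred from the system prompt, not confirmed by the card, and `parse_response` is a hypothetical helper name.

```python
import re

# Matches "תווית: 1", "label: 0", etc. The exact output format is an assumption.
LABEL_RE = re.compile(r"(?:תווית|label)\s*:\s*([01])", re.IGNORECASE)

def parse_response(response: str) -> dict:
    """Split a generated response into a rationale and a 0/1 label."""
    match = LABEL_RE.search(response)
    if match is None:
        # No label marker found: return the whole text as the rationale
        return {"label": None, "reason": response.strip() or None}
    label = int(match.group(1))
    # Everything before the label marker is treated as the rationale
    reason = response[:match.start()].strip() or None
    return {"label": label, "reason": reason}

# Hypothetical model output used only to demonstrate the parser
sample = "הטקסט מכיל קללה ואיום.\nתווית: 1"
print(parse_response(sample))
```

Returning `label=None` when no marker is found lets the caller route unparseable responses to human review instead of silently defaulting to a class.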
+### Example Output
+**Input:** "יא מטומטם, לך תמות"
+**Output:**
+```
+Label: 1
+Reason: הטקסט מכיל קללה ("מטומטם") ואיום ("לך תמות"), שניהם ביטויים פוגעניים המטרתם להשפיל ולאיים.
+```
+**Translation:** "The text contains an insult ('idiot') and a threat ('go die'), both offensive expressions intended to humiliate and threaten."
+## Training Methodology
+### Three-Stage Alignment Pipeline
+This model was developed through a **three-stage training process** that combines teacher-student learning with preference optimization:
+#### Stage 1: Teacher-Generated Reasoning Supervision
+- **Teacher Model:** GPT-5 (gpt-5-preview)
+- **Task:** Generate high-quality Hebrew rationales explaining offensive/non-offensive classifications
+- **Dataset:** ~8,000 annotated samples from OlaH-5000
+- **Output:** Structured reasoning corpus in Hebrew
+#### Stage 2: Supervised Fine-Tuning (SFT)
+- **Base Model:** DictaLM-2.0-Instruct (7B parameters, Mistral architecture)
+- **Method:** Parameter-Efficient Fine-Tuning (PEFT) using QLoRA
+- **Training Details:**
+  - LoRA adapters: rank=256, alpha=512
+  - 4-bit quantization (bitsandbytes)
+  - Chain-of-thought supervision (model learns to generate rationale → label)
+  - Training time: ~12 hours on RTX 4080 SUPER (16GB VRAM)
+- **Results:** 74% F1 (improved neutrality handling)
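The Stage 2 settings map onto a `peft`/`transformers` configuration roughly as follows. Only `r=256`, `lora_alpha=512`, 4-bit loading, and bfloat16 come from this card. The quantization type, dropout, and target modules are assumptions chosen for illustration, not the author's confirmed settings.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit loading and bfloat16 are stated in the card; NF4 is an assumption
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=256,            # from the card
    lora_alpha=512,   # from the card
    lora_dropout=0.05,           # assumed
    # Assumed: adapting one 4096x4096 projection per layer is one choice that
    # reproduces the ~67M trainable parameters reported under Memory Optimization
    target_modules=["q_proj"],
    task_type="CAUSAL_LM",
)
```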
+#### Stage 3: Direct Preference Optimization (DPO)
+- **Method:** Iterative DPO alignment without a reward model
+- **Preference Pairs:**
+  - **Chosen:** GPT-5 teacher rationale (correct label + explanation)
+  - **Rejected:** GPT-5-mini rationale (incorrect label + plausible but wrong explanation)
+- **Three Iterations:**
+  - Round 1: 80% F1 (balanced precision-recall)
+  - Round 2: 82% F1 (refined calibration)
+  - **Round 3 (this model): 85% F1** (optimal performance, stable explanations)
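For reference, the DPO objective for a single preference pair fits in a few lines. This is a sketch of the loss from the DPO paper, not the training code used for this model. The `beta` value, the log-probabilities, and the example pair are illustrative. In practice TRL's `DPOTrainer` consumes records with `prompt`, `chosen`, and `rejected` fields and computes the sequence log-probabilities itself.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair: -log sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)]).

    Arguments are sequence log-probabilities under the policy (pi_*) and the
    frozen reference model (ref_*).
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) = softplus(-x), written in a numerically stable form
    return math.log1p(math.exp(-abs(x))) + max(-x, 0.0)

# Hypothetical preference record in the prompt/chosen/rejected layout
pair = {
    "prompt": 'טקסט: "יא מטומטם, לך תמות"',
    "chosen": "הטקסט מכיל קללה ואיום.\nתווית: 1",    # correct label with rationale
    "rejected": "הטקסט הוא ביטוי ידידותי.\nתווית: 0",  # wrong label, plausible rationale
}

# Illustrative log-probabilities: the policy prefers the chosen response
# more strongly than the reference does, so the loss falls below log(2)
print(dpo_loss(pi_chosen=-5.0, pi_rejected=-15.0, ref_chosen=-10.0, ref_rejected=-10.0))
```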
+### Why DPO?
+Direct Preference Optimization was chosen over traditional RLHF/PPO because:
+- ✅ No separate reward model required
+- ✅ Computationally efficient (trainable on consumer GPUs)
+- ✅ Single-stage optimization
+- ✅ Performance reported in the DPO paper as comparable or superior to PPO-based RLHF
+- ✅ More stable training dynamics
+### Training Configuration
+**Hardware:**
+- Single NVIDIA RTX 4080 SUPER (16GB VRAM)
+- Total training time: ~32 hours (all stages)
+**Hyperparameters:**
+- Epochs: 50 (SFT), 3 (DPO iterations)
+- Batch size: 2 per device, gradient accumulation: 16 (effective batch = 32)
+- Learning rate: 2×10⁻⁵ (linear warmup)
+- Max sequence length: 512 tokens
+- Precision: bfloat16
+- Optimizer: AdamW
+**Memory Optimization:**
+- QLoRA reduces base-model weight memory from ~28GB (FP32; ~14GB in FP16) to <7GB (4-bit)
+- Gradient checkpointing enabled
+- LoRA adapters: ~67M trainable parameters (~0.96% of base model)
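The figures in this section can be sanity-checked with simple arithmetic. The sketch below takes the batch size, accumulation steps, LoRA rank, and 7B base size from this card. The layer dimensions are those of Mistral-7B, and the choice of a single adapted 4096x4096 projection per layer is an assumption: it is one configuration that reproduces the ~67M figure, not a confirmed detail of the training run.

```python
# Figures stated in this card
per_device_batch = 2
grad_accum_steps = 16
lora_rank = 256
base_params = 7e9  # "7B parameters", rounded

# Assumed for illustration: Mistral-7B dimensions and one adapted
# 4096x4096 projection per layer; the card does not list target modules
hidden_size = 4096
num_layers = 32

effective_batch = per_device_batch * grad_accum_steps
# A LoRA adapter on a d_in x d_out linear layer adds rank * (d_in + d_out) weights
lora_params = lora_rank * (hidden_size + hidden_size) * num_layers
trainable_fraction = lora_params / base_params
# 4-bit weights take half a byte per parameter (excluding quantization overhead)
weights_4bit_gb = base_params * 0.5 / 1e9

print(effective_batch)               # 32
print(lora_params)                   # 67108864
print(f"{trainable_fraction:.2%}")   # 0.96%
print(f"{weights_4bit_gb:.1f} GB")   # 3.5 GB
```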
+## Use Cases
+This model is designed for:
+1. **Content Moderation**: Automated detection of offensive content in Hebrew social media, forums, and comment sections
+2. **Educational Tools**: Teaching about offensive language patterns with explainable feedback
+3. **Research**: Studying Hebrew offensive language and cultural hate speech patterns
+4. **Compliance**: Helping platforms enforce community guidelines in Hebrew
+## Datasets Used
+- **OlaH-5000**: Primary training dataset for Hebrew offensive language
+- **HeDetox**: Cross-domain evaluation dataset for Hebrew text detoxification
+## Limitations
+- **Slang and Youth Language**: May struggle with emerging slang, metaphorical insults, or internet-specific Hebrew
+- **Spelling Variations**: Performance degrades with unconventional spellings or corrupted text
+- **Domain Specificity**: Optimized for social media text (Twitter/Facebook style)
+- **Cultural Subjectivity**: Inherits biases from training data annotations
+- **Context Length**: Limited to 512 tokens (may miss context in very long texts)
+## Ethical Considerations
+⚠️ **Important:** This model reflects cultural and contextual interpretations of offensiveness in Israeli Hebrew discourse. Classifications should be:
+- Used as **decision support**, not sole determinant
+- Combined with **human review** for sensitive moderation decisions
+- Regularly evaluated for **bias and fairness**
+- Contextualized to specific use cases and communities
+## Training Procedure
+This model was trained with **Direct Preference Optimization (DPO)**, a method introduced in [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://huggingface.co/papers/2305.18290).
+[<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/kevynkrancenblum-sami-shamoon/huggingface/runs/ep1pizjj)
+### Framework Versions
+- PEFT: 0.17.0
 - TRL: 0.21.0
 - Transformers: 4.55.2
+- PyTorch: 2.6.0+cu124
 - Datasets: 4.0.0
 - Tokenizers: 0.21.4
+- bitsandbytes: (4-bit quantization)
+## Repository and Resources
+- **GitHub Repository:** [KevynKrancenblum/hebrew-offensive-detection](https://github.com/KevynKrancenblum/hebrew-offensive-detection)
+- **Interactive Demo:** Streamlit web interface included in repository
+- **Documentation:** Comprehensive README with usage examples
+## Citation
+If you use this model in your research, please cite:
+```bibtex
+@mastersthesis{krancenblum2025hebrew,
+  title={Developing Reasoning-Augmented Language Models for Hebrew Offensive Language Detection},
+  author={Krancenblum, Kevyn},
+  year={2025},
+  school={Sami Shamoon College of Engineering},
+  note={Model: https://huggingface.co/KevynKrancenblum/hebrew-offensive-detection}
+}
+```
+### Cite DPO Method
 ```bibtex
 @inproceedings{rafailov2023direct,
     author       = {Rafael Rafailov and Archit Sharma and Eric Mitchell and Christopher D. Manning and Stefano Ermon and Chelsea Finn},
     year         = 2023,
     booktitle    = {Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023},
+    url          = {http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html}
 }
 ```
+### Cite TRL Framework
 ```bibtex
 @misc{vonwerra2022trl,
+    title        = {{TRL: Transformer Reinforcement Learning}},
+    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
+    year         = 2020,
+    journal      = {GitHub repository},
+    publisher    = {GitHub},
+    howpublished = {\url{https://github.com/huggingface/trl}}
 }
+```
+## License
+MIT License - See LICENSE file for details
+## Acknowledgments
+- **Dicta Research Center** for DictaLM-2.0-Instruct base model
+- **OpenAI** for GPT-5 teacher supervision
+- **Hugging Face** for model hosting and transformers library
+- **OlaH-5000** and **HeDetox** dataset creators
+- **TRL Team** for Direct Preference Optimization implementation