Hebrew Offensive Language Detection with Reasoning (offensive_v5_dpo)
This model is a fine-tuned version of dicta-il/dictalm2.0-instruct specialized for detecting offensive language in Hebrew text while providing explainable rationales in Hebrew.
Model Repository: KevynKrancenblum/hebrew-offensive-detection
What Does This Model Do?
This model performs binary classification of Hebrew text to determine whether it contains offensive language, and it explains its reasoning in Hebrew. It addresses key challenges in Hebrew NLP.
Key Capabilities
- Offensive Language Detection: Classifies Hebrew text as offensive (label: 1) or non-offensive (label: 0)
- Explainable Predictions: Generates Hebrew rationales explaining why text is classified as offensive or not
- Cultural Awareness: Fine-tuned on Hebrew-specific offensive patterns, including:
  - Cultural insults and slurs (קללות)
  - Political and ethnic hate speech (הסתה)
  - Threats and aggressive language (איומים)
  - Context-dependent offensiveness in Israeli discourse
Performance Metrics
| Dataset | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| OlaH-5000 (test) | 0.85 | 0.85 | 0.85 | 0.85 |
| HeDetox (cross-domain) | 0.91 | 0.92 | 0.91 | 0.91 |
Comparison with baselines (a minimal evaluation sketch follows this list):
- AlephBERT (fine-tuned): 0.84 F1 (no explanations)
- heBERT (fine-tuned): 0.85 F1 (no explanations)
- GPT-5 (zero-shot): 0.77 F1 (lacks Hebrew cultural grounding)
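For orientation, here is a minimal sketch of how metrics like these could be computed with scikit-learn, assuming a labeled test split with `text` and `label` columns and the `classify_hebrew_text` helper defined in the Quick Start section below; the file path, column names, and weighted averaging are illustrative assumptions, not the official evaluation script.

```python
# Hedged evaluation sketch; assumes classify_hebrew_text() from the Quick Start
# section and a CSV with "text" and "label" columns (names are assumptions).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import pandas as pd

def evaluate(df: pd.DataFrame) -> dict:
    preds = []
    for text in df["text"]:
        pred = classify_hebrew_text(text)["label"]
        preds.append(pred if pred is not None else 0)  # fall back to non-offensive if parsing fails
    accuracy = accuracy_score(df["label"], preds)
    precision, recall, f1, _ = precision_recall_fscore_support(
        df["label"], preds, average="weighted"  # assumption: averaging mode not stated in the card
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# test_df = pd.read_csv("olah5000_test.csv")  # hypothetical path to a labeled test split
# print(evaluate(test_df))
```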
Quick Start
Installation
```bash
pip install transformers torch peft bitsandbytes accelerate
```
Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "KevynKrancenblum/hebrew-offensive-detection"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,  # Use 4-bit quantization for efficiency
    device_map="auto"
)

# Prepare the system prompt in Hebrew (English gloss: "You are a model for detecting
# offensive content in Hebrew. Analyze the text and explain your reasoning.
# Based on the reasoning, give a label: 1 for offensive or 0 for non-offensive.")
SYSTEM_PROMPT = """אתה מודל לזיהוי תוכן פוגעני בעברית. נתח את הטקסט והסבר את הנימוק שלך.
בהתבסס על הנימוק, תן תווית: 1 לפוגעני או 0 ללא פוגעני."""
# Classification function
def classify_hebrew_text(text: str) -> dict:
    prompt = f"{SYSTEM_PROMPT}\n\nטקסט: \"{text}\""
    messages = [{"role": "user", "content": prompt}]
    input_text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.2,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
    # Decode only the newly generated tokens, not the echoed prompt
    response = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    )
    # Parse the response into a label and a Hebrew rationale
    lines = response.split('\n')
    label = None
    reason = None
    for line in lines:
        if 'תווית:' in line or 'label:' in line.lower():
            # Extract the label digit (1 = offensive, 0 = non-offensive)
            if '1' in line:
                label = 1
            elif '0' in line:
                label = 0
        elif reason is None and len(line.strip()) > 10:
            # The model generates its rationale before the label (chain-of-thought),
            # so take the first sufficiently long line as the explanation
            reason = line.strip()
    return {
        "label": label,    # 1 = offensive, 0 = non-offensive
        "reason": reason,  # Hebrew explanation
        "full_response": response
    }

# Example usage
text = "יא אידיוט, לך תמות"  # English: "You idiot, go die"
result = classify_hebrew_text(text)
print(f"Label: {result['label']}")
print(f"Reason: {result['reason']}")
```
Example Output
Input: "יא אידיוט, לך תמות"
Output:
Label: 1 (Offensive)
Reason: הטקסט מכיל קללה ("אידיוט") ואיום ("לך תמות"), שניהם ביטויים פוגעניים שמטרתם להשפיל ולאיים.
Translation: "The text contains an insult ('idiot') and a threat ('go die'), both offensive expressions intended to humiliate and threaten."
Training Methodology
Three-Stage Alignment Pipeline
This model was developed through a three-stage training process that combines teacher-student supervision with preference optimization:
Stage 1: Teacher-Generated Reasoning Supervision
- Teacher Model: GPT-5 (gpt-5-preview)
- Task: Generate high-quality Hebrew rationales explaining offensive/non-offensive classifications
- Dataset: ~8,000 annotated samples from OlaH-5000
- Output: Structured reasoning corpus in Hebrew
Stage 2: Supervised Fine-Tuning (SFT)
- Base Model: DictaLM-2.0-Instruct (7B parameters, Mistral architecture)
- Method: Parameter-Efficient Fine-Tuning (PEFT) using QLoRA
- Training Details (a hedged configuration sketch follows this stage):
  - LoRA adapters: rank=256, alpha=512
  - 4-bit quantization (bitsandbytes)
  - Chain-of-thought supervision (the model learns to generate rationale → label)
  - Training time: ~12 hours on an RTX 4080 SUPER (16GB VRAM)
- Results: 74% F1 (improved neutrality handling)
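As a rough illustration, the QLoRA setup described above might look like the following with peft and bitsandbytes; the quantization type, dropout, and target modules are assumptions, while the rank, alpha, base model, and 4-bit/bfloat16 settings come from this card.

```python
# Hedged sketch of the Stage 2 QLoRA setup (not the exact training script).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit quantization via bitsandbytes
    bnb_4bit_quant_type="nf4",              # assumption: quantization type not stated
    bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 compute, as listed below
)

base_model = AutoModelForCausalLM.from_pretrained(
    "dicta-il/dictalm2.0-instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=256,                 # LoRA rank from the training details
    lora_alpha=512,        # LoRA alpha from the training details
    lora_dropout=0.05,     # assumption: dropout not stated
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # the card reports ~67M trainable parameters (0.96%)
```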
Stage 3: Direct Preference Optimization (DPO)
- Method: Iterative DPO alignment without a separate reward model (a minimal training sketch follows this stage)
- Preference Pairs:
  - Chosen: GPT-5 teacher rationale (correct label + explanation)
  - Rejected: GPT-5-mini rationale (incorrect label + plausible but wrong explanation)
- Three Iterations:
  - Round 1: 80% F1 (balanced precision-recall)
  - Round 2: 82% F1 (refined calibration)
  - Round 3 (this model): 85% F1 (best performance, stable explanations)
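A minimal sketch of what one DPO round might look like with TRL, assuming the Stage 2 SFT model, tokenizer, and the SYSTEM_PROMPT from the Quick Start are already loaded; the dataset contents and beta value are illustrative, not the exact training recipe.

```python
# Hedged sketch of one DPO round; dataset contents and beta are illustrative.
from datasets import Dataset
from trl import DPOConfig, DPOTrainer

# Each preference pair: the GPT-5 teacher rationale is "chosen",
# the GPT-5-mini rationale with the wrong label is "rejected".
preference_data = Dataset.from_list([
    {
        "prompt": SYSTEM_PROMPT + "\n\nטקסט: \"...\"",
        "chosen": "<Hebrew rationale>\nתווית: 1",              # correct label + explanation
        "rejected": "<plausible but wrong rationale>\nתווית: 0",  # incorrect label
    },
    # ... one entry per preference pair
])

dpo_args = DPOConfig(
    output_dir="dpo_round_1",          # hypothetical output directory
    per_device_train_batch_size=2,     # batch size per device, as listed below
    gradient_accumulation_steps=16,    # effective batch size 32
    learning_rate=2e-5,
    bf16=True,
    beta=0.1,                          # assumption: DPO beta not stated in the card
)

trainer = DPOTrainer(
    model=model,                   # the Stage 2 SFT model with LoRA adapters attached
    args=dpo_args,
    train_dataset=preference_data,
    processing_class=tokenizer,
)
trainer.train()
```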
Why DPO?
Direct Preference Optimization was chosen over traditional RLHF/PPO because:
- ✅ No separate reward model required
- ✅ Computationally efficient (trainable on consumer GPUs)
- ✅ Single-stage optimization
- ✅ Comparable or superior performance to full RLHF
- ✅ More stable training dynamics
Training Configuration
Hardware:
- Single NVIDIA RTX 4080 SUPER (16GB VRAM)
- Total training time: ~32 hours (all stages)
Hyperparameters (a training-arguments sketch follows this section):
- Epochs: 50 (SFT), 3 (DPO iterations)
- Batch size: 2 per device, gradient accumulation: 16 (effective batch = 32)
- Learning rate: 2×10⁻⁵ (linear warmup)
- Max sequence length: 512 tokens
- Precision: bfloat16
- Optimizer: AdamW
Memory Optimization:
- QLoRA reduces memory from ~28GB (FP16) to <7GB (4-bit)
- Gradient checkpointing enabled
- LoRA adapters: 67M trainable parameters (0.96% of base model)
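Putting the hyperparameters above into code, the SFT stage's training arguments might be configured roughly as follows with TRL; the warmup ratio and output directory are assumptions, while the rest mirrors the values listed in this section.

```python
# Hedged sketch of the SFT training arguments (Stage 2); not the exact recipe.
from trl import SFTConfig, SFTTrainer

sft_args = SFTConfig(
    output_dir="sft_offensive_v5",     # hypothetical output directory
    num_train_epochs=50,               # 50 SFT epochs
    per_device_train_batch_size=2,     # batch size 2 per device
    gradient_accumulation_steps=16,    # effective batch size 32
    learning_rate=2e-5,                # 2x10^-5
    lr_scheduler_type="linear",        # linear warmup / decay
    warmup_ratio=0.03,                 # assumption: exact warmup not stated
    max_length=512,                    # max sequence length of 512 tokens
    bf16=True,                         # bfloat16 precision
    optim="adamw_torch",               # AdamW optimizer
    gradient_checkpointing=True,       # enabled to save memory
)

trainer = SFTTrainer(
    model=model,                 # quantized base model with LoRA adapters (see the Stage 2 sketch)
    args=sft_args,
    train_dataset=sft_dataset,   # hypothetical dataset of rationale + label targets from Stage 1
    processing_class=tokenizer,
)
trainer.train()
```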
Use Cases
This model is designed for:
- Content Moderation: Automated detection of offensive content in Hebrew social media, forums, and comment sections
- Educational Tools: Teaching about offensive language patterns with explainable feedback
- Research: Studying Hebrew offensive language and cultural hate speech patterns
- Compliance: Helping platforms enforce community guidelines in Hebrew
Datasets Used
- OlaH-5000: Primary training dataset for Hebrew offensive language
- HeDetox: Cross-domain evaluation dataset for Hebrew text detoxification
Limitations
- Slang and Youth Language: May struggle with emerging slang, metaphorical insults, or internet-specific Hebrew
- Spelling Variations: Performance degrades with unconventional spellings or corrupted text
- Domain Specificity: Optimized for social media text (Twitter/Facebook style)
- Cultural Subjectivity: Inherits biases from training data annotations
- Context Length: Limited to 512 tokens, so very long texts may lose context (a simple truncation guard is sketched below)
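Because the training context is 512 tokens, a simple guard is to truncate long inputs before classification; the helper below is a suggestion, not part of the model's API, and the token budget is an assumption that leaves headroom for the system prompt.

```python
# Hedged helper: truncate long inputs so prompt + text stay within the
# 512-token training context; the 400-token budget is an assumption.
def truncate_to_context(text: str, max_tokens: int = 400) -> str:
    ids = tokenizer(text, truncation=True, max_length=max_tokens)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)

# result = classify_hebrew_text(truncate_to_context(very_long_post))
```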
Ethical Considerations
⚠️ Important: This model reflects cultural and contextual interpretations of offensiveness in Israeli Hebrew discourse. Classifications should be:
- Used as decision support, not sole determinant
- Combined with human review for sensitive moderation decisions
- Regularly evaluated for bias and fairness
- Contextualized to specific use cases and communities
Training Procedure
This model was trained with Direct Preference Optimization (DPO), a method introduced in Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., 2023).
Framework Versions
- PEFT: 0.17.0
- TRL: 0.21.0
- Transformers: 4.55.2
- PyTorch: 2.6.0+cu124
- Datasets: 4.0.0
- Tokenizers: 0.21.4
- bitsandbytes (used for 4-bit quantization)
Repository and Resources
- GitHub Repository: KevynKrancenblum/hebrew-offensive-detection
- Interactive Demo: Streamlit web interface included in repository
- Documentation: Comprehensive README with usage examples
Citation
If you use this model in your research, please cite:
@mastersthesis{krancenblum2025hebrew,
title={Developing Reasoning-Augmented Language Models for Hebrew Offensive Language Detection},
author={Krancenblum, Kevyn},
year={2025},
school={Sami Shamoon College of Engineering},
note={Model: https://huggingface.co/KevynKrancenblum/hebrew-offensive-detection}
}
Cite DPO Method
@inproceedings{rafailov2023direct,
title = {{Direct Preference Optimization: Your Language Model is Secretly a Reward Model}},
author = {Rafael Rafailov and Archit Sharma and Eric Mitchell and Christopher D. Manning and Stefano Ermon and Chelsea Finn},
year = 2023,
booktitle = {Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023},
url = {http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html}
}
Cite TRL Framework
@misc{vonwerra2022trl,
title = {{TRL: Transformer Reinforcement Learning}},
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
year = 2020,
journal = {GitHub repository},
publisher = {GitHub},
howpublished = {\url{https://github.com/huggingface/trl}}
}
License
MIT License - See LICENSE file for details
Acknowledgments
- Dicta Research Center for DictaLM-2.0-Instruct base model
- OpenAI for GPT-5 teacher supervision
- Hugging Face for model hosting and transformers library
- OlaH-5000 and HeDetox dataset creators
- TRL Team for Direct Preference Optimization implementation