Hebrew Offensive Language Detection with Reasoning (offensive_v5_dpo)
This model is a fine-tuned version of dicta-il/dictalm2.0-instruct specialized for detecting offensive language in Hebrew text while providing explainable rationales in Hebrew.
Model Repository: KevynKrancenblum/hebrew-offensive-detection
What Does This Model Do?
This model performs binary classification of Hebrew text to determine whether it contains offensive language, and it explains its reasoning in Hebrew. It addresses key challenges in Hebrew NLP.
Key Capabilities
- Offensive Language Detection: Classifies Hebrew text as offensive (label: 1) or non-offensive (label: 0)
- Explainable Predictions: Generates Hebrew rationales explaining why text is classified as offensive or not
- Cultural Awareness: Fine-tuned on Hebrew-specific offensive patterns, including:
  - Cultural insults and slurs (קללות)
  - Political and ethnic hate speech (הסתה)
  - Threats and aggressive language (איומים)
  - Context-dependent offensiveness in Israeli discourse
Performance Metrics
| Dataset | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| OlaH-5000 (test) | 0.85 | 0.85 | 0.85 | 0.85 |
| HeDetox (cross-domain) | 0.91 | 0.92 | 0.91 | 0.91 |
Comparison with baselines (a minimal evaluation sketch follows this list):
- AlephBERT (fine-tuned): 0.84 F1 (no explanations)
- heBERT (fine-tuned): 0.85 F1 (no explanations)
- GPT-5 (zero-shot): 0.77 F1 (lacks Hebrew cultural grounding)
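For orientation, here is a minimal sketch of how metrics like these could be computed with scikit-learn, assuming a labeled test split with `text` and `label` columns and the `classify_hebrew_text` helper defined in the Quick Start section below; the file path, column names, and weighted averaging are illustrative assumptions, not the official evaluation script.

```python
# Hedged evaluation sketch; assumes classify_hebrew_text() from the Quick Start
# section and a CSV with "text" and "label" columns (names are assumptions).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import pandas as pd

def evaluate(df: pd.DataFrame) -> dict:
    preds = []
    for text in df["text"]:
        pred = classify_hebrew_text(text)["label"]
        preds.append(pred if pred is not None else 0)  # fall back to non-offensive if parsing fails
    accuracy = accuracy_score(df["label"], preds)
    precision, recall, f1, _ = precision_recall_fscore_support(
        df["label"], preds, average="weighted"  # assumption: averaging mode not stated in the card
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# test_df = pd.read_csv("olah5000_test.csv")  # hypothetical path to a labeled test split
# print(evaluate(test_df))
```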
Quick Start
Installation
```bash
pip install transformers torch peft bitsandbytes accelerate
```
Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "KevynKrancenblum/hebrew-offensive-detection"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,  # Use 4-bit quantization for efficiency
    device_map="auto"
)

# Prepare the system prompt in Hebrew (English gloss: "You are a model for detecting
# offensive content in Hebrew. Analyze the text and explain your reasoning.
# Based on the reasoning, give a label: 1 for offensive or 0 for non-offensive.")
SYSTEM_PROMPT = """אתה מודל לזיהוי תוכן פוגעני בעברית. נתח את הטקסט והסבר את הנימוק שלך.
בהתבסס על הנימוק, תן תווית: 1 לפוגעני או 0 ללא פוגעני."""
# Classification function
def classify_hebrew_text(text: str) -> dict:
    prompt = f"{SYSTEM_PROMPT}\n\nטקסט: \"{text}\""
    messages = [{"role": "user", "content": prompt}]
    input_text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.2,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
    # Decode only the newly generated tokens, not the echoed prompt
    response = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    )
    # Parse the response into a label and a Hebrew rationale
    lines = response.split('\n')
    label = None
    reason = None
    for line in lines:
        if 'תווית:' in line or 'label:' in line.lower():
            # Extract the label digit (1 = offensive, 0 = non-offensive)
            if '1' in line:
                label = 1
            elif '0' in line:
                label = 0
        elif reason is None and len(line.strip()) > 10:
            # The model generates its rationale before the label (chain-of-thought),
            # so take the first sufficiently long line as the explanation
            reason = line.strip()
    return {
        "label": label,    # 1 = offensive, 0 = non-offensive
        "reason": reason,  # Hebrew explanation
        "full_response": response
    }

# Example usage
text = "יא אידיוט, לך תמות"  # English: "You idiot, go die"
result = classify_hebrew_text(text)
print(f"Label: {result['label']}")
print(f"Reason: {result['reason']}")
```
Example Output
Input: "יא אידיוט, לך תמות"
Output:
Label: 1 (Offensive)
Reason: הטקסט מכיל קללה ("אידיוט") ואיום ("לך תמות"), שניהם ביטויים פוגעניים שמטרתם להשפיל ולאיים.
Translation: "The text contains an insult ('idiot') and a threat ('go die'), both offensive expressions intended to humiliate and threaten."
Training Methodology
Three-Stage Alignment Pipeline
This model was developed through a three-stage training process that combines teacher-student supervision with preference optimization:
Stage 1: Teacher-Generated Reasoning Supervision
- Teacher Model: GPT-5 (gpt-5-preview)
- Task: Generate high-quality Hebrew rationales explaining offensive/non-offensive classifications
- Dataset: ~8,000 annotated samples from OlaH-5000
- Output: Structured reasoning corpus in Hebrew
Stage 2: Supervised Fine-Tuning (SFT)
- Base Model: DictaLM-2.0-Instruct (7B parameters, Mistral architecture)
- Method: Parameter-Efficient Fine-Tuning (PEFT) using QLoRA
- Training Details (a hedged configuration sketch follows this stage):
  - LoRA adapters: rank=256, alpha=512
  - 4-bit quantization (bitsandbytes)
  - Chain-of-thought supervision (the model learns to generate rationale → label)
  - Training time: ~12 hours on an RTX 4080 SUPER (16GB VRAM)
- Results: 74% F1 (improved neutrality handling)
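As a rough illustration, the QLoRA setup described above might look like the following with peft and bitsandbytes; the quantization type, dropout, and target modules are assumptions, while the rank, alpha, base model, and 4-bit/bfloat16 settings come from this card.

```python
# Hedged sketch of the Stage 2 QLoRA setup (not the exact training script).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit quantization via bitsandbytes
    bnb_4bit_quant_type="nf4",              # assumption: quantization type not stated
    bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 compute, as listed below
)

base_model = AutoModelForCausalLM.from_pretrained(
    "dicta-il/dictalm2.0-instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=256,                 # LoRA rank from the training details
    lora_alpha=512,        # LoRA alpha from the training details
    lora_dropout=0.05,     # assumption: dropout not stated
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # the card reports ~67M trainable parameters (0.96%)
```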
Stage 3: Direct Preference Optimization (DPO)
- Method: Iterative DPO alignment without a separate reward model (a minimal training sketch follows this stage)
- Preference Pairs:
  - Chosen: GPT-5 teacher rationale (correct label + explanation)
  - Rejected: GPT-5-mini rationale (incorrect label + plausible but wrong explanation)
- Three Iterations:
  - Round 1: 80% F1 (balanced precision-recall)
  - Round 2: 82% F1 (refined calibration)
  - Round 3 (this model): 85% F1 (best performance, stable explanations)
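A minimal sketch of what one DPO round might look like with TRL, assuming the Stage 2 SFT model, tokenizer, and the SYSTEM_PROMPT from the Quick Start are already loaded; the dataset contents and beta value are illustrative, not the exact training recipe.

```python
# Hedged sketch of one DPO round; dataset contents and beta are illustrative.
from datasets import Dataset
from trl import DPOConfig, DPOTrainer

# Each preference pair: the GPT-5 teacher rationale is "chosen",
# the GPT-5-mini rationale with the wrong label is "rejected".
preference_data = Dataset.from_list([
    {
        "prompt": SYSTEM_PROMPT + "\n\nטקסט: \"...\"",
        "chosen": "<Hebrew rationale>\nתווית: 1",              # correct label + explanation
        "rejected": "<plausible but wrong rationale>\nתווית: 0",  # incorrect label
    },
    # ... one entry per preference pair
])

dpo_args = DPOConfig(
    output_dir="dpo_round_1",          # hypothetical output directory
    per_device_train_batch_size=2,     # batch size per device, as listed below
    gradient_accumulation_steps=16,    # effective batch size 32
    learning_rate=2e-5,
    bf16=True,
    beta=0.1,                          # assumption: DPO beta not stated in the card
)

trainer = DPOTrainer(
    model=model,                   # the Stage 2 SFT model with LoRA adapters attached
    args=dpo_args,
    train_dataset=preference_data,
    processing_class=tokenizer,
)
trainer.train()
```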
Why DPO?
Direct Preference Optimization was chosen over traditional RLHF/PPO because:
- ✅ No separate reward model required
- ✅ Computationally efficient (trainable on consumer GPUs)
- ✅ Single-stage optimization
- ✅ Comparable or superior performance to full RLHF
- ✅ More stable training dynamics
Training Configuration
Hardware:
- Single NVIDIA RTX 4080 SUPER (16GB VRAM)
- Total training time: ~32 hours (all stages)
Hyperparameters (a training-arguments sketch follows this section):
- Epochs: 50 (SFT), 3 (DPO iterations)
- Batch size: 2 per device, gradient accumulation: 16 (effective batch = 32)
- Learning rate: 2×10⁻⁵ (linear warmup)
- Max sequence length: 512 tokens
- Precision: bfloat16
- Optimizer: AdamW
Memory Optimization:
- QLoRA reduces memory from ~28GB (FP16) to <7GB (4-bit)
- Gradient checkpointing enabled
- LoRA adapters: 67M trainable parameters (0.96% of base model)
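Putting the hyperparameters above into code, the SFT stage's training arguments might be configured roughly as follows with TRL; the warmup ratio and output directory are assumptions, while the rest mirrors the values listed in this section.

```python
# Hedged sketch of the SFT training arguments (Stage 2); not the exact recipe.
from trl import SFTConfig, SFTTrainer

sft_args = SFTConfig(
    output_dir="sft_offensive_v5",     # hypothetical output directory
    num_train_epochs=50,               # 50 SFT epochs
    per_device_train_batch_size=2,     # batch size 2 per device
    gradient_accumulation_steps=16,    # effective batch size 32
    learning_rate=2e-5,                # 2x10^-5
    lr_scheduler_type="linear",        # linear warmup / decay
    warmup_ratio=0.03,                 # assumption: exact warmup not stated
    max_length=512,                    # max sequence length of 512 tokens
    bf16=True,                         # bfloat16 precision
    optim="adamw_torch",               # AdamW optimizer
    gradient_checkpointing=True,       # enabled to save memory
)

trainer = SFTTrainer(
    model=model,                 # quantized base model with LoRA adapters (see the Stage 2 sketch)
    args=sft_args,
    train_dataset=sft_dataset,   # hypothetical dataset of rationale + label targets from Stage 1
    processing_class=tokenizer,
)
trainer.train()
```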
Use Cases
This model is designed for:
- Content Moderation: Automated detection of offensive content in Hebrew social media, forums, and comment sections
- Educational Tools: Teaching about offensive language patterns with explainable feedback
- Research: Studying Hebrew offensive language and cultural hate speech patterns
- Compliance: Helping platforms enforce community guidelines in Hebrew
Datasets Used
- OlaH-5000: Primary training dataset for Hebrew offensive language
- HeDetox: Cross-domain evaluation dataset for Hebrew text detoxification
Limitations
- Slang and Youth Language: May struggle with emerging slang, metaphorical insults, or internet-specific Hebrew
- Spelling Variations: Performance degrades with unconventional spellings or corrupted text
- Domain Specificity: Optimized for social media text (Twitter/Facebook style)
- Cultural Subjectivity: Inherits biases from training data annotations
- Context Length: Limited to 512 tokens, so very long texts may lose context (a simple truncation guard is sketched below)
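Because the training context is 512 tokens, a simple guard is to truncate long inputs before classification; the helper below is a suggestion, not part of the model's API, and the token budget is an assumption that leaves headroom for the system prompt.

```python
# Hedged helper: truncate long inputs so prompt + text stay within the
# 512-token training context; the 400-token budget is an assumption.
def truncate_to_context(text: str, max_tokens: int = 400) -> str:
    ids = tokenizer(text, truncation=True, max_length=max_tokens)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)

# result = classify_hebrew_text(truncate_to_context(very_long_post))
```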
Ethical Considerations
⚠️ Important: This model reflects cultural and contextual interpretations of offensiveness in Israeli Hebrew discourse. Classifications should be:
- Used as decision support, not sole determinant
- Combined with human review for sensitive moderation decisions
- Regularly evaluated for bias and fairness
- Contextualized to specific use cases and communities
Training Procedure
This model was trained with Direct Preference Optimization (DPO), a method introduced in Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., 2023).
Framework Versions
- PEFT: 0.17.0
- TRL: 0.21.0
- Transformers: 4.55.2
- PyTorch: 2.6.0+cu124
- Datasets: 4.0.0
- Tokenizers: 0.21.4
- bitsandbytes (used for 4-bit quantization)
Repository and Resources
- GitHub Repository: KevynKrancenblum/hebrew-offensive-detection
- Interactive Demo: Streamlit web interface included in repository
- Documentation: Comprehensive README with usage examples
Citation
If you use this model in your research, please cite:
@mastersthesis{krancenblum2025hebrew,
title={Developing Reasoning-Augmented Language Models for Hebrew Offensive Language Detection},
author={Krancenblum, Kevyn},
year={2025},
school={Sami Shamoon College of Engineering},
note={Model: https://huggingface.co/KevynKrancenblum/hebrew-offensive-detection}
}
Cite DPO Method
@inproceedings{rafailov2023direct,
title = {{Direct Preference Optimization: Your Language Model is Secretly a Reward Model}},
author = {Rafael Rafailov and Archit Sharma and Eric Mitchell and Christopher D. Manning and Stefano Ermon and Chelsea Finn},
year = 2023,
booktitle = {Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023},
url = {http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html}
}
Cite TRL Framework
@misc{vonwerra2022trl,
title = {{TRL: Transformer Reinforcement Learning}},
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
year = 2020,
journal = {GitHub repository},
publisher = {GitHub},
howpublished = {\url{https://github.com/huggingface/trl}}
}
License
MIT License - See LICENSE file for details
Acknowledgments
- Dicta Research Center for DictaLM-2.0-Instruct base model
- OpenAI for GPT-5 teacher supervision
- Hugging Face for model hosting and transformers library
- OlaH-5000 and HeDetox dataset creators
- TRL Team for Direct Preference Optimization implementation