Hebrew Offensive Language Detection with Reasoning (offensive_v5_dpo)

This model is a fine-tuned version of dicta-il/dictalm2.0-instruct specialized for detecting offensive language in Hebrew text while providing explainable rationales in Hebrew.

Model Repository: KevynKrancenblum/hebrew-offensive-detection

What Does This Model Do?

This model performs binary classification of Hebrew text to determine whether it contains offensive language, and it explains its reasoning in Hebrew. It addresses several challenges specific to Hebrew NLP.

Key Capabilities

  1. Offensive Language Detection: Classifies Hebrew text as offensive (label: 1) or non-offensive (label: 0)
  2. Explainable Predictions: Generates Hebrew rationales explaining why text is classified as offensive or not
  3. Cultural Awareness: Fine-tuned on Hebrew-specific offensive patterns including:
    • Cultural insults and slurs (קללות)
    • Political and ethnic hate speech (הסתה)
    • Threats and aggressive language (איומים)
    • Context-dependent offensiveness in Israeli discourse

Performance Metrics

Dataset                  Accuracy   Precision   Recall   F1-Score
OlaH-5000 (test)           0.85       0.85       0.85      0.85
HeDetox (cross-domain)     0.91       0.92       0.91      0.91

Comparison with baselines:

  • AlephBERT (fine-tuned): 0.84 F1 (no explanations)
  • heBERT (fine-tuned): 0.85 F1 (no explanations)
  • GPT-5 (zero-shot): 0.77 F1 (lacks Hebrew cultural grounding)
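
As a minimal sketch of how such metrics could be reproduced on a labeled split, assuming scikit-learn is installed and reusing the classify_hebrew_text helper from the Quick Start below (eval_texts, eval_labels, and the weighted averaging are illustrative assumptions, not the exact evaluation script used for this card):

# Sketch: recomputing accuracy/precision/recall/F1 on a labeled evaluation split.
# Assumes classify_hebrew_text() from the Quick Start section; eval_texts and
# eval_labels are placeholders for your own test data (e.g., the OlaH-5000 test set).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(eval_texts, eval_labels):
    # Treat unparsable outputs (label None) as non-offensive (0)
    predictions = [classify_hebrew_text(t)["label"] or 0 for t in eval_texts]
    accuracy = accuracy_score(eval_labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(
        eval_labels, predictions, average="weighted"  # averaging choice is an assumption
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}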

Quick Start

Installation

pip install transformers torch peft bitsandbytes accelerate

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "KevynKrancenblum/hebrew-offensive-detection"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,  # Use 4-bit quantization for efficiency
    device_map="auto"
)

# Prepare system prompt in Hebrew
SYSTEM_PROMPT = """אתה מומחה לזיהוי תוכן פוגעני בעברית. נתח את הטקסט הבא והסבר את הנימוק שלך.
בהתבסס על הנימוק, תן תווית: 1 לפוגעני או 0 ללא פוגעני."""
# Translation: "You are an expert in identifying offensive content in Hebrew. Analyze the
# following text and explain your reasoning. Based on the reasoning, give a label:
# 1 for offensive or 0 for non-offensive."

# Classification function
def classify_hebrew_text(text: str) -> dict:
    prompt = f"{SYSTEM_PROMPT}\n\nטקסט: \"{text}\""

    messages = [{"role": "user", "content": prompt}]
    input_text = tokenizer.apply_chat_template(messages, tokenize=False)

    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.2,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

    # Decode only the newly generated tokens (skip the prompt portion)
    generated_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

    # Parse response
    lines = response.split('\n')
    label = None
    reason = None

    for line in lines:
        if 'תווית:' in line or 'label:' in line.lower():
            # Extract the label (1 = offensive, 0 = non-offensive)
            if '1' in line and 'פוגעני' in line:
                label = 1
            elif '0' in line:
                label = 0
        elif len(line.strip()) > 10 and label is None:
            # The chain-of-thought rationale precedes the label line, so keep
            # the last substantial line seen before the label is found
            reason = line.strip()

    return {
        "label": label,  # 1 = offensive, 0 = non-offensive
        "reason": reason,  # Hebrew explanation
        "full_response": response
    }

# Example usage
text = "יא מטומטם, לך תמות"
result = classify_hebrew_text(text)

print(f"Label: {result['label']}")
print(f"Reason: {result['reason']}")

Example Output

Input: "יא מטומטם, לך תמות"

Output:

Label: 1 (Offensive)
Reason: הטקסט מכיל קללה ("מטומטם") ואיום ("לך תמות"), שניהם ביטויים פוגעניים המטרתם להשפיל ולאיים.

Translation: "The text contains an insult ('idiot') and a threat ('go die'), both offensive expressions intended to humiliate and threaten."
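
For moderation-style workloads the classifier is typically applied to many comments in a loop; below is a small usage sketch reusing classify_hebrew_text from above (the example texts are illustrative):

# Classify a small batch of comments (texts are illustrative examples).
comments = [
    "יא מטומטם, לך תמות",  # the offensive example above
    "איזה יום יפה היום",    # a neutral sentence ("what a nice day it is today")
]

for comment in comments:
    result = classify_hebrew_text(comment)
    print(f"{comment} -> label={result['label']}")
    print(f"  reason: {result['reason']}")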

Training Methodology

Three-Stage Alignment Pipeline

This model was developed through a sophisticated three-stage training process combining teacher-student learning with preference optimization:

Stage 1: Teacher-Generated Reasoning Supervision

  • Teacher Model: GPT-5 (gpt-5-preview)
  • Task: Generate high-quality Hebrew rationales explaining offensive/non-offensive classifications
  • Dataset: ~8,000 annotated samples from OlaH-5000
  • Output: Structured reasoning corpus in Hebrew
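
As an illustration only, a teacher-labeling loop for this stage could look roughly like the sketch below; the prompt wording, record fields, and use of the OpenAI chat-completions client are assumptions rather than the exact pipeline used for this model.

# Illustrative sketch of Stage 1: collecting teacher-generated Hebrew rationales.
# The prompt text, field names, and olah_samples placeholder are assumptions.
import json
from openai import OpenAI

client = OpenAI()

def generate_rationale(text: str, gold_label: int) -> dict:
    response = client.chat.completions.create(
        model="gpt-5-preview",  # teacher model named in this card
        messages=[
            {"role": "system", "content": "Explain in Hebrew why the given text is or is not offensive, then restate the label."},
            {"role": "user", "content": f"Text: {text}\nGold label: {gold_label}"},
        ],
    )
    return {
        "text": text,
        "label": gold_label,
        "rationale": response.choices[0].message.content,
    }

# olah_samples would hold (text, label) pairs from the annotated OlaH-5000 data:
# with open("reasoning_corpus.jsonl", "w", encoding="utf-8") as f:
#     for t, y in olah_samples:
#         f.write(json.dumps(generate_rationale(t, y), ensure_ascii=False) + "\n")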

Stage 2: Supervised Fine-Tuning (SFT)

  • Base Model: DictaLM-2.0-Instruct (7B parameters, Mistral architecture)
  • Method: Parameter-Efficient Fine-Tuning (PEFT) using QLoRA
  • Training Details:
    • LoRA adapters: rank=256, alpha=512
    • 4-bit quantization (bitsandbytes)
    • Chain-of-thought supervision (model learns to generate rationale → label)
    • Training time: ~12 hours on RTX 4080 SUPER (16GB VRAM)
  • Results: 74% F1 (improved neutrality handling)
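
A minimal QLoRA SFT setup consistent with these settings might look like the sketch below (using peft and trl); the dataset format, LoRA dropout, and most trainer arguments are assumptions, not the exact training script.

# Illustrative Stage 2 sketch: QLoRA supervised fine-tuning on rationale -> label targets.
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

base_model = AutoModelForCausalLM.from_pretrained(
    "dicta-il/dictalm2.0-instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=256,              # LoRA rank reported above
    lora_alpha=512,     # LoRA alpha reported above
    lora_dropout=0.05,  # assumption
    task_type="CAUSAL_LM",
)

# Placeholder corpus: each example is a prompt followed by rationale and label.
reasoning_corpus = Dataset.from_list([
    {"text": "טקסט: ...\nנימוק: ...\nתווית: 1"},
])

sft_args = SFTConfig(
    output_dir="sft-offensive-hebrew",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,  # effective batch size 32
    learning_rate=2e-5,
    bf16=True,
    gradient_checkpointing=True,
)

trainer = SFTTrainer(
    model=base_model,
    args=sft_args,
    train_dataset=reasoning_corpus,
    peft_config=lora_config,
)
trainer.train()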

Stage 3: Direct Preference Optimization (DPO)

  • Method: Iterative DPO alignment without reward model
  • Preference Pairs:
    • Chosen: GPT-5 teacher rationale (correct label + explanation)
    • Rejected: GPT-5-mini rationale (incorrect label + plausible but wrong explanation)
  • Three Iterations:
    • Round 1: 80% F1 (balanced precision-recall)
    • Round 2: 82% F1 (refined calibration)
    • Round 3 (this model): 85% F1 (optimal performance, stable explanations)
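
TRL's DPOTrainer consumes preference data as prompt/chosen/rejected records. The following is a hedged sketch of one DPO round over such pairs; the example pair, beta value, and trainer arguments are illustrative assumptions, not the exact configuration used here.

# Illustrative Stage 3 sketch: one DPO round over chosen/rejected rationale pairs.
from datasets import Dataset
from transformers import AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Toy preference pair: chosen = correct label with sound rationale,
# rejected = plausible but wrong rationale and label (contents are placeholders).
preference_pairs = Dataset.from_list([
    {
        "prompt": "נתח את הטקסט הבא: ...",
        "chosen": "הטקסט מכיל קללה ולכן פוגעני. תווית: 1",
        "rejected": "הטקסט אינו פוגעני. תווית: 0",
    }
])

tokenizer = AutoTokenizer.from_pretrained("dicta-il/dictalm2.0-instruct")

dpo_args = DPOConfig(
    output_dir="dpo-offensive-hebrew",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    bf16=True,
    beta=0.1,  # strength of the implicit KL penalty; assumption
)

trainer = DPOTrainer(
    model="sft-offensive-hebrew",  # path to the Stage 2 SFT checkpoint
    args=dpo_args,
    train_dataset=preference_pairs,
    processing_class=tokenizer,
)
trainer.train()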

Why DPO?

Direct Preference Optimization was chosen over traditional RLHF/PPO because:

  • ✅ No separate reward model required
  • ✅ Computationally efficient (trainable on consumer GPUs)
  • ✅ Single-stage optimization
  • ✅ Comparable or superior performance to full RLHF
  • ✅ More stable training dynamics
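
Concretely, DPO trains the policy directly on preference pairs with the objective below (from the DPO paper cited at the end of this card), where y_w and y_l are the chosen and rejected responses, π_ref is the frozen SFT reference policy, and β controls the strength of the implicit KL penalty:

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]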

Training Configuration

Hardware:

  • Single NVIDIA RTX 4080 SUPER (16GB VRAM)
  • Total training time: ~32 hours (all stages)

Hyperparameters:

  • Epochs: 50 (SFT), 3 (DPO iterations)
  • Batch size: 2 per device, gradient accumulation: 16 (effective batch = 32)
  • Learning rate: 2×10⁻⁵ (linear warmup)
  • Max sequence length: 512 tokens
  • Precision: bfloat16
  • Optimizer: AdamW

Memory Optimization:

  • QLoRA reduces memory from ~28GB (FP16) to <7GB (4-bit)
  • Gradient checkpointing enabled
  • LoRA adapters: 67M trainable parameters (0.96% of base model)
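
The trainable-parameter fraction can be verified once the adapters are attached; below is a short sketch (the target modules listed are an assumption for a Mistral-style architecture):

# Sketch: enable gradient checkpointing, attach LoRA adapters, and report
# the trainable-parameter fraction (target_modules are assumptions).
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = AutoModelForCausalLM.from_pretrained(
    "dicta-il/dictalm2.0-instruct",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
base.gradient_checkpointing_enable()          # trade compute for activation memory
base = prepare_model_for_kbit_training(base)  # standard QLoRA preparation

peft_model = get_peft_model(
    base,
    LoraConfig(r=256, lora_alpha=512, task_type="CAUSAL_LM",
               target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]),
)
peft_model.print_trainable_parameters()  # prints trainable params and their percentage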

Use Cases

This model is designed for:

  1. Content Moderation: Automated detection of offensive content in Hebrew social media, forums, and comment sections
  2. Educational Tools: Teaching about offensive language patterns with explainable feedback
  3. Research: Studying Hebrew offensive language and cultural hate speech patterns
  4. Compliance: Helping platforms enforce community guidelines in Hebrew

Datasets Used

  • OlaH-5000: Primary training dataset for Hebrew offensive language
  • HeDetox: Cross-domain evaluation dataset for Hebrew text detoxification

Limitations

  • Slang and Youth Language: May struggle with emerging slang, metaphorical insults, or internet-specific Hebrew
  • Spelling Variations: Performance degrades with unconventional spellings or corrupted text
  • Domain Specificity: Optimized for social media text (Twitter/Facebook style)
  • Cultural Subjectivity: Inherits biases from training data annotations
  • Context Length: Limited to 512 tokens (may miss context in very long texts)
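
For the context-length limitation in particular, truncating explicitly keeps calls on over-long inputs well defined; whether to truncate, split, or summarize long texts is application-dependent. A minimal guard, reusing the tokenizer call from the Quick Start:

# Truncate over-long inputs to the model's 512-token window before generation.
inputs = tokenizer(
    input_text,
    return_tensors="pt",
    truncation=True,
    max_length=512,
).to(model.device)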

Ethical Considerations

⚠️ Important: This model reflects cultural and contextual interpretations of offensiveness in Israeli Hebrew discourse. Classifications should be:

  • Used as decision support, not sole determinant
  • Combined with human review for sensitive moderation decisions
  • Regularly evaluated for bias and fairness
  • Contextualized to specific use cases and communities

Training Procedure

This model was trained with Direct Preference Optimization (DPO), a method introduced in Direct Preference Optimization: Your Language Model is Secretly a Reward Model.

Framework Versions

  • PEFT: 0.17.0
  • TRL: 0.21.0
  • Transformers: 4.55.2
  • PyTorch: 2.6.0+cu124
  • Datasets: 4.0.0
  • Tokenizers: 0.21.4
  • bitsandbytes (4-bit quantization)

Citation

If you use this model in your research, please cite:

@mastersthesis{krancenblum2025hebrew,
  title={Developing Reasoning-Augmented Language Models for Hebrew Offensive Language Detection},
  author={Krancenblum, Kevyn},
  year={2025},
  school={Sami Shamoon College of Engineering},
  note={Model: https://huggingface.co/KevynKrancenblum/hebrew-offensive-detection}
}

Cite DPO Method

@inproceedings{rafailov2023direct,
    title        = {{Direct Preference Optimization: Your Language Model is Secretly a Reward Model}},
    author       = {Rafael Rafailov and Archit Sharma and Eric Mitchell and Christopher D. Manning and Stefano Ermon and Chelsea Finn},
    year         = 2023,
    booktitle    = {Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023},
    url          = {http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html}
}

Cite TRL Framework

@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}

License

MIT License - See LICENSE file for details

Acknowledgments

  • Dicta Research Center for DictaLM-2.0-Instruct base model
  • OpenAI for GPT-5 teacher supervision
  • Hugging Face for model hosting and transformers library
  • OlaH-5000 and HeDetox dataset creators
  • TRL Team for Direct Preference Optimization implementation