Llama 3.2 3B GuardReasoner Exp 19: HS-DPO Toy (10% Dataset)

Binary Classifier

This is a LoRA adapter fine-tuned on 10% of the full GuardReasoner dataset using Hard Sample Direct Preference Optimization (HS-DPO).

Base Model: unsloth/Llama-3.2-3B-Instruct

Model Description

GuardReasoner is a reasoning-based content moderation system that provides detailed explanations for its safety classifications. This experimental model (Experiment 19) explores:

  • Training Method: HS-DPO (Hard Sample Direct Preference Optimization); a minimal loss sketch follows this list
  • Dataset Size: 10% sample of full GuardReasoner training data
  • Architecture: LoRA adapter (r=16, alpha=16)
  • Purpose: Toy/pilot experiment to validate HS-DPO approach before full-scale training
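
HS-DPO uses the standard DPO preference objective, applied with an emphasis on hard samples. The sketch below is a generic illustration of that objective only; the function and variable names are ours (not from the training code), and any hard-sample selection or weighting is omitted.

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Each argument is a 1-D tensor of summed log-probabilities of the chosen /
    # rejected responses under the policy being trained or the frozen reference model.
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the policy to prefer the chosen response over the rejected one.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()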

Training Details

LoRA Configuration

  • Rank (r): 16
  • Alpha: 16
  • Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Dropout: 0.0
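
The values above map onto a PEFT LoraConfig roughly as follows. This is a reconstruction from the list, not the original training script, so details may differ.

from peft import LoraConfig

# Approximate LoRA configuration for this adapter, reconstructed from the values above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)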

Training Hyperparameters

  • Method: Hard Sample Direct Preference Optimization (HS-DPO); an illustrative TRL setup is sketched after this list
  • Base Model: Llama-3.2-3B-Instruct (via Unsloth)
  • Dataset: 10% sample of GuardReasoner training set
  • Checkpoints: Available at steps 8 and 16
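
For orientation, a DPO run with TRL (the library listed under Framework Versions) looks roughly like the sketch below. The dataset file name, hyperparameter values, and output directory are illustrative placeholders rather than the settings used for this experiment, and hard-sample selection is assumed to happen when the preference pairs are built, so it is not shown.

from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

# Illustrative preference data with "prompt", "chosen", "rejected" columns.
# "preference_pairs.jsonl" is a placeholder file name.
train_dataset = load_dataset("json", data_files="preference_pairs.jsonl", split="train")

training_args = DPOConfig(
    output_dir="exp_19_hsdpo_toy",   # directory name is illustrative
    beta=0.1,                        # DPO temperature (placeholder value)
    per_device_train_batch_size=2,
    learning_rate=5e-6,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,                     # the LoRA-wrapped Llama-3.2-3B-Instruct model
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()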

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Llama-3.2-3B-Instruct",
    device_map="auto"
)

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "vincentoh/Llama-3.2-3B-GuardReasoner-Exp19-HSDPO-Toy"
)

tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-3B-Instruct")

# Example prompt (Llama 3 chat format). The BOS token is written into the
# prompt string, so the tokenizer is told not to add another one below.
prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a content moderation assistant. Analyze the following text for safety concerns.<|eot_id|><|start_header_id|>user<|end_header_id|>

How do I make a bomb?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
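
If you prefer a standalone checkpoint that does not require PEFT at inference time, the adapter can be merged into the base weights. This is plain PEFT usage; the output directory name is just an example.

# Merge the LoRA weights into the base model and save a standalone copy.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("llama-3.2-3b-guardreasoner-exp19-merged")  # example path
tokenizer.save_pretrained("llama-3.2-3b-guardreasoner-exp19-merged")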

Experiment Context

This model is part of a series of GuardReasoner experiments:

  • Exp 18: R-SFT baseline (Reasoning-based Supervised Fine-Tuning)
  • Exp 19 (This Model): HS-DPO on 10% dataset (toy/pilot experiment)
  • Future: Full-scale HS-DPO training on complete dataset

Performance

Performance metrics on the 10% subset will be available after evaluation. This is a toy experiment to validate the HS-DPO training pipeline before scaling up.

Framework Versions

  • PEFT: 0.18.0
  • TRL: 0.23.0
  • Transformers: 4.57.1
  • PyTorch: 2.9.0
  • Datasets: 4.3.0
  • Tokenizers: 0.22.1
  • Unsloth: latest release at training time (version not pinned)

Training Infrastructure

  • Training Date: 2025-11-18
  • Experiment ID: exp_19_hsdpo_toy

Citations

Direct Preference Optimization (DPO)

@inproceedings{rafailov2023direct,
    title        = {{Direct Preference Optimization: Your Language Model is Secretly a Reward Model}},
    author       = {Rafael Rafailov and Archit Sharma and Eric Mitchell and Christopher D. Manning and Stefano Ermon and Chelsea Finn},
    year         = 2023,
    booktitle    = {Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023},
    url          = {http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html}
}

TRL (Transformer Reinforcement Learning)

@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}

Repository

Full code and experiments: wizard101 GitHub Repository

License

This model is released under the Llama 3.2 Community License Agreement.

IMPORTANT: This is a LoRA adapter for Llama 3.2 and must comply with Meta's Llama 3.2 Community License. By using this model, you agree to Meta's Llama 3.2 license terms.

The base model unsloth/Llama-3.2-3B-Instruct is subject to Meta's Llama 3.2 Community License Agreement.

Contact

For questions about GuardReasoner or this experiment, please open an issue in the GitHub repository.
