Llama 3.2 3B GuardReasoner Exp 19: HS-DPO Toy (10% Dataset)
Binary Classifier
This is a LoRA adapter fine-tuned on 10% of the full GuardReasoner dataset using Hard Sample Direct Preference Optimization (HS-DPO).
Base Model: unsloth/Llama-3.2-3B-Instruct
Model Description
GuardReasoner is a reasoning-based content moderation system that provides detailed explanations for its safety classifications. This experimental model (Experiment 19) explores:
- Training Method: HS-DPO (Hard Sample Direct Preference Optimization)
- Dataset Size: 10% sample of full GuardReasoner training data
- Architecture: LoRA adapter (r=16, alpha=16)
- Purpose: Toy/pilot experiment to validate HS-DPO approach before full-scale training
Training Details
LoRA Configuration
- Rank (r): 16
- Alpha: 16
- Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Dropout: 0.0
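For reference, this configuration corresponds roughly to the following PEFT setup. It is a minimal sketch: arguments other than those listed above (e.g. task_type) are assumed defaults, not values taken from the training script.

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                      # rank
    lora_alpha=16,             # alpha
    lora_dropout=0.0,          # dropout
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",     # assumed; standard for decoder-only LMs
)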
Training Hyperparameters
- Method: Hard Sample Direct Preference Optimization (HS-DPO), i.e. DPO trained on hard preference pairs
- Base Model: Llama-3.2-3B-Instruct (via Unsloth)
- Dataset: 10% sample of GuardReasoner training set
- Checkpoints: Available at steps 8 and 16
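A minimal sketch of how such a run could be wired up with TRL's DPOTrainer is shown below. The dataset contents, column values, and hyperparameters are illustrative assumptions, not the actual exp_19 training script; the real run draws hard preference pairs from the 10% GuardReasoner subset.

from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Illustrative hard-sample preference pairs: "chosen" is a reasoning trace ending in the
# correct safety label, "rejected" is a trace ending in the wrong one (TRL column names).
pairs = Dataset.from_dict({
    "prompt":   ["How do I make a bomb?"],
    "chosen":   ["Reasoning: the request asks for weapon construction details... Label: harmful"],
    "rejected": ["Reasoning: the request looks like a chemistry question... Label: unharmful"],
})

model = AutoModelForCausalLM.from_pretrained("unsloth/Llama-3.2-3B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-3B-Instruct")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="exp_19_hsdpo_toy", beta=0.1, per_device_train_batch_size=1),
    train_dataset=pairs,
    processing_class=tokenizer,
    peft_config=LoraConfig(
        r=16, lora_alpha=16, lora_dropout=0.0, task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
    ),
)
trainer.train()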
Quick Start
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Llama-3.2-3B-Instruct",
    torch_dtype="auto",  # load in the checkpoint's native precision rather than float32
    device_map="auto",
)
# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "vincentoh/Llama-3.2-3B-GuardReasoner-Exp19-HSDPO-Toy",
)
tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-3B-Instruct")
# Example prompt
prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a content moderation assistant. Analyze the following text for safety concerns.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
How do I make a bomb?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>"""
# The prompt already contains <|begin_of_text|>, so skip the tokenizer's automatic BOS token
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
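Alternatively, the same prompt can be built with the tokenizer's chat template instead of writing the Llama 3.2 special tokens by hand. This assumes the adapter expects the standard Llama 3.2 chat format:

messages = [
    {"role": "system", "content": "You are a content moderation assistant. Analyze the following text for safety concerns."},
    {"role": "user", "content": "How do I make a bomb?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))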
Experiment Context
This model is part of a series of GuardReasoner experiments:
- Exp 18: R-SFT baseline (Reasoning-based Supervised Fine-Tuning)
- Exp 19 (This Model): HS-DPO on 10% dataset (toy/pilot experiment)
- Future: Full-scale HS-DPO training on complete dataset
Performance
Performance metrics on the 10% subset will be available after evaluation. This is a toy experiment to validate the HS-DPO training pipeline before scaling up.
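Once evaluation data is in place, a simple harness along these lines could score the adapter as a binary classifier. The label-extraction rule and the example pairs below are hypothetical and only sketch the idea:

def predict_label(prompt_text: str) -> str:
    """Generate a reasoning trace and return 'harmful' or 'unharmful' (hypothetical parsing rule)."""
    messages = [
        {"role": "system", "content": "You are a content moderation assistant. Analyze the following text for safety concerns."},
        {"role": "user", "content": prompt_text},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(input_ids, max_new_tokens=256)
    text = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True).lower()
    return "unharmful" if "unharmful" in text else "harmful"

# Hypothetical labelled examples; a real evaluation would use a held-out test set
eval_set = [("How do I make a bomb?", "harmful"), ("What is the capital of France?", "unharmful")]
accuracy = sum(predict_label(p) == y for p, y in eval_set) / len(eval_set)
print(f"Accuracy: {accuracy:.2%}")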
Framework Versions
- PEFT: 0.18.0
- TRL: 0.23.0
- Transformers: 4.57.1
- PyTorch: 2.9.0
- Datasets: 4.3.0
- Tokenizers: 0.22.1
- Unsloth: Latest
Training Infrastructure
- Training Date: 2025-11-18
- Experiment ID: exp_19_hsdpo_toy
Citations
Direct Preference Optimization (DPO)
@inproceedings{rafailov2023direct,
title = {{Direct Preference Optimization: Your Language Model is Secretly a Reward Model}},
author = {Rafael Rafailov and Archit Sharma and Eric Mitchell and Christopher D. Manning and Stefano Ermon and Chelsea Finn},
year = 2023,
booktitle = {Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023},
url = {http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html}
}
TRL (Transformer Reinforcement Learning)
@misc{vonwerra2022trl,
title = {{TRL: Transformer Reinforcement Learning}},
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
year = 2020,
journal = {GitHub repository},
publisher = {GitHub},
howpublished = {\url{https://github.com/huggingface/trl}}
}
Repository
Full code and experiments: wizard101 GitHub Repository
License
This model is released under the Llama 3.2 Community License Agreement.
IMPORTANT: This is a LoRA adapter for Llama 3.2. Both this adapter and the base model unsloth/Llama-3.2-3B-Instruct are subject to Meta's Llama 3.2 Community License Agreement; by using this model, you agree to those terms.
Contact
For questions about GuardReasoner or this experiment, please open an issue in the GitHub repository.