Model Card for ShAIkespear/Phi-2_DPO_Anton
A LoRA-finetuned variant of microsoft/phi-2 targeting multiple-choice question answering (MCQA) tasks, developed as part of the ShAIkespear EPFL project. This model was trained and aligned using Direct Preference Optimization (DPO) after Supervised Finetuning (SFT) on general and STEM MCQA datasets.
This checkpoint emphasizes preference-based alignment via DPO, with a focus on improving consistency and correctness on EPFL-style MCQ evaluations.
Model Details
Model Description
This model extends Phi-2 (2.78B parameters, 2,048 context length) using LoRA adapters (rank=16, α=16, dropout=0.05). Training followed the SFT → DPO sequence, leveraging Hugging Face TRL and PEFT frameworks. The model was trained at full precision, without quantization.
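For reference, a minimal sketch of the adapter configuration described above, using the PEFT library. The target_modules list is an assumption (typical Phi-2 attention-projection names); the exact modules used by the team are not stated in this card.

from peft import LoraConfig

# Sketch of the LoRA setup described above (rank=16, alpha=16, dropout=0.05).
# target_modules is an assumption; the card does not list the adapted modules.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
)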
- Developed by: ShAIkespear team
- Shared by: ShAIkespear team
- Model type: Causal decoder-only LM (Phi-2) with LoRA adapters; DPO-aligned MCQA model
- Language(s): English
- License: MIT
- Finetuned from model: microsoft/phi-2
Model Sources
- Repository: 2.8B-Phi-2-LLM-QA
- Report: “ShAIkespear – How to replace TAs: A comprehensive study on letting LLMs answer your questions”
Uses
Direct Use
- Multiple-choice question answering (MCQA) for educational and benchmarking contexts (e.g., MMLU, ScienceQA).
- Research into DPO-aligned fine-tuning and human preference optimization.
- Exploration of prompt-format sensitivity for educational question-answering.
Out-of-Scope Use
- Critical decision-making (medical, legal, financial) without human oversight.
- Generative free-form writing or extended reasoning tasks (model tuned specifically for MCQA).
- Any form of automated exam-taking or misuse of assessment materials.
Bias, Risks, and Limitations
- Performance Variance: STEM questions remain challenging; accuracy is near random (~0.25) on complex math and science items.
- Overalignment: DPO training may overly prioritize stylistic preference patterns, slightly reducing factual robustness.
- Data Sensitivity: Includes human preference data from EPFL coursework; may reflect biases of a limited annotator pool.
Recommendations
- Use for controlled research or tutoring scenarios, not autonomous grading.
- Always combine with explicit answer-format prompting (e.g., “### Answer: A”).
- Include human oversight for content validation or pedagogical deployment.
How to Get Started with the Model
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ShAIkespear/Phi-2_DPO_Anton"

# Load tokenizer and model; device_map="auto" places the weights on the available GPU or CPU.
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16)  # fp16 is optional; omit torch_dtype to load at full precision

# The prompt follows the "### Question / ### Explanation / ### Answer" template used during training.
prompt = "### Question: Which gas is most abundant in Earth's atmosphere?\n### Explanation: Identify the main atmospheric component.\n### Answer:"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=15)
print(tok.decode(out[0], skip_special_tokens=True))
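For prompts that list lettered choices, the predicted option can be pulled from the text after the "### Answer:" marker. A minimal heuristic sketch (not the team's official parsing code):

import re

# Extract the first option letter (A-D) appearing after the "### Answer:" marker.
def extract_answer_letter(generated_text):
    tail = generated_text.split("### Answer:")[-1]
    match = re.search(r"\b([A-D])\b", tail)
    return match.group(1) if match else None

print(extract_answer_letter(tok.decode(out[0], skip_special_tokens=True)))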
Training Details
Training Data
Fine-tuning and preference alignment were performed on a mix of:
- Public MCQA datasets: MathQA, OpenBookQA, ScienceQA, and TAL-SCQ5K.
- Private data: EPFL student-curated MCQA and preference pairs (~20–30k).
- Filtered for ≤512 tokens; each dataset was truncated to 20k samples (see the filtering sketch after this list).
- Splits roughly: 50% train, 25% test_overfit, 10% comparison, 15% held-out.
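A minimal sketch of the length filtering and truncation step, assuming the Hugging Face datasets library. The 512-token limit and 20k cap come from this card; the "prompt" column name and the use of the model tokenizer for counting are assumptions.

from datasets import Dataset

# Keep only examples whose formatted prompt fits within the token budget,
# then cap the dataset at max_samples. Column name and tokenizer choice are assumptions.
def filter_and_truncate(ds: Dataset, tokenizer, max_tokens: int = 512, max_samples: int = 20_000) -> Dataset:
    ds = ds.filter(lambda ex: len(tokenizer(ex["prompt"]).input_ids) <= max_tokens)
    return ds.select(range(min(max_samples, len(ds))))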
Training Procedure
Preprocessing
Examples were normalized to a unified MCQA schema (id, subject, question, choices, correct answer).
Prompts structured as:
### Question ... ### Explanation ... ### Answer
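A sketch of how a unified-schema example can be rendered into that template. The key names and lettered-choice layout are assumptions; the card only lists the schema fields and the three section headers.

# Render a unified-schema example into the "### Question / ### Explanation / ### Answer" template.
# Key names and the lettered-choice layout are assumptions, not the team's exact code.
def build_prompt(example: dict, include_answer: bool = False) -> str:
    letters = "ABCDEFGH"
    choices = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(example["choices"]))
    prompt = (
        f"### Question: {example['question']}\n{choices}\n"
        f"### Explanation: {example.get('explanation', '')}\n"
        f"### Answer:"
    )
    if include_answer:
        prompt += f" {example['correct_answer']}"
    return prompt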
For DPO:
- Chosen/rejected preference pairs were constructed from student feedback or model-comparison data (an illustrative record follows).
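TRL's DPO training expects each record to carry a prompt plus the preferred and dispreferred completions. One such record might look like the following (contents illustrative, not taken from the actual dataset):

# One DPO training record in the prompt/chosen/rejected format used by TRL.
preference_example = {
    "prompt": "### Question: Which gas is most abundant in Earth's atmosphere?\n### Explanation: Identify the main atmospheric component.\n### Answer:",
    "chosen": " A. Nitrogen",
    "rejected": " C. Oxygen",
}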
Training Hyperparameters
- Regime: Mixed precision (fp16/bf16).
- LoRA: rank=16, α=16, dropout=0.05.
- Batch size: SFT = 4; DPO = 1.
- Learning rate: 1e-5 for public data; 1e-4 for EPFL preference data.
- Scheduler: Cosine with warmup.
- Frameworks: Hugging Face TRL + PEFT/LoRA.
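A sketch of how the SFT-stage settings above map onto Hugging Face TrainingArguments, which would then be passed to TRL's trainers (the DPO stage uses batch size 1 and the learning rates noted above). The warmup ratio and epoch count are assumptions; the card only says "cosine with warmup".

from transformers import TrainingArguments

# SFT-stage settings from the list above; warmup_ratio and num_train_epochs are assumptions.
sft_args = TrainingArguments(
    output_dir="phi2-mcqa-sft",
    per_device_train_batch_size=4,   # the DPO stage used batch size 1
    learning_rate=1e-5,              # 1e-4 was used for EPFL preference data
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    num_train_epochs=1,
    fp16=True,                       # mixed-precision regime
)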
Evaluation
Testing Data, Factors & Metrics
Testing Data
MMLU (STEM subset), EPFL preference test sets, and held-out comparison splits.
Metrics
- MCQA accuracy (% correct).
- DPO alignment score (pairwise preference accuracy).
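Pairwise preference accuracy can be approximated by checking whether the policy assigns a higher completion log-likelihood to the chosen response than to the rejected one. The sketch below uses raw policy log-probabilities rather than the DPO implicit reward (which would also involve the reference model), so it is a simplified proxy and not necessarily the team's exact metric.

import torch

def sequence_logprob(model, tok, prompt, completion):
    # Total log-probability the model assigns to `completion` given `prompt`.
    # Tokenizing the concatenation is an approximation: merges at the boundary
    # can differ slightly from tokenizing the two pieces separately.
    full = tok(prompt + completion, return_tensors="pt").input_ids.to(model.device)
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(full).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)          # predictions for tokens 1..L-1
    token_lp = logprobs.gather(-1, full[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_len - 1:].sum().item()               # count only completion tokens

def pairwise_preference_accuracy(model, tok, pairs):
    # Fraction of pairs where the chosen completion scores higher than the rejected one.
    wins = sum(
        sequence_logprob(model, tok, p["prompt"], p["chosen"])
        > sequence_logprob(model, tok, p["prompt"], p["rejected"])
        for p in pairs
    )
    return wins / len(pairs)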
Results
Anton’s DPO variant showed improved alignment consistency and format stability over SFT-only checkpoints, especially on EPFL test sets. However, general MCQA accuracy remained comparable to other team models.
Summary
- Improved coherence and answer formatting.
- Better preference-following vs. SFT baseline.
- Slightly slower inference than quantized baselines (full-precision weights); DPO training additionally requires a frozen reference model, which increases training cost.
Technical Specifications
Model Architecture and Objective
Phi-2 transformer decoder LM (~2.78B params) with LoRA adapters. Trained with:
- SFT (next-token prediction).
- DPO (pairwise reward alignment on chosen/rejected responses).
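For reference, the standard DPO objective (Rafailov et al., 2023) optimized during the alignment stage, where y_w and y_l are the chosen and rejected responses, π_ref is the frozen reference policy (the SFT checkpoint), σ is the sigmoid, and β controls the strength of the implicit KL constraint:

L_DPO(θ) = −E_{(x, y_w, y_l)} [ log σ( β ( log(π_θ(y_w|x) / π_ref(y_w|x)) − log(π_θ(y_l|x) / π_ref(y_l|x)) ) ) ]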
Software
Hugging Face TRL, PEFT, Transformers, PyTorch.
Glossary
- MCQA: Multiple-choice question answering.
- SFT: Supervised finetuning on labeled answers.
- DPO: Direct Preference Optimization for alignment with human preferences.
- LoRA: Low-Rank Adaptation for efficient fine-tuning.