Model Card for ShAIkespear/Phi-2_DPO_M3_Base_Alt
A LoRA-finetuned and **Direct Preference Optimization (DPO)**–aligned variant of microsoft/phi-2, specialized for multiple-choice question answering (MCQA) with an emphasis on STEM and general knowledge domains.
This model is the alternative base configuration of the final M3 (balanced-then-DPO) training pipeline from the ShAIkespear project. It is kept in full precision (no 8-bit quantization) for highest fidelity and to support further fine-tuning.
Model Details
- Developed by: ShAIkespear team
- Shared by: ShAIkespear team
- Model type: Causal LM (Phi-2) with LoRA adapters; DPO-aligned
- Languages: English
- License: MIT
- Finetuned from: microsoft/phi-2
Model Sources
- Repository: 2.8B-Phi-2-LLM-QA
- Report: “ShAIkespear – How to replace TAs: A comprehensive study on letting LLMs answer your questions”
Uses
Direct Use
- MCQA and educational Q&A (MMLU, OpenBookQA, ScienceQA).
- Alignment research — comparison between DPO training setups (Base vs. Quantized).
- As a high-fidelity reference checkpoint for quantized and downstream variants.
Out-of-Scope Use
- High-stakes or safety-critical applications (medical, legal, policy).
- Generative tasks outside multiple-choice reasoning.
- Misuse for automated exam solving or in ways that could leak confidential data.
Bias, Risks, and Limitations
- Domain bias: Stronger on factual MCQA, weaker on advanced reasoning tasks.
- Answer drift: may occasionally produce verbose answers or unsolicited follow-ups when the prompt lacks explicit formatting cues.
- Data source risks: EPFL-derived preferences may encode narrow style biases.
Recommendations
Maintain the structured prompt format shown below; a small prompt-building sketch follows these recommendations.

```text
### Question ...
### Explanation ...
### Answer:
```

- Keep human supervision in any educational or grading use.
- Prefer this full-precision model for fine-tuning or evaluation; use quantized versions for deployment.
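As an illustration of that template, here is a minimal prompt-building sketch; the helper name `build_mcqa_prompt` and the lettered answer options are illustrative assumptions, not part of the released code.

```python
# Illustrative helper (not from the released code): builds a prompt in the
# "### Question / ### Explanation / ### Answer" format the model was trained on.
def build_mcqa_prompt(question: str, options: list[str], explanation: str = "") -> str:
    # Label options A, B, C, ... so the model can answer with a single letter.
    lettered = "\n".join(f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options))
    return (
        f"### Question: {question}\n{lettered}\n"
        f"### Explanation: {explanation}\n"
        f"### Answer:"
    )

print(build_mcqa_prompt(
    "Which element has the chemical symbol 'O'?",
    ["Osmium", "Oxygen", "Gold", "Oganesson"],
    "The symbol 'O' represents this essential gas.",
))
```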
How to Get Started
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ShAIkespear/Phi-2_DPO_M3_Base_Alt"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
# Load in half precision (the card lists fp16/bf16); drop torch_dtype for full fp32.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# The prompt follows the "### Question / ### Explanation / ### Answer" training schema.
prompt = "### Question: Which element has the chemical symbol 'O'?\n### Explanation: The symbol 'O' represents this essential gas.\n### Answer:"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=15, do_sample=False)  # greedy decoding for a short, deterministic answer
print(tok.decode(out[0], skip_special_tokens=True))
```
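Continuing from the snippet above, MCQA scoring usually needs only the text after the final `### Answer:` marker; a minimal way to pull it out (assuming the single-letter answer convention) is:

```python
# Continue from `tok` and `out` above: keep only what follows the last
# "### Answer:" marker and trim it to the first line.
decoded = tok.decode(out[0], skip_special_tokens=True)
answer = decoded.split("### Answer:")[-1].strip().split("\n", 1)[0]
print(answer)
```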
Training Details
Training Data
- SFT stage: Balanced MCQA mix — MathQA, OpenBookQA, ScienceQA, TAL-SCQ5K, and EPFL question sets.
- DPO stage: Human preference pairs (EPFL exams + public feedback datasets like HelpSteer).
- Schema: Unified “### Question / ### Explanation / ### Answer” format.
- Filtering: sequences of at most 512 tokens, with balanced per-dataset caps (~20k examples each); a length-filter sketch follows below.
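A rough sketch of how such a token-length filter might look; the `keep_example` helper and the use of the Phi-2 tokenizer for counting are assumptions, not the project's actual preprocessing code.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/phi-2", use_fast=True)

def keep_example(text: str, max_tokens: int = 512) -> bool:
    # Keep only formatted examples whose tokenized length fits the 512-token cap.
    return len(tok(text).input_ids) <= max_tokens

examples = ["### Question: ...\n### Explanation: ...\n### Answer: B"]
filtered = [ex for ex in examples if keep_example(ex)]
```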
Training Procedure
- Pipeline: SFT → DPO (M3 configuration).
- LoRA parameters: rank = 16, α = 16, dropout = 0.05.
- Batch sizes: SFT = 4; DPO = 1.
- Learning rates: 1e-5 (public) / 1e-4 (EPFL).
- Scheduler: Cosine with warmup.
- Frameworks: Hugging Face Transformers + TRL + PEFT (LoRA); an illustrative configuration sketch follows below.
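A minimal sketch of how the DPO stage could be configured with PEFT and TRL using the hyperparameters above; the target modules, warmup ratio, toy preference pairs, and output directory are assumptions, and the exact DPOTrainer keyword names vary slightly across TRL versions.

```python
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
tok = AutoTokenizer.from_pretrained("microsoft/phi-2")
tok.pad_token = tok.eos_token  # Phi-2's tokenizer has no pad token by default

# LoRA settings from the card: rank 16, alpha 16, dropout 0.05.
# target_modules is an assumption for Phi-2's attention layers.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
)

# DPO stage per the card: batch size 1, lr 1e-5 (public data), cosine schedule with warmup.
dpo_args = DPOConfig(
    output_dir="phi2-dpo-sketch",      # placeholder path
    per_device_train_batch_size=1,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                  # assumption: exact warmup not stated in the card
    max_steps=10,                      # keep the sketch short
)

# Toy preference pairs in the prompt/chosen/rejected layout that DPOTrainer expects.
prefs = Dataset.from_dict({
    "prompt": ["### Question: 2 + 2 = ?\n### Answer:"],
    "chosen": [" 4"],
    "rejected": [" 5"],
})

trainer = DPOTrainer(model=model, args=dpo_args, train_dataset=prefs,
                     processing_class=tok, peft_config=lora_cfg)
trainer.train()
```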
Evaluation Summary
- Configuration: M3 Base (Alt) is the unquantized reference model for the quantized 8-bit variant.
- Performance: Balanced dataset improves cross-domain consistency; DPO enhances answer formatting and style alignment.
- Accuracy: similar to the quantized model (~0.61 MMLU average), slightly higher on reasoning subtasks.
- Use case: experimentation, evaluation, or further domain-specific fine-tuning; a minimal accuracy check is sketched below.
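As a purely illustrative accuracy check (the letters below are made up, not actual predictions):

```python
# Score single-letter predictions against gold answers; toy data only.
preds = ["B", "C", "A", "D"]
golds = ["B", "C", "B", "D"]
accuracy = sum(p == g for p, g in zip(preds, golds)) / len(golds)
print(f"accuracy = {accuracy:.2f}")  # 0.75 on this toy set
```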
Technical Specifications
- Architecture: Phi-2 (~2.78B parameters), decoder-only transformer.
- Objective: SFT next-token prediction + DPO preference alignment.
- Precision: Full precision (fp16/bf16).
- Software: Hugging Face Transformers, TRL, PEFT.
Glossary
- MCQA: Multiple-Choice Question Answering
- SFT: Supervised Finetuning
- DPO: Direct Preference Optimization
- LoRA: Low-Rank Adaptation
- Alt (Alternative): Internal naming for the alternate full-precision checkpoint variant of M3