---
library_name: transformers
license: mit
language:
- en
base_model:
- microsoft/phi-2
---

# Model Card for ShAIkespear/Phi-2_DPO_Anton

A LoRA-finetuned variant of **microsoft/phi-2** targeting multiple-choice question answering (MCQA) tasks, developed as part of the *ShAIkespear* EPFL project. This model was trained and aligned using Direct Preference Optimization (DPO) after Supervised Finetuning (SFT) on general and STEM MCQA datasets.

This checkpoint emphasizes *preference-based fine-tuning* with DPO, focusing on improving consistency and correctness on EPFL-style MCQ evaluations.

---

## Model Details

### Model Description

This model extends Phi-2 (2.78B parameters, 2,048-token context length) with LoRA adapters (rank=16, α=16, dropout=0.05). Training followed the *SFT → DPO* sequence, leveraging the Hugging Face TRL and PEFT frameworks. The model was trained at full precision, without quantization.

* **Developed by:** ShAIkespear team
* **Shared by:** ShAIkespear team
* **Model type:** Causal decoder-only LM (Phi-2) with LoRA adapters; DPO-aligned MCQA model
* **Language(s):** English
* **License:** MIT
* **Finetuned from model:** microsoft/phi-2

---

### Model Sources

* **Repository:** [2.8B-Phi-2-LLM-QA](https://github.com/EricSaikali/2.8B-Phi-2-LLM-QA)
* **Report:** “ShAIkespear – How to replace TAs: A comprehensive study on letting LLMs answer your questions”

---

## Uses

### Direct Use

* Multiple-choice question answering (MCQA) for educational and benchmarking contexts (e.g., MMLU, ScienceQA).
* Research into DPO-aligned fine-tuning and human preference optimization.
* Exploration of prompt-format sensitivity for educational question answering.

### Out-of-Scope Use

* Critical decision-making (medical, legal, financial) without human oversight.
* Generative free-form writing or extended reasoning tasks (the model is tuned specifically for MCQA).
* Any form of automated exam-taking or misuse of assessment materials.

---

## Bias, Risks, and Limitations

* **Performance variance:** STEM questions remain challenging; accuracy is near random (~0.25) on complex math and science items.
* **Overalignment:** DPO training may overly prioritize stylistic preference patterns, slightly reducing factual robustness.
* **Data sensitivity:** Training includes human preference data from EPFL coursework and may reflect the biases of a limited annotator pool.

### Recommendations

* Use for controlled research or tutoring scenarios, not autonomous grading.
* Always combine with explicit answer-format prompting (e.g., “### Answer: A”).
* Include human oversight for content validation or pedagogical deployment.

---

## How to Get Started with the Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ShAIkespear/Phi-2_DPO_Anton"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = (
    "### Question: Which gas is most abundant in Earth's atmosphere?\n"
    "### Explanation: Identify the main atmospheric component.\n"
    "### Answer:"
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=15)
print(tok.decode(out[0], skip_special_tokens=True))
```

---

## Training Details

### Training Data

Fine-tuning and preference alignment were performed on a mix of:

* **Public MCQA datasets:** MathQA, OpenBookQA, ScienceQA, and TAL-SCQ5K.
* **Private data:** EPFL student-curated MCQA and preference pairs (~20–30k).
* Samples filtered to ≤512 tokens; each dataset truncated to 20k samples (one way to reproduce this is sketched below).
* Splits roughly: 50% train, 25% test_overfit, 10% comparison, 15% held-out.
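The following is a minimal, non-authoritative sketch of how the length filter, per-dataset cap, and rough splits above could be reproduced with the Hugging Face `datasets` library. OpenBookQA and its `question_stem` column are example choices only; the team's actual preprocessing scripts live in the project repository.

```python
# Illustrative sketch only: length filtering, per-dataset truncation, and rough splits.
# OpenBookQA is used as an example source; column names differ for the other datasets,
# and the team's exact filter may count the full prompt rather than the question alone.
from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/phi-2")

ds = load_dataset("openbookqa", "main", split="train")

# Keep only samples whose question text fits in 512 tokens.
ds = ds.filter(lambda ex: len(tok(ex["question_stem"]).input_ids) <= 512)

# Cap each dataset at 20k samples.
ds = ds.select(range(min(20_000, len(ds))))

# Rough splits: 50% train, with the remainder divided into the smaller evaluation sets.
first = ds.train_test_split(test_size=0.5, seed=42)
train, rest = first["train"], first["test"]
second = rest.train_test_split(test_size=0.5, seed=42)
test_overfit, comparison_and_heldout = second["train"], second["test"]
```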
### Training Procedure

#### Preprocessing

All sources were converted to a unified MCQA schema (id, subject, question, choices, correct answer), and prompts were structured as:

`### Question ... ### Explanation ... ### Answer`

For DPO:

* **Chosen/rejected** preference pairs were constructed from student feedback or model-comparison data.

#### Training Hyperparameters

* **Regime:** Mixed precision (fp16/bf16).
* **LoRA:** rank=16, α=16, dropout=0.05.
* **Batch size:** SFT = 4; DPO = 1.
* **Learning rate:** 1e-5 for public data; 1e-4 for EPFL preference data.
* **Scheduler:** Cosine with warmup.
* **Frameworks:** Hugging Face TRL + PEFT/LoRA (an illustrative setup is sketched in the appendix below).

---

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

MMLU (STEM subset), EPFL preference test sets, and held-out comparison splits.

#### Metrics

* MCQA accuracy (% correct).
* DPO alignment score (pairwise preference accuracy).

### Results

Anton’s DPO variant showed improved *alignment consistency* and *format stability* over SFT-only checkpoints, especially on the EPFL test sets. However, general MCQA accuracy remained comparable to the team’s other models.

#### Summary

* Improved coherence and answer formatting.
* Better preference-following than the SFT baseline.
* Slightly slower inference (due to full precision and DPO reference model usage).

---

## Technical Specifications

### Model Architecture and Objective

Phi-2 transformer decoder LM (~2.78B parameters) with LoRA adapters. Trained with:

* SFT (next-token prediction).
* DPO (pairwise preference alignment on chosen/rejected responses).

#### Software

Hugging Face TRL, PEFT, Transformers, PyTorch.

---

## Glossary

* **MCQA:** Multiple-choice question answering.
* **SFT:** Supervised finetuning on labeled answers.
* **DPO:** Direct Preference Optimization for alignment with human preferences.
* **LoRA:** Low-Rank Adaptation for efficient fine-tuning.
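---

## Appendix: Illustrative DPO Training Sketch

The snippet below is a minimal, non-authoritative sketch of the DPO stage described under *Training Details*, assuming a recent TRL release (with `DPOConfig` and the `processing_class` argument) together with PEFT. The toy preference pair, LoRA target modules, `beta`, and warmup values are illustrative assumptions rather than the team's exact configuration; the actual training code is in the project repository.

```python
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "microsoft/phi-2"
tok = AutoTokenizer.from_pretrained(base_id)
tok.pad_token = tok.eos_token  # Phi-2's tokenizer has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base_id)

# Preference data in the prompt/chosen/rejected format expected by DPOTrainer;
# this single toy pair stands in for the EPFL preference dataset.
pairs = Dataset.from_dict({
    "prompt": ["### Question: 2 + 2 = ?\n(A) 3 (B) 4 (C) 5\n"
               "### Explanation: Add the two integers.\n### Answer:"],
    "chosen": [" B"],
    "rejected": [" C"],
})

# LoRA settings from this card; the target-module list is an assumption.
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
)

args = DPOConfig(
    output_dir="phi2-dpo-sketch",
    per_device_train_batch_size=1,  # DPO batch size from this card
    learning_rate=1e-4,             # rate used for the EPFL preference data
    lr_scheduler_type="cosine",     # cosine schedule with warmup
    warmup_ratio=0.1,               # warmup amount is an assumption
    beta=0.1,                       # DPO temperature (TRL default)
    max_length=512,                 # matches the 512-token data filter
    report_to="none",
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with peft_config, the frozen base weights serve as the reference
    args=args,
    train_dataset=pairs,
    processing_class=tok,
    peft_config=peft_config,
)
trainer.train()
trainer.save_model("phi2-dpo-sketch")
```

Passing `ref_model=None` together with a `peft_config` lets TRL use the frozen base weights (with adapters disabled) as the DPO reference model, avoiding a second full copy of Phi-2 in memory.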