---
library_name: transformers
license: mit
language:
  - en
base_model:
  - microsoft/phi-2
---

# Model Card for ShAIkespear/Phi-2_DPO_M3_Base_Alt

A **LoRA-finetuned** and **Direct Preference Optimization (DPO)**–aligned variant of **microsoft/phi-2**, specialized for **multiple-choice question answering (MCQA)** with an emphasis on **STEM and general knowledge** domains.

This model is the *alternative base configuration* of the final **M3 (balanced-then-DPO)** training pipeline from the *ShAIkespear* project. It is kept in full precision (no 8-bit quantization) for highest fidelity and ease of further fine-tuning.

---

## Model Details

* **Developed by:** ShAIkespear team
* **Shared by:** ShAIkespear team
* **Model type:** Causal LM (Phi-2) with LoRA adapters; DPO-aligned
* **Languages:** English
* **License:** MIT
* **Finetuned from:** microsoft/phi-2

### Model Sources

* **Repository:** [2.8B-Phi-2-LLM-QA](https://github.com/EricSaikali/2.8B-Phi-2-LLM-QA)
* **Report:** *“ShAIkespear – How to replace TAs: A comprehensive study on letting LLMs answer your questions”*

---

## Uses

### Direct Use

* MCQA and educational Q&A (MMLU, OpenBookQA, ScienceQA).
* Alignment research: comparing DPO training setups (Base vs. Quantized).
* A **high-fidelity reference checkpoint** for quantized and downstream variants.

### Out-of-Scope Use

* High-stakes or safety-critical applications (medical, legal, policy).
* Generative tasks outside multiple-choice reasoning.
* Automated exam solving or any use that leaks confidential data.

---

## Bias, Risks, and Limitations

* **Domain bias:** Stronger on factual MCQA, weaker on advanced reasoning tasks.
* **Answer drift:** May occasionally produce verbose or follow-up answers when the prompt lacks explicit formatting.
* **Data source risks:** EPFL-derived preferences may encode narrow style biases.

### Recommendations

* Maintain the structured prompt format:

  ```
  ### Question
  ...
  ### Explanation
  ...
  ### Answer:
  ```

* Keep human supervision in any educational or grading use.
* Prefer this full-precision model for fine-tuning or evaluation; use quantized versions for deployment.

---

## How to Get Started

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ShAIkespear/Phi-2_DPO_M3_Base_Alt"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,  # full-precision checkpoint; load in fp16 to save memory
)

# Prompt in the card's structured "### Question / ### Explanation / ### Answer" schema.
prompt = (
    "### Question: Which element has the chemical symbol 'O'?\n"
    "### Explanation: The symbol 'O' represents this essential gas.\n"
    "### Answer:"
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=15, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
```

---

## Training Details

### Training Data

* **SFT stage:** Balanced MCQA mix: MathQA, OpenBookQA, ScienceQA, TAL-SCQ5K, and EPFL question sets.
* **DPO stage:** Human preference pairs (EPFL exams + public feedback datasets such as HelpSteer).
* **Schema:** Unified “### Question / ### Explanation / ### Answer” format.
* **Filtering:** ≤512 tokens, balanced sample caps (~20k per dataset).

### Training Procedure

* **Pipeline:** SFT → DPO (M3 configuration); see the configuration sketch after this list.
* **LoRA parameters:** rank = 16, α = 16, dropout = 0.05.
* **Batch sizes:** SFT = 4; DPO = 1.
* **Learning rates:** 1e-5 (public) / 1e-4 (EPFL).
* **Scheduler:** Cosine with warmup.
* **Frameworks:** Hugging Face Transformers + TRL + PEFT (LoRA).
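The exact training scripts live in the project repository; the following is only a minimal sketch of how the DPO stage could be wired up with TRL and PEFT using the hyperparameters listed above. The toy preference pair, `output_dir`, `warmup_ratio`, and DPO `beta` are assumptions not documented by the project, and keyword names vary across TRL releases (e.g., `processing_class` vs. the older `tokenizer`).

```python
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# The real pipeline starts DPO from the SFT checkpoint; the base model stands in here.
model_id = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token  # phi-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_id)

# LoRA hyperparameters as listed in the card: rank 16, alpha 16, dropout 0.05.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Toy preference pair in the card's prompt schema (illustrative only).
train_dataset = Dataset.from_list([
    {
        "prompt": "### Question: What is 2 + 2?\n### Explanation: Basic addition.\n### Answer:",
        "chosen": " 4",
        "rejected": " 5",
    },
])

# Card values: DPO batch size 1, lr 1e-5 (public data), cosine schedule with warmup.
# beta and warmup_ratio are assumed defaults, not taken from the project.
training_args = DPOConfig(
    output_dir="phi2-dpo-m3",  # hypothetical path
    per_device_train_batch_size=1,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    beta=0.1,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older TRL releases use tokenizer= instead
    peft_config=lora_config,
)
trainer.train()
```

With `peft_config` supplied, TRL trains only the LoRA adapters and uses the frozen base weights as the implicit reference model, so no separate `ref_model` needs to be loaded.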
---

## Evaluation Summary

* **Configuration:** *M3 Base (Alt)* is the unquantized reference model for the quantized 8-bit variant.
* **Performance:** The balanced dataset improves cross-domain consistency; DPO improves answer formatting and style alignment.
* **Accuracy:** Similar to the quantized model (~0.61 MMLU avg.), slightly higher on reasoning subtasks. An illustrative scoring sketch appears at the end of this card.
* **Use case:** Experimentation, evaluation, or further domain-specific fine-tuning.

---

## Technical Specifications

* **Architecture:** Phi-2 (~2.78B parameters), decoder-only transformer.
* **Objective:** SFT next-token prediction + DPO preference alignment.
* **Precision:** Full precision (fp16/bf16).
* **Software:** Hugging Face Transformers, TRL, PEFT.

---

## Glossary

* **MCQA:** Multiple-Choice Question Answering
* **SFT:** Supervised Fine-Tuning
* **DPO:** Direct Preference Optimization
* **LoRA:** Low-Rank Adaptation
* **Alt (Alternative):** Internal naming for the alternate full-precision checkpoint variant of M3
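---

## Appendix: MCQA Scoring Sketch

The MMLU numbers above come from the project's own evaluation pipeline, which is not reproduced here. As a rough illustration of how accuracy can be scored against this card's prompt schema, here is a minimal sketch; the toy item, option labels, and answer-extraction regex are assumptions, not the project's actual harness.

```python
import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ShAIkespear/Phi-2_DPO_M3_Base_Alt"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

# Toy MCQA items in the card's "### Question / ### Explanation / ### Answer" schema.
items = [
    {
        "question": (
            "Which planet is known as the Red Planet?\n"
            "A. Venus\nB. Mars\nC. Jupiter\nD. Saturn"
        ),
        "gold": "B",
    },
]

correct = 0
for item in items:
    prompt = f"### Question: {item['question']}\n### Explanation:"
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=False,
        pad_token_id=tok.eos_token_id,
    )
    text = tok.decode(out[0], skip_special_tokens=True)
    # Keep the first option letter that follows the "### Answer:" marker, if any.
    match = re.search(r"### Answer:\s*([A-D])", text)
    if match and match.group(1) == item["gold"]:
        correct += 1

print(f"accuracy = {correct / len(items):.2f}")
```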