---
library_name: transformers
license: mit
language:
- en
base_model:
- microsoft/phi-2
---
# Model Card for ShAIkespear/Phi-2_DPO_Anton
A LoRA-finetuned variant of **microsoft/phi-2** targeting multiple-choice question answering (MCQA) tasks, developed as part of the *ShAIkespear* EPFL project. This model was trained and aligned using Direct Preference Optimization (DPO) after Supervised Finetuning (SFT) on general and STEM MCQA datasets.
This checkpoint emphasizes *preference-based alignment* via DPO, with a focus on improving consistency and correctness on EPFL-style MCQ evaluations.
---
## Model Details
### Model Description
This model extends Phi-2 (2.78B parameters, 2,048-token context) with LoRA adapters (rank=16, α=16, dropout=0.05). Training followed an *SFT → DPO* sequence built on the Hugging Face TRL and PEFT frameworks, without any quantization.
* **Developed by:** ShAIkespear team
* **Shared by:** ShAIkespear team
* **Model type:** Causal decoder-only LM (Phi-2) with LoRA adapters; DPO-aligned MCQA model
* **Language(s):** English
* **License:** MIT
* **Finetuned from model:** microsoft/phi-2
---
### Model Sources
* **Repository:** [2.8B-Phi-2-LLM-QA](https://github.com/EricSaikali/2.8B-Phi-2-LLM-QA)
* **Report:** “ShAIkespear – How to replace TAs: A comprehensive study on letting LLMs answer your questions”
---
## Uses
### Direct Use
* Multiple-choice question answering (MCQA) for educational and benchmarking contexts (e.g., MMLU, ScienceQA).
* Research into DPO-aligned fine-tuning and human preference optimization.
* Exploration of prompt-format sensitivity for educational question-answering.
### Out-of-Scope Use
* Critical decision-making (medical, legal, financial) without human oversight.
* Generative free-form writing or extended reasoning tasks (model tuned specifically for MCQA).
* Any form of automated exam-taking or misuse of assessment materials.
---
## Bias, Risks, and Limitations
* **Performance Variance:** STEM questions remain challenging; accuracy is near random (~0.25) on complex math and science items.
* **Overalignment:** DPO training may overly prioritize stylistic preference patterns, slightly reducing factual robustness.
* **Data Sensitivity:** Includes human preference data from EPFL coursework; may reflect biases of a limited annotator pool.
### Recommendations
* Use for controlled research or tutoring scenarios, not autonomous grading.
* Always combine with explicit answer-format prompting (e.g., “### Answer: A”).
* Include human oversight for content validation or pedagogical deployment.
---
## How to Get Started with the Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ShAIkespear/Phi-2_DPO_Anton"

# Load the tokenizer and the DPO-aligned model; device_map="auto" places it on GPU if available.
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,  # optional: half precision for lighter inference; drop to stay at full precision
)

# Prompt in the ### Question / ### Explanation / ### Answer format the model was trained on.
prompt = "### Question: Which gas is most abundant in Earth's atmosphere?\n### Explanation: Identify the main atmospheric component.\n### Answer:"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=15)
print(tok.decode(out[0], skip_special_tokens=True))
```
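The model is tuned to complete the `### Answer:` field, so a small post-processing step can pull out the predicted answer. Below is a minimal sketch of such a step (the helper name and the regex are ours, not part of the released code); it reuses `tok` and `out` from the snippet above:

```python
import re

def extract_answer(generated: str) -> str:
    """Return whatever follows the last '### Answer:' marker (a choice letter or a short phrase)."""
    tail = generated.rsplit("### Answer:", 1)[-1]
    # Stop at the next newline or section marker, in case the model kept generating.
    return re.split(r"\n|###", tail, maxsplit=1)[0].strip()

print(extract_answer(tok.decode(out[0], skip_special_tokens=True)))
```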
---
## Training Details
### Training Data
Fine-tuning and preference alignment were performed on a mix of:
* **Public MCQA datasets:** MathQA, OpenBookQA, ScienceQA, and TAL-SCQ5K.
* **Private data:** EPFL student-curated MCQA and preference pairs (~20–30k).
* Filtered to examples of ≤512 tokens; each dataset was truncated to 20k samples (see the sketch after this list).
* Splits roughly: 50% train, 25% test_overfit, 10% comparison, 15% held-out.
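A rough sketch of how that filtering and truncation could be reproduced with the `datasets` library; the dataset id (`openbookqa`), the column name (`question_stem`), and the exact token-budget rule are illustrative and may differ from the project's actual preprocessing:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/phi-2")

# One of the public MCQA sources, used here only as an example.
ds = load_dataset("openbookqa", "main", split="train")

# Keep examples that fit the 512-token budget.
ds = ds.filter(lambda ex: len(tok(ex["question_stem"])["input_ids"]) <= 512)

# Cap the dataset at 20k samples.
ds = ds.select(range(min(20_000, len(ds))))
```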
### Training Procedure
#### Preprocessing
All datasets were mapped to a unified MCQA schema (id, subject, question, choices, correct answer).
Prompts structured as:
`### Question ... ### Explanation ... ### Answer`
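As an illustration of that template, a hypothetical formatting helper (field names follow the unified schema above; the exact separators used in training may differ) could look like this:

```python
def format_mcqa_prompt(question: str, choices: list[str], explanation: str = "") -> str:
    """Render one unified-schema MCQA example into the ### Question / ### Explanation / ### Answer template."""
    lettered = "\n".join(f"{chr(65 + i)}. {choice}" for i, choice in enumerate(choices))
    return (
        f"### Question: {question}\n{lettered}\n"
        f"### Explanation: {explanation}\n"
        "### Answer:"
    )

print(format_mcqa_prompt(
    "Which gas is most abundant in Earth's atmosphere?",
    ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon"],
))
```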
For DPO:
* **Chosen/Rejected** preference pairs constructed from student feedback or model-comparison data.
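TRL's `DPOTrainer` consumes preference data as `prompt` / `chosen` / `rejected` triples; a hypothetical pair built with the template sketched above might look like the following (real pairs contain full student or model responses rather than single letters):

```python
preference_example = {
    "prompt": format_mcqa_prompt(
        "Which gas is most abundant in Earth's atmosphere?",
        ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon"],
    ),
    "chosen": " B",    # response preferred by annotators or the comparison model
    "rejected": " A",  # dispreferred response
}
```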
#### Training Hyperparameters
* **Regime:** Mixed precision (fp16/bf16).
* **LoRA:** rank=16, α=16, dropout=0.05.
* **Batch size:** SFT = 4; DPO = 1.
* **Learning rate:** 1e-5 for public data; 1e-4 for EPFL preference data.
* **Scheduler:** Cosine with warmup.
* **Frameworks:** Hugging Face TRL + PEFT/LoRA.
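A minimal sketch of how these settings might be wired together for the DPO stage with PEFT and TRL. The toy dataset, warmup ratio, and output directory are placeholders, and `DPOConfig`/`DPOTrainer` argument names vary across TRL releases, so treat this as illustrative rather than the project's actual training script:

```python
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# LoRA settings reported above.
peft_config = LoraConfig(r=16, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")

# Toy preference set in TRL's prompt/chosen/rejected format (real training used ~20-30k pairs).
preference_dataset = Dataset.from_list([
    {"prompt": "### Question: 2 + 2 = ?\nA. 3\nB. 4\n### Answer:", "chosen": " B", "rejected": " A"},
])

args = DPOConfig(
    output_dir="phi2-dpo-sketch",
    per_device_train_batch_size=1,  # DPO batch size reported above
    learning_rate=1e-4,             # rate used for the EPFL preference data
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,               # warmup fraction assumed, not reported
    bf16=True,                      # mixed-precision regime
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=preference_dataset,
    processing_class=tokenizer,     # named `tokenizer=` in older TRL releases
    peft_config=peft_config,        # the frozen reference model is derived internally when PEFT is used
)
trainer.train()
```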
---
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
MMLU (STEM subset), EPFL preference test sets, and held-out comparison splits.
#### Metrics
* MCQA accuracy (% correct).
* DPO alignment score (pairwise preference accuracy).
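Both metrics reduce to simple counts over the evaluation sets. A small sketch follows (function names are ours); the pairwise scores would typically be the model's log-likelihoods or DPO implicit rewards for each response:

```python
def mcqa_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of questions where the predicted choice matches the gold choice."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)

def pairwise_preference_accuracy(chosen_scores: list[float], rejected_scores: list[float]) -> float:
    """Fraction of preference pairs where the chosen response outscores the rejected one."""
    return sum(c > r for c, r in zip(chosen_scores, rejected_scores)) / len(chosen_scores)

print(mcqa_accuracy(["B", "C", "A"], ["B", "C", "D"]))         # 0.67
print(pairwise_preference_accuracy([1.2, -0.3], [0.4, 0.1]))   # 0.5
```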
### Results
Anton’s DPO variant showed improved *alignment consistency* and *format stability* over SFT-only checkpoints, especially on EPFL test sets.
However, general MCQA accuracy remained comparable to other team models.
#### Summary
* Improved coherence and answer formatting.
* Better preference-following vs. SFT baseline.
* Slightly slower, due to full-precision weights and the extra reference model required during DPO training.
---
## Technical Specifications
### Model Architecture and Objective
Phi-2 transformer decoder LM (~2.78B params) with LoRA adapters.
Trained with:
* SFT (next-token prediction).
* DPO (pairwise reward alignment on chosen/rejected responses).
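For reference, the second stage optimizes the standard DPO objective over preference pairs of chosen and rejected responses, pushing the policy to prefer the chosen response relative to a frozen reference model:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        \;-\;
        \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Here y_w and y_l are the chosen and rejected responses, and β (the `beta` hyperparameter in TRL) controls how far the policy may drift from the reference model.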
#### Software
Hugging Face TRL, PEFT, Transformers, PyTorch.
---
## Glossary
* **MCQA:** Multiple-choice question answering.
* **SFT:** Supervised finetuning on labeled answers.
* **DPO:** Direct Preference Optimization for alignment with human preferences.
* **LoRA:** Low-Rank Adaptation for efficient fine-tuning.