---
library_name: transformers
license: mit
language:
- en
base_model:
- microsoft/phi-2
---


# Model Card for ShAIkespear/Phi-2_DPO_Anton

A LoRA-finetuned variant of **microsoft/phi-2** targeting multiple-choice question answering (MCQA) tasks, developed as part of the *ShAIkespear* EPFL project. This model was trained and aligned using Direct Preference Optimization (DPO) after Supervised Finetuning (SFT) on general and STEM MCQA datasets.

This checkpoint emphasizes *preference-based fine-tuning* with DPO, aimed at improving consistency and correctness on EPFL-style MCQ evaluations.

---

## Model Details

### Model Description

This model extends Phi-2 (2.78B parameters, 2,048 context length) using LoRA adapters (rank=16, α=16, dropout=0.05). Training followed the *SFT → DPO* sequence, leveraging the Hugging Face TRL and PEFT frameworks. The model was trained without weight quantization.

* **Developed by:** ShAIkespear team 
* **Shared by:** ShAIkespear team
* **Model type:** Causal decoder-only LM (Phi-2) with LoRA adapters; DPO-aligned MCQA model
* **Language(s):** English
* **License:** MIT
* **Finetuned from model:** microsoft/phi-2
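
As a minimal sketch of the adapter setup described above, the LoRA configuration (rank=16, α=16, dropout=0.05) could be attached to Phi-2 with PEFT roughly as follows. This is illustrative rather than the project's actual training script; in particular, the `target_modules` list is an assumption.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base model without quantization.
base = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

# LoRA hyperparameters from the model description; target_modules is an assumed
# choice of Phi-2 attention/MLP projection layers.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights should be trainable
```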

---

### Model Sources

* **Repository:** [2.8B-Phi-2-LLM-QA](https://github.com/EricSaikali/2.8B-Phi-2-LLM-QA)
* **Report:** “ShAIkespear – How to replace TAs: A comprehensive study on letting LLMs answer your questions”

---

## Uses

### Direct Use

* Multiple-choice question answering (MCQA) for educational and benchmarking contexts (e.g., MMLU, ScienceQA).
* Research into DPO-aligned fine-tuning and human preference optimization.
* Exploration of prompt-format sensitivity for educational question-answering.

### Out-of-Scope Use

* Critical decision-making (medical, legal, financial) without human oversight.
* Generative free-form writing or extended reasoning tasks (model tuned specifically for MCQA).
* Any form of automated exam-taking or misuse of assessment materials.

---

## Bias, Risks, and Limitations

* **Performance Variance:** STEM questions remain challenging; accuracy is near random (~0.25) on complex math and science items.
* **Overalignment:** DPO training may overly prioritize stylistic preference patterns, slightly reducing factual robustness.
* **Data Sensitivity:** Includes human preference data from EPFL coursework; may reflect biases of a limited annotator pool.

### Recommendations

* Use for controlled research or tutoring scenarios, not autonomous grading.
* Always combine with explicit answer-format prompting (e.g., “### Answer: A”).
* Include human oversight for content validation or pedagogical deployment.

---

## How to Get Started with the Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ShAIkespear/Phi-2_DPO_Anton"

# Load the tokenizer and the full-precision (non-quantized) weights.
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,
    device_map="auto",
)

# Use the "### Question / ### Explanation / ### Answer" format the model was trained on.
prompt = "### Question: Which gas is most abundant in Earth's atmosphere?\n### Explanation: Identify the main atmospheric component.\n### Answer:"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=15)
print(tok.decode(out[0], skip_special_tokens=True))
```
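
For MCQA prompts that list lettered options, the predicted letter can be pulled out of the generated text by looking after the final `### Answer:` marker. This is a hedged convenience sketch, not part of the released model code:

```python
import re

def extract_answer_letter(generated_text):
    """Return the first option letter (A-D) after the last '### Answer:' marker, or None."""
    answer_part = generated_text.rsplit("### Answer:", 1)[-1]
    match = re.search(r"\b([A-D])\b", answer_part)
    return match.group(1) if match else None

# Example with a literal completion; in practice, pass the decoded model output.
print(extract_answer_letter("### Question: ...\n### Answer: B"))  # -> "B"
```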

---

## Training Details

### Training Data

Fine-tuning and preference alignment were performed on a mix of:

* **Public MCQA datasets:** MathQA, OpenBookQA, ScienceQA, and TAL-SCQ5K.
* **Private data:** EPFL student-curated MCQA and preference pairs (~20–30k).
* Samples were filtered to ≤512 tokens, and each dataset was truncated to 20k samples (see the sketch after this list).
* Splits were roughly 50% train, 25% test_overfit, 10% comparison, and 15% held-out.
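
A hedged sketch of the length filter and truncation using the `datasets` library; the dataset and its field names here are placeholders standing in for the actual mix:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/phi-2")

# Placeholder: the project mixes MathQA, OpenBookQA, ScienceQA, TAL-SCQ5K and private EPFL data.
ds = load_dataset("openbookqa", "main", split="train")

def short_enough(example):
    # Keep only samples whose question plus options fit within 512 tokens.
    text = example["question_stem"] + " " + " ".join(example["choices"]["text"])
    return len(tok(text)["input_ids"]) <= 512

ds = ds.filter(short_enough)
ds = ds.select(range(min(len(ds), 20_000)))  # truncate each dataset to 20k samples
```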

### Training Procedure

#### Preprocessing

All datasets were mapped to a unified MCQA schema (id, subject, question, choices, correct answer).
Prompts were structured as:
`### Question ... ### Explanation ... ### Answer`
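
A minimal sketch of rendering a unified-schema sample into this prompt format (the field names are assumptions based on the schema above):

```python
def format_mcqa_prompt(sample):
    """Render a unified-schema MCQA sample into the '### Question/Explanation/Answer' format."""
    letters = "ABCD"
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(sample["choices"]))
    return (
        f"### Question: {sample['question']}\n{options}\n"
        f"### Explanation: {sample.get('explanation', '')}\n"
        f"### Answer: {sample['correct_answer']}"
    )

example = {
    "question": "Which gas is most abundant in Earth's atmosphere?",
    "choices": ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon"],
    "explanation": "Nitrogen makes up roughly 78% of the atmosphere.",
    "correct_answer": "B",
}
print(format_mcqa_prompt(example))
```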

For DPO:

* **Chosen/rejected** preference pairs were constructed from student feedback or model-comparison data; the record format is sketched below.
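
TRL's `DPOTrainer` expects records with `prompt`, `chosen`, and `rejected` fields; a hypothetical pair might look like this:

```python
# One hypothetical preference pair in the prompt/chosen/rejected format used by TRL.
pair = {
    "prompt": "### Question: Which gas is most abundant in Earth's atmosphere?\n"
              "A. Oxygen\nB. Nitrogen\nC. Carbon dioxide\nD. Argon\n### Answer:",
    "chosen": " B",    # preferred completion (correct and well formatted)
    "rejected": " A",  # dispreferred completion (incorrect or poorly formatted)
}
```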

#### Training Hyperparameters

* **Regime:** Mixed precision (fp16/bf16).
* **LoRA:** rank=16, α=16, dropout=0.05.
* **Batch size:** SFT = 4; DPO = 1.
* **Learning rate:** 1e-5 for public data; 1e-4 for EPFL preference data.
* **Scheduler:** Cosine with warmup.
* **Frameworks:** Hugging Face TRL + PEFT/LoRA.
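
A rough sketch of wiring these hyperparameters into TRL is shown below. Exact argument names vary across TRL versions, and `beta`, `warmup_ratio`, and the `preference_dataset` / `model` / `tok` variables (a prompt/chosen/rejected dataset, the LoRA-wrapped model, and the tokenizer) are assumptions rather than the project's exact configuration.

```python
from trl import DPOConfig, DPOTrainer

# Hyperparameters taken from the list above; the rest are assumed defaults.
args = DPOConfig(
    output_dir="phi2-dpo",
    per_device_train_batch_size=1,   # DPO batch size = 1
    learning_rate=1e-4,              # 1e-4 for the EPFL preference data
    lr_scheduler_type="cosine",      # cosine schedule with warmup
    warmup_ratio=0.1,                # assumed warmup fraction
    bf16=True,                       # mixed precision
    beta=0.1,                        # assumed DPO temperature
)

trainer = DPOTrainer(
    model=model,                       # SFT'd Phi-2 with LoRA adapters
    ref_model=None,                    # with a PEFT model, TRL derives the frozen reference by disabling the adapters
    args=args,
    train_dataset=preference_dataset,  # records with prompt/chosen/rejected fields
    processing_class=tok,              # older TRL versions use tokenizer= instead
)
trainer.train()
```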

---

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

MMLU (STEM subset), EPFL preference test sets, and held-out comparison splits.

#### Metrics

* MCQA accuracy (% correct).
* DPO alignment score (pairwise preference accuracy).
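
Pairwise preference accuracy can be read as the fraction of pairs for which the tuned model assigns a higher total log-probability to the chosen answer than to the rejected one. A simplified sketch (it assumes the prompt's tokenization is a prefix of the full sequence's tokenization):

```python
import torch

@torch.no_grad()
def completion_logprob(model, tok, prompt, completion):
    """Sum of token log-probabilities the model assigns to `completion` given `prompt`."""
    full = tok(prompt + completion, return_tensors="pt").to(model.device)
    prompt_len = tok(prompt, return_tensors="pt")["input_ids"].shape[1]
    logits = model(**full).logits[:, :-1]   # position t predicts token t+1
    targets = full["input_ids"][:, 1:]
    logprobs = torch.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return logprobs[:, prompt_len - 1:].sum().item()  # score only the completion tokens

def preference_accuracy(model, tok, pairs):
    """Fraction of prompt/chosen/rejected pairs where the chosen answer scores higher."""
    wins = sum(
        completion_logprob(model, tok, p["prompt"], p["chosen"])
        > completion_logprob(model, tok, p["prompt"], p["rejected"])
        for p in pairs
    )
    return wins / len(pairs)
```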

### Results

Anton’s DPO variant showed improved *alignment consistency* and *format stability* over SFT-only checkpoints, especially on EPFL test sets.
However, general MCQA accuracy remained comparable to the team's other models.

#### Summary

* Improved coherence and answer formatting.
* Better preference-following vs. SFT baseline.
* Slightly slower inference (due to full precision and DPO reference model usage).

---

## Technical Specifications

### Model Architecture and Objective

Phi-2 transformer decoder LM (~2.78B params) with LoRA adapters.
Trained with:

* SFT (next-token prediction).
* DPO (pairwise reward alignment on chosen/rejected responses).
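
For reference, the standard DPO objective used for this pairwise alignment is

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],
$$

where $y_w$ and $y_l$ are the chosen and rejected responses, $\pi_{\mathrm{ref}}$ is the frozen SFT reference model, and $\beta$ controls how far the policy may drift from the reference.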

#### Software

Hugging Face TRL, PEFT, Transformers, PyTorch.

---

## Glossary

* **MCQA:** Multiple-choice question answering.
* **SFT:** Supervised finetuning on labeled answers.
* **DPO:** Direct Preference Optimization for alignment with human preferences.
* **LoRA:** Low-Rank Adaptation for efficient fine-tuning.