---
library_name: transformers
license: mit
language:
- en
base_model:
- microsoft/phi-2
---


# Model Card for ShAIkespear/Phi-2_DPO_Anton

A LoRA-finetuned variant of **microsoft/phi-2** targeting multiple-choice question answering (MCQA) tasks, developed as part of the *ShAIkespear* EPFL project. This model was trained and aligned using Direct Preference Optimization (DPO) after Supervised Finetuning (SFT) on general and STEM MCQA datasets.

This checkpoint emphasizes *preference-based fine-tuning* with DPO, aimed at improving consistency and correctness on EPFL-style MCQ evaluations.

---

## Model Details

### Model Description

This model extends Phi-2 (2.78B parameters, 2,048 context length) using LoRA adapters (rank=16, α=16, dropout=0.05). Training followed the *SFT → DPO* sequence, leveraging the Hugging Face TRL and PEFT frameworks. The model was trained without weight quantization.

* **Developed by:** ShAIkespear team 
* **Shared by:** ShAIkespear team
* **Model type:** Causal decoder-only LM (Phi-2) with LoRA adapters; DPO-aligned MCQA model
* **Language(s):** English
* **License:** MIT
* **Finetuned from model:** microsoft/phi-2
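
As a minimal sketch of the adapter setup described above, the LoRA configuration (rank=16, α=16, dropout=0.05) could be attached to Phi-2 with PEFT roughly as follows. This is illustrative rather than the project's actual training script; in particular, the `target_modules` list is an assumption.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base model without quantization.
base = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

# LoRA hyperparameters from the model description; target_modules is an assumed
# choice of Phi-2 attention/MLP projection layers.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights should be trainable
```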

---

### Model Sources

* **Repository:** [2.8B-Phi-2-LLM-QA](https://github.com/EricSaikali/2.8B-Phi-2-LLM-QA)
* **Report:** “ShAIkespear – How to replace TAs: A comprehensive study on letting LLMs answer your questions”

---

## Uses

### Direct Use

* Multiple-choice question answering (MCQA) for educational and benchmarking contexts (e.g., MMLU, ScienceQA).
* Research into DPO-aligned fine-tuning and human preference optimization.
* Exploration of prompt-format sensitivity for educational question-answering.

### Out-of-Scope Use

* Critical decision-making (medical, legal, financial) without human oversight.
* Generative free-form writing or extended reasoning tasks (model tuned specifically for MCQA).
* Any form of automated exam-taking or misuse of assessment materials.

---

## Bias, Risks, and Limitations

* **Performance Variance:** STEM questions remain challenging; accuracy is near random (~0.25) on complex math and science items.
* **Overalignment:** DPO training may overly prioritize stylistic preference patterns, slightly reducing factual robustness.
* **Data Sensitivity:** Includes human preference data from EPFL coursework; may reflect biases of a limited annotator pool.

### Recommendations

* Use for controlled research or tutoring scenarios, not autonomous grading.
* Always combine with explicit answer-format prompting (e.g., “### Answer: A”).
* Include human oversight for content validation or pedagogical deployment.

---

## How to Get Started with the Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ShAIkespear/Phi-2_DPO_Anton"

# Load the tokenizer and the full-precision (non-quantized) weights.
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,
    device_map="auto",
)

# Use the "### Question / ### Explanation / ### Answer" format the model was trained on.
prompt = "### Question: Which gas is most abundant in Earth's atmosphere?\n### Explanation: Identify the main atmospheric component.\n### Answer:"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=15)
print(tok.decode(out[0], skip_special_tokens=True))
```
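
For MCQA prompts that list lettered options, the predicted letter can be pulled out of the generated text by looking after the final `### Answer:` marker. This is a hedged convenience sketch, not part of the released model code:

```python
import re

def extract_answer_letter(generated_text):
    """Return the first option letter (A-D) after the last '### Answer:' marker, or None."""
    answer_part = generated_text.rsplit("### Answer:", 1)[-1]
    match = re.search(r"\b([A-D])\b", answer_part)
    return match.group(1) if match else None

# Example with a literal completion; in practice, pass the decoded model output.
print(extract_answer_letter("### Question: ...\n### Answer: B"))  # -> "B"
```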

---

## Training Details

### Training Data

Fine-tuning and preference alignment were performed on a mix of:

* **Public MCQA datasets:** MathQA, OpenBookQA, ScienceQA, and TAL-SCQ5K.
* **Private data:** EPFL student-curated MCQA and preference pairs (~20–30k).
* Samples were filtered to ≤512 tokens, and each dataset was truncated to 20k samples (see the sketch after this list).
* Splits were roughly 50% train, 25% test_overfit, 10% comparison, and 15% held-out.
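
A hedged sketch of the length filter and truncation using the `datasets` library; the dataset and its field names here are placeholders standing in for the actual mix:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/phi-2")

# Placeholder: the project mixes MathQA, OpenBookQA, ScienceQA, TAL-SCQ5K and private EPFL data.
ds = load_dataset("openbookqa", "main", split="train")

def short_enough(example):
    # Keep only samples whose question plus options fit within 512 tokens.
    text = example["question_stem"] + " " + " ".join(example["choices"]["text"])
    return len(tok(text)["input_ids"]) <= 512

ds = ds.filter(short_enough)
ds = ds.select(range(min(len(ds), 20_000)))  # truncate each dataset to 20k samples
```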

### Training Procedure

#### Preprocessing

All datasets were mapped to a unified MCQA schema (id, subject, question, choices, correct answer).
Prompts were structured as:
`### Question ... ### Explanation ... ### Answer`
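
A minimal sketch of rendering a unified-schema sample into this prompt format (the field names are assumptions based on the schema above):

```python
def format_mcqa_prompt(sample):
    """Render a unified-schema MCQA sample into the '### Question/Explanation/Answer' format."""
    letters = "ABCD"
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(sample["choices"]))
    return (
        f"### Question: {sample['question']}\n{options}\n"
        f"### Explanation: {sample.get('explanation', '')}\n"
        f"### Answer: {sample['correct_answer']}"
    )

example = {
    "question": "Which gas is most abundant in Earth's atmosphere?",
    "choices": ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon"],
    "explanation": "Nitrogen makes up roughly 78% of the atmosphere.",
    "correct_answer": "B",
}
print(format_mcqa_prompt(example))
```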

For DPO:

* **Chosen/rejected** preference pairs were constructed from student feedback or model-comparison data; the record format is sketched below.
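
TRL's `DPOTrainer` expects records with `prompt`, `chosen`, and `rejected` fields; a hypothetical pair might look like this:

```python
# One hypothetical preference pair in the prompt/chosen/rejected format used by TRL.
pair = {
    "prompt": "### Question: Which gas is most abundant in Earth's atmosphere?\n"
              "A. Oxygen\nB. Nitrogen\nC. Carbon dioxide\nD. Argon\n### Answer:",
    "chosen": " B",    # preferred completion (correct and well formatted)
    "rejected": " A",  # dispreferred completion (incorrect or poorly formatted)
}
```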

#### Training Hyperparameters

* **Regime:** Mixed precision (fp16/bf16).
* **LoRA:** rank=16, α=16, dropout=0.05.
* **Batch size:** SFT = 4; DPO = 1.
* **Learning rate:** 1e-5 for public data; 1e-4 for EPFL preference data.
* **Scheduler:** Cosine with warmup.
* **Frameworks:** Hugging Face TRL + PEFT/LoRA.
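
A rough sketch of wiring these hyperparameters into TRL is shown below. Exact argument names vary across TRL versions, and `beta`, `warmup_ratio`, and the `preference_dataset` / `model` / `tok` variables (a prompt/chosen/rejected dataset, the LoRA-wrapped model, and the tokenizer) are assumptions rather than the project's exact configuration.

```python
from trl import DPOConfig, DPOTrainer

# Hyperparameters taken from the list above; the rest are assumed defaults.
args = DPOConfig(
    output_dir="phi2-dpo",
    per_device_train_batch_size=1,   # DPO batch size = 1
    learning_rate=1e-4,              # 1e-4 for the EPFL preference data
    lr_scheduler_type="cosine",      # cosine schedule with warmup
    warmup_ratio=0.1,                # assumed warmup fraction
    bf16=True,                       # mixed precision
    beta=0.1,                        # assumed DPO temperature
)

trainer = DPOTrainer(
    model=model,                       # SFT'd Phi-2 with LoRA adapters
    ref_model=None,                    # with a PEFT model, TRL derives the frozen reference by disabling the adapters
    args=args,
    train_dataset=preference_dataset,  # records with prompt/chosen/rejected fields
    processing_class=tok,              # older TRL versions use tokenizer= instead
)
trainer.train()
```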

---

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

MMLU (STEM subset), EPFL preference test sets, and held-out comparison splits.

#### Metrics

* MCQA accuracy (% correct).
* DPO alignment score (pairwise preference accuracy).
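
Pairwise preference accuracy can be read as the fraction of pairs for which the tuned model assigns a higher total log-probability to the chosen answer than to the rejected one. A simplified sketch (it assumes the prompt's tokenization is a prefix of the full sequence's tokenization):

```python
import torch

@torch.no_grad()
def completion_logprob(model, tok, prompt, completion):
    """Sum of token log-probabilities the model assigns to `completion` given `prompt`."""
    full = tok(prompt + completion, return_tensors="pt").to(model.device)
    prompt_len = tok(prompt, return_tensors="pt")["input_ids"].shape[1]
    logits = model(**full).logits[:, :-1]   # position t predicts token t+1
    targets = full["input_ids"][:, 1:]
    logprobs = torch.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return logprobs[:, prompt_len - 1:].sum().item()  # score only the completion tokens

def preference_accuracy(model, tok, pairs):
    """Fraction of prompt/chosen/rejected pairs where the chosen answer scores higher."""
    wins = sum(
        completion_logprob(model, tok, p["prompt"], p["chosen"])
        > completion_logprob(model, tok, p["prompt"], p["rejected"])
        for p in pairs
    )
    return wins / len(pairs)
```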

### Results

Anton’s DPO variant showed improved *alignment consistency* and *format stability* over SFT-only checkpoints, especially on EPFL test sets.
However, general MCQA accuracy remained comparable to the team's other models.

#### Summary

* Improved coherence and answer formatting.
* Better preference-following vs. SFT baseline.
* Slightly slower inference (due to full precision and DPO reference model usage).

---

## Technical Specifications

### Model Architecture and Objective

Phi-2 transformer decoder LM (~2.78B params) with LoRA adapters.
Trained with:

* SFT (next-token prediction).
* DPO (pairwise reward alignment on chosen/rejected responses).
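
For reference, the standard DPO objective used for this pairwise alignment is

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],
$$

where $y_w$ and $y_l$ are the chosen and rejected responses, $\pi_{\mathrm{ref}}$ is the frozen SFT reference model, and $\beta$ controls how far the policy may drift from the reference.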

#### Software

Hugging Face TRL, PEFT, Transformers, PyTorch.

---

## Glossary

* **MCQA:** Multiple-choice question answering.
* **SFT:** Supervised finetuning on labeled answers.
* **DPO:** Direct Preference Optimization for alignment with human preferences.
* **LoRA:** Low-Rank Adaptation for efficient fine-tuning.