---
library_name: transformers
license: mit
language:
- en
base_model:
- microsoft/phi-2
---

# Model Card for ShAIkespear/Phi-2_DPO_Anton

A LoRA-finetuned variant of **microsoft/phi-2** targeting multiple-choice question answering (MCQA) tasks, developed as part of the *ShAIkespear* EPFL project. This model was trained and aligned using Direct Preference Optimization (DPO) after Supervised Finetuning (SFT) on general and STEM MCQA datasets.

This checkpoint emphasizes *preference-based fine-tuning* with DPO, focusing on improving consistency and correctness on EPFL-style MCQ evaluations.

---

## Model Details

### Model Description

This model extends Phi-2 (2.78B parameters, 2,048-token context length) with LoRA adapters (rank=16, α=16, dropout=0.05). Training followed the *SFT → DPO* sequence, leveraging the Hugging Face TRL and PEFT frameworks. The model was trained at full precision, without quantization.

* **Developed by:** ShAIkespear team
* **Shared by:** ShAIkespear team
* **Model type:** Causal decoder-only LM (Phi-2) with LoRA adapters; DPO-aligned MCQA model
* **Language(s):** English
* **License:** MIT
* **Finetuned from model:** microsoft/phi-2

---

### Model Sources

* **Repository:** [2.8B-Phi-2-LLM-QA](https://github.com/EricSaikali/2.8B-Phi-2-LLM-QA)
* **Report:** “ShAIkespear – How to replace TAs: A comprehensive study on letting LLMs answer your questions”

---

## Uses

### Direct Use

* Multiple-choice question answering (MCQA) for educational and benchmarking contexts (e.g., MMLU, ScienceQA).
* Research into DPO-aligned fine-tuning and human preference optimization.
* Exploration of prompt-format sensitivity for educational question answering.

### Out-of-Scope Use

* Critical decision-making (medical, legal, financial) without human oversight.
* Generative free-form writing or extended reasoning tasks (the model is tuned specifically for MCQA).
* Any form of automated exam-taking or misuse of assessment materials.

---

## Bias, Risks, and Limitations

* **Performance variance:** STEM questions remain challenging; accuracy is near random (~0.25) on complex math and science items.
* **Overalignment:** DPO training may overly prioritize stylistic preference patterns, slightly reducing factual robustness.
* **Data sensitivity:** Training includes human preference data from EPFL coursework and may reflect the biases of a limited annotator pool.

### Recommendations

* Use for controlled research or tutoring scenarios, not autonomous grading.
* Always combine with explicit answer-format prompting (e.g., “### Answer: A”).
* Include human oversight for content validation or pedagogical deployment.

---

## How to Get Started with the Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ShAIkespear/Phi-2_DPO_Anton"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = (
    "### Question: Which gas is most abundant in Earth's atmosphere?\n"
    "### Explanation: Identify the main atmospheric component.\n"
    "### Answer:"
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=15)
print(tok.decode(out[0], skip_special_tokens=True))
```

---

## Training Details

### Training Data

Fine-tuning and preference alignment were performed on a mix of:

* **Public MCQA datasets:** MathQA, OpenBookQA, ScienceQA, and TAL-SCQ5K.
* **Private data:** EPFL student-curated MCQA and preference pairs (~20–30k).
* Samples filtered to ≤512 tokens; each dataset truncated to 20k samples (one way to reproduce this is sketched below).
* Splits roughly: 50% train, 25% test_overfit, 10% comparison, 15% held-out.
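The following is a minimal, non-authoritative sketch of how the length filter, per-dataset cap, and rough splits above could be reproduced with the Hugging Face `datasets` library. OpenBookQA and its `question_stem` column are example choices only; the team's actual preprocessing scripts live in the project repository.

```python
# Illustrative sketch only: length filtering, per-dataset truncation, and rough splits.
# OpenBookQA is used as an example source; column names differ for the other datasets,
# and the team's exact filter may count the full prompt rather than the question alone.
from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/phi-2")

ds = load_dataset("openbookqa", "main", split="train")

# Keep only samples whose question text fits in 512 tokens.
ds = ds.filter(lambda ex: len(tok(ex["question_stem"]).input_ids) <= 512)

# Cap each dataset at 20k samples.
ds = ds.select(range(min(20_000, len(ds))))

# Rough splits: 50% train, with the remainder divided into the smaller evaluation sets.
first = ds.train_test_split(test_size=0.5, seed=42)
train, rest = first["train"], first["test"]
second = rest.train_test_split(test_size=0.5, seed=42)
test_overfit, comparison_and_heldout = second["train"], second["test"]
```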
### Training Procedure

#### Preprocessing

All sources were converted to a unified MCQA schema (id, subject, question, choices, correct answer), and prompts were structured as:

`### Question ... ### Explanation ... ### Answer`

For DPO:

* **Chosen/rejected** preference pairs were constructed from student feedback or model-comparison data.

#### Training Hyperparameters

* **Regime:** Mixed precision (fp16/bf16).
* **LoRA:** rank=16, α=16, dropout=0.05.
* **Batch size:** SFT = 4; DPO = 1.
* **Learning rate:** 1e-5 for public data; 1e-4 for EPFL preference data.
* **Scheduler:** Cosine with warmup.
* **Frameworks:** Hugging Face TRL + PEFT/LoRA (an illustrative setup is sketched in the appendix below).

---

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

MMLU (STEM subset), EPFL preference test sets, and held-out comparison splits.

#### Metrics

* MCQA accuracy (% correct).
* DPO alignment score (pairwise preference accuracy).

### Results

Anton’s DPO variant showed improved *alignment consistency* and *format stability* over SFT-only checkpoints, especially on the EPFL test sets. However, general MCQA accuracy remained comparable to the team’s other models.

#### Summary

* Improved coherence and answer formatting.
* Better preference-following than the SFT baseline.
* Slightly slower inference (due to full precision and DPO reference model usage).

---

## Technical Specifications

### Model Architecture and Objective

Phi-2 transformer decoder LM (~2.78B parameters) with LoRA adapters. Trained with:

* SFT (next-token prediction).
* DPO (pairwise preference alignment on chosen/rejected responses).

#### Software

Hugging Face TRL, PEFT, Transformers, PyTorch.

---

## Glossary

* **MCQA:** Multiple-choice question answering.
* **SFT:** Supervised finetuning on labeled answers.
* **DPO:** Direct Preference Optimization for alignment with human preferences.
* **LoRA:** Low-Rank Adaptation for efficient fine-tuning.
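---

## Appendix: Illustrative DPO Training Sketch

The snippet below is a minimal, non-authoritative sketch of the DPO stage described under *Training Details*, assuming a recent TRL release (with `DPOConfig` and the `processing_class` argument) together with PEFT. The toy preference pair, LoRA target modules, `beta`, and warmup values are illustrative assumptions rather than the team's exact configuration; the actual training code is in the project repository.

```python
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "microsoft/phi-2"
tok = AutoTokenizer.from_pretrained(base_id)
tok.pad_token = tok.eos_token  # Phi-2's tokenizer has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base_id)

# Preference data in the prompt/chosen/rejected format expected by DPOTrainer;
# this single toy pair stands in for the EPFL preference dataset.
pairs = Dataset.from_dict({
    "prompt": ["### Question: 2 + 2 = ?\n(A) 3 (B) 4 (C) 5\n"
               "### Explanation: Add the two integers.\n### Answer:"],
    "chosen": [" B"],
    "rejected": [" C"],
})

# LoRA settings from this card; the target-module list is an assumption.
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
)

args = DPOConfig(
    output_dir="phi2-dpo-sketch",
    per_device_train_batch_size=1,  # DPO batch size from this card
    learning_rate=1e-4,             # rate used for the EPFL preference data
    lr_scheduler_type="cosine",     # cosine schedule with warmup
    warmup_ratio=0.1,               # warmup amount is an assumption
    beta=0.1,                       # DPO temperature (TRL default)
    max_length=512,                 # matches the 512-token data filter
    report_to="none",
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with peft_config, the frozen base weights serve as the reference
    args=args,
    train_dataset=pairs,
    processing_class=tok,
    peft_config=peft_config,
)
trainer.train()
trainer.save_model("phi2-dpo-sketch")
```

Passing `ref_model=None` together with a `peft_config` lets TRL use the frozen base weights (with adapters disabled) as the DPO reference model, avoiding a second full copy of Phi-2 in memory.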