---
library_name: transformers
license: mit
language:
  - en
base_model:
  - microsoft/phi-2
---

# Model Card for ShAIkespear/Phi-2_DPO_M3_Base_Alt

A **LoRA-finetuned** and **Direct Preference Optimization (DPO)**–aligned variant of **microsoft/phi-2**, specialized for **multiple-choice question answering (MCQA)** with an emphasis on **STEM and general knowledge** domains.

This model is the *alternative base configuration* of the final **M3 (balanced-then-DPO)** training pipeline from the *ShAIkespear* project. It is kept in full precision (no 8-bit quantization) for highest fidelity and ease of further fine-tuning.

---

## Model Details

* **Developed by:** ShAIkespear team
* **Shared by:** ShAIkespear team
* **Model type:** Causal LM (Phi-2) with LoRA adapters; DPO-aligned
* **Languages:** English
* **License:** MIT
* **Finetuned from:** microsoft/phi-2

### Model Sources

* **Repository:** [2.8B-Phi-2-LLM-QA](https://github.com/EricSaikali/2.8B-Phi-2-LLM-QA)
* **Report:** *“ShAIkespear – How to replace TAs: A comprehensive study on letting LLMs answer your questions”*

---

## Uses

### Direct Use

* MCQA and educational Q&A (MMLU, OpenBookQA, ScienceQA).
* Alignment research: comparing DPO training setups (Base vs. Quantized).
* A **high-fidelity reference checkpoint** for quantized and downstream variants.

### Out-of-Scope Use

* High-stakes or safety-critical applications (medical, legal, policy).
* Generative tasks outside multiple-choice reasoning.
* Automated exam solving or any use that leaks confidential data.

---

## Bias, Risks, and Limitations

* **Domain bias:** Stronger on factual MCQA, weaker on advanced reasoning tasks.
* **Answer drift:** May occasionally produce verbose or follow-up answers when the prompt lacks explicit formatting.
* **Data source risks:** EPFL-derived preferences may encode narrow style biases.

### Recommendations

* Maintain the structured prompt format:

  ```
  ### Question
  ...
  ### Explanation
  ...
  ### Answer:
  ```

* Keep human supervision in any educational or grading use.
* Prefer this full-precision model for fine-tuning or evaluation; use quantized versions for deployment.

---

## How to Get Started

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ShAIkespear/Phi-2_DPO_M3_Base_Alt"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,  # full-precision checkpoint; load in fp16 to save memory
)

# Prompt in the card's structured "### Question / ### Explanation / ### Answer" schema.
prompt = (
    "### Question: Which element has the chemical symbol 'O'?\n"
    "### Explanation: The symbol 'O' represents this essential gas.\n"
    "### Answer:"
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=15, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
```

---

## Training Details

### Training Data

* **SFT stage:** Balanced MCQA mix: MathQA, OpenBookQA, ScienceQA, TAL-SCQ5K, and EPFL question sets.
* **DPO stage:** Human preference pairs (EPFL exams + public feedback datasets such as HelpSteer).
* **Schema:** Unified “### Question / ### Explanation / ### Answer” format.
* **Filtering:** ≤512 tokens, balanced sample caps (~20k per dataset).

### Training Procedure

* **Pipeline:** SFT → DPO (M3 configuration); see the configuration sketch after this list.
* **LoRA parameters:** rank = 16, α = 16, dropout = 0.05.
* **Batch sizes:** SFT = 4; DPO = 1.
* **Learning rates:** 1e-5 (public) / 1e-4 (EPFL).
* **Scheduler:** Cosine with warmup.
* **Frameworks:** Hugging Face Transformers + TRL + PEFT (LoRA).
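The exact training scripts live in the project repository; the following is only a minimal sketch of how the DPO stage could be wired up with TRL and PEFT using the hyperparameters listed above. The toy preference pair, `output_dir`, `warmup_ratio`, and DPO `beta` are assumptions not documented by the project, and keyword names vary across TRL releases (e.g., `processing_class` vs. the older `tokenizer`).

```python
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# The real pipeline starts DPO from the SFT checkpoint; the base model stands in here.
model_id = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token  # phi-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_id)

# LoRA hyperparameters as listed in the card: rank 16, alpha 16, dropout 0.05.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Toy preference pair in the card's prompt schema (illustrative only).
train_dataset = Dataset.from_list([
    {
        "prompt": "### Question: What is 2 + 2?\n### Explanation: Basic addition.\n### Answer:",
        "chosen": " 4",
        "rejected": " 5",
    },
])

# Card values: DPO batch size 1, lr 1e-5 (public data), cosine schedule with warmup.
# beta and warmup_ratio are assumed defaults, not taken from the project.
training_args = DPOConfig(
    output_dir="phi2-dpo-m3",  # hypothetical path
    per_device_train_batch_size=1,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    beta=0.1,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older TRL releases use tokenizer= instead
    peft_config=lora_config,
)
trainer.train()
```

With `peft_config` supplied, TRL trains only the LoRA adapters and uses the frozen base weights as the implicit reference model, so no separate `ref_model` needs to be loaded.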
---

## Evaluation Summary

* **Configuration:** *M3 Base (Alt)* is the unquantized reference model for the quantized 8-bit variant.
* **Performance:** The balanced dataset improves cross-domain consistency; DPO improves answer formatting and style alignment.
* **Accuracy:** Similar to the quantized model (~0.61 MMLU avg.), slightly higher on reasoning subtasks. An illustrative scoring sketch appears at the end of this card.
* **Use case:** Experimentation, evaluation, or further domain-specific fine-tuning.

---

## Technical Specifications

* **Architecture:** Phi-2 (~2.78B parameters), decoder-only transformer.
* **Objective:** SFT next-token prediction + DPO preference alignment.
* **Precision:** Full precision (fp16/bf16).
* **Software:** Hugging Face Transformers, TRL, PEFT.

---

## Glossary

* **MCQA:** Multiple-Choice Question Answering
* **SFT:** Supervised Fine-Tuning
* **DPO:** Direct Preference Optimization
* **LoRA:** Low-Rank Adaptation
* **Alt (Alternative):** Internal naming for the alternate full-precision checkpoint variant of M3
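---

## Appendix: MCQA Scoring Sketch

The MMLU numbers above come from the project's own evaluation pipeline, which is not reproduced here. As a rough illustration of how accuracy can be scored against this card's prompt schema, here is a minimal sketch; the toy item, option labels, and answer-extraction regex are assumptions, not the project's actual harness.

```python
import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ShAIkespear/Phi-2_DPO_M3_Base_Alt"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

# Toy MCQA items in the card's "### Question / ### Explanation / ### Answer" schema.
items = [
    {
        "question": (
            "Which planet is known as the Red Planet?\n"
            "A. Venus\nB. Mars\nC. Jupiter\nD. Saturn"
        ),
        "gold": "B",
    },
]

correct = 0
for item in items:
    prompt = f"### Question: {item['question']}\n### Explanation:"
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=False,
        pad_token_id=tok.eos_token_id,
    )
    text = tok.decode(out[0], skip_special_tokens=True)
    # Keep the first option letter that follows the "### Answer:" marker, if any.
    match = re.search(r"### Answer:\s*([A-D])", text)
    if match and match.group(1) == item["gold"]:
        correct += 1

print(f"accuracy = {correct / len(items):.2f}")
```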