Model Card for ShAIkespear/Phi-2_DPO_Anton

A LoRA-finetuned variant of microsoft/phi-2 targeting multiple-choice question answering (MCQA) tasks, developed as part of the ShAIkespear EPFL project. The model was trained with Supervised Fine-Tuning (SFT) on general and STEM MCQA datasets and subsequently aligned with Direct Preference Optimization (DPO).

This checkpoint emphasizes preference-based alignment via DPO, with the goal of improving consistency and correctness on EPFL-style MCQ evaluations.


Model Details

Model Description

This model extends Phi-2 (2.78B parameters, 2,048-token context length) with LoRA adapters (rank=16, α=16, dropout=0.05). Training followed an SFT → DPO sequence built on the Hugging Face TRL and PEFT libraries; no quantization was applied to the base model.
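
For illustration, the adapter setup above roughly corresponds to the following PEFT configuration; this is a sketch, and the target modules are an assumption (typical Phi-2 projection layers) since the card does not list them.

from peft import LoraConfig

# Sketch of the LoRA setup described above (rank=16, α=16, dropout=0.05).
# target_modules is an assumption; the card does not state which layers were adapted.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
)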

  • Developed by: ShAIkespear team
  • Shared by: ShAIkespear team
  • Model type: Causal decoder-only LM (Phi-2) with LoRA adapters; DPO-aligned MCQA model
  • Language(s): English
  • License: MIT
  • Finetuned from model: microsoft/phi-2

Model Sources

  • Repository: 2.8B-Phi-2-LLM-QA
  • Report: “ShAIkespear – How to replace TAs: A comprehensive study on letting LLMs answer your questions”

Uses

Direct Use

  • Multiple-choice question answering (MCQA) for educational and benchmarking contexts (e.g., MMLU, ScienceQA).
  • Research into DPO-aligned fine-tuning and human preference optimization.
  • Exploration of prompt-format sensitivity for educational question-answering.

Out-of-Scope Use

  • Critical decision-making (medical, legal, financial) without human oversight.
  • Generative free-form writing or extended reasoning tasks (model tuned specifically for MCQA).
  • Any form of automated exam-taking or misuse of assessment materials.

Bias, Risks, and Limitations

  • Performance Variance: STEM questions remain challenging; accuracy is near random (~0.25) on complex math and science items.
  • Overalignment: DPO training may overly prioritize stylistic preference patterns, slightly reducing factual robustness.
  • Data Sensitivity: Includes human preference data from EPFL coursework; may reflect biases of a limited annotator pool.

Recommendations

  • Use for controlled research or tutoring scenarios, not autonomous grading.
  • Always combine with explicit answer-format prompting (e.g., “### Answer: A”).
  • Include human oversight for content validation or pedagogical deployment.

How to Get Started with the Model

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ShAIkespear/Phi-2_DPO_Anton"

# Load the tokenizer and the DPO-aligned checkpoint; device_map="auto" places
# the weights on an available GPU or falls back to CPU.
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Use the "### Question / ### Explanation / ### Answer" prompt schema the model
# was trained on (see Preprocessing below).
prompt = "### Question: Which gas is most abundant in Earth's atmosphere?\n### Explanation: Identify the main atmospheric component.\n### Answer:"
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Greedy decoding; a short token budget is enough to read off the answer.
out = model.generate(**inputs, max_new_tokens=15, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))

Training Details

Training Data

Fine-tuning and preference alignment were performed on a mix of:

  • Public MCQA datasets: MathQA, OpenBookQA, ScienceQA, and TAL-SCQ5K.
  • Private data: EPFL student-curated MCQA and preference pairs (~20–30k).
  • Filtered for ≤512 tokens; each dataset truncated to 20k samples (see the sketch after this list).
  • Splits roughly: 50% train, 25% test_overfit, 10% comparison, 15% held-out.
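
The length filter and per-dataset cap mentioned above can be sketched as follows, assuming Hugging Face datasets objects; the toy record and field name are illustrative, not the project's actual preprocessing code.

from datasets import Dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/phi-2")

# Toy stand-in for one of the public MCQA datasets (fields are illustrative).
mcqa_ds = Dataset.from_dict(
    {"question": ["Which gas is most abundant in Earth's atmosphere?"]}
)

def short_enough(example, max_tokens=512):
    # Keep only examples whose text fits within the 512-token budget.
    return len(tok(example["question"]).input_ids) <= max_tokens

mcqa_ds = mcqa_ds.filter(short_enough)
mcqa_ds = mcqa_ds.select(range(min(20_000, len(mcqa_ds))))  # cap at 20k samples per dataset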

Training Procedure

Preprocessing

All datasets were mapped to a unified MCQA schema (id, subject, question, choices, correct answer). Prompts follow the structure: ### Question ... ### Explanation ... ### Answer

For DPO:

  • Chosen/Rejected preference pairs constructed from student feedback or model-comparison data.
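
To make the schema and pair construction concrete, a hypothetical record and the resulting DPO example (in TRL's prompt/chosen/rejected format) could look like the following; the question text and explanation are illustrative.

# Hypothetical record following the unified MCQA schema.
record = {
    "id": "epfl-0042",
    "subject": "physics",
    "question": "Which quantity is conserved in an elastic collision?",
    "choices": ["A. Kinetic energy", "B. Charge only", "C. Neither", "D. Temperature"],
    "correct": "A",
}

prompt = (
    f"### Question: {record['question']}\n"
    + "\n".join(record["choices"])
    + "\n### Explanation: Recall the definition of an elastic collision."
    + "\n### Answer:"
)

# DPO training consumes prompt/chosen/rejected triples built from preference data.
dpo_example = {
    "prompt": prompt,
    "chosen": " A",    # preferred completion (correct, well-formatted)
    "rejected": " C",  # dispreferred completion
}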

Training Hyperparameters

  • Regime: Mixed precision (fp16/bf16).
  • LoRA: rank=16, α=16, dropout=0.05.
  • Batch size: SFT = 4; DPO = 1.
  • Learning rate: 1e-5 for public data; 1e-4 for EPFL preference data.
  • Scheduler: Cosine with warmup.
  • Frameworks: Hugging Face TRL + PEFT/LoRA.
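
A minimal TRL training sketch consistent with these settings follows; the dataset, output path, warmup ratio, and LoRA target modules are assumptions, and argument names vary slightly across TRL versions.

from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
tok = AutoTokenizer.from_pretrained("microsoft/phi-2")
tok.pad_token = tok.eos_token

# Toy preference dataset in the prompt/chosen/rejected format.
train_ds = Dataset.from_dict({
    "prompt": ["### Question: 2 + 2 = ?\n### Answer:"],
    "chosen": [" 4"],
    "rejected": [" 5"],
})

# LoRA settings from the card; target_modules is an assumption.
peft_cfg = LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.05,
    task_type="CAUSAL_LM", target_modules=["q_proj", "v_proj"],
)

args = DPOConfig(
    output_dir="phi2-dpo-sketch",     # placeholder
    per_device_train_batch_size=1,    # DPO batch size from the card
    learning_rate=1e-4,               # rate used for the EPFL preference data
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                 # assumption; the card only says "with warmup"
    bf16=True,                        # mixed-precision regime
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    processing_class=tok,             # "tokenizer=" in older TRL releases
    peft_config=peft_cfg,
)
trainer.train()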

Evaluation

Testing Data, Factors & Metrics

Testing Data

MMLU (STEM subset), EPFL preference test sets, and held-out comparison splits.

Metrics

  • MCQA accuracy (% correct).
  • DPO alignment score (pairwise preference accuracy).
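
A minimal sketch of how these metrics can be computed from model outputs, assuming the answer letter is read off after the final "### Answer:" marker; the helper names are illustrative.

def extract_choice(generated_text: str) -> str:
    # Take the first character after the final "### Answer:" marker, e.g. "A".
    return generated_text.rsplit("### Answer:", 1)[-1].strip()[:1]

def mcqa_accuracy(generations, gold_letters):
    # Fraction of questions where the extracted letter matches the gold answer.
    correct = sum(extract_choice(g) == y for g, y in zip(generations, gold_letters))
    return correct / len(gold_letters)

def pairwise_preference_accuracy(chosen_rewards, rejected_rewards):
    # Fraction of pairs where the chosen response receives the higher implicit
    # reward (TRL logs a similar quantity as "rewards/accuracies" during DPO).
    wins = sum(c > r for c, r in zip(chosen_rewards, rejected_rewards))
    return wins / len(chosen_rewards)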

Results

Anton’s DPO variant showed improved alignment consistency and format stability over SFT-only checkpoints, especially on EPFL test sets. However, general MCQA accuracy remained comparable to other team models.

Summary

  • Improved coherence and answer formatting.
  • Better preference-following vs. SFT baseline.
  • Slightly slower than quantized alternatives: inference runs in full precision, and DPO training additionally requires a frozen reference model.

Technical Specifications

Model Architecture and Objective

Phi-2 transformer decoder LM (~2.78B params) with LoRA adapters. Trained with:

  • SFT (next-token prediction).
  • DPO (pairwise preference alignment on chosen/rejected responses).
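
For reference, the pairwise loss minimized during the DPO stage (as introduced by Rafailov et al. and implemented in TRL) is:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

where $y_w$ and $y_l$ are the chosen and rejected responses, $\pi_{\mathrm{ref}}$ is the frozen SFT checkpoint used as reference, and $\beta$ controls how far the policy may drift from that reference.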

Software

Hugging Face TRL, PEFT, Transformers, PyTorch.


Glossary

  • MCQA: Multiple-choice question answering.
  • SFT: Supervised finetuning on labeled answers.
  • DPO: Direct Preference Optimization for alignment with human preferences.
  • LoRA: Low-Rank Adaptation for efficient fine-tuning.