Model Card for Qwen2.5-Math-Instruct-Distill-Phi2-2.5K-Mixed

This model is a version of microsoft/phi-2 that has been fine-tuned using knowledge distillation. The goal was to teach the compact and efficient Phi-2 "student" model to replicate the step-by-step mathematical reasoning style of the powerful Qwen/Qwen2.5-Math-7B-Instruct "teacher" model.

This is V2 of the project. It was trained on a significantly larger and more diverse dataset than the V1 model, resulting in more robust reasoning and the ability to correctly generate LaTeX math notation.

Model Details

Model Description

This project explores style distillation, where a smaller model is trained not just on correct answers, but on the process and format of a larger, more capable model's output. The primary objective was to transfer the teacher's verbose, step-by-step reasoning methodology, including its use of LaTeX, to the student model.

The model was fine-tuned using the QLoRA method for high memory efficiency, making it possible to train on consumer-grade hardware.

  • Developed by: Dery Ferd
  • Model type: Causal Decoder-Only Transformer
  • Language(s) (NLP): English
  • License: MIT
  • Finetuned from model: microsoft/phi-2
  • Demo: [Link to your Gradio Demo if you deploy it on Spaces]

How to Get Started with the Model

Use the code below to load the fine-tuned model adapter and run inference.

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Your repository ID
repo_id = "DeryFerd/Qwen2.5-Math-Instruct-Distill-Phi2-2.5K-Mixed"
base_model_id = "microsoft/phi-2"

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Attach the LoRA adapter from the fine-tuned repository to the base model
model = PeftModel.from_pretrained(base_model, repo_id)
model.eval()

# --- Run Inference ---
instruction = "A right-angled triangle has two shorter sides with lengths of 8 cm and 15 cm. What is the length of the longest side (the hypotenuse)? Use the Pythagorean theorem (a^2 + b^2 = c^2) to solve it."
prompt = f"Instruct: {instruction.strip()}\nOutput:"

inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=300, pad_token_id=tokenizer.eos_token_id)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

final_answer = response.split("Output:")[1].strip()
print(final_answer)
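If you prefer to run inference without the PEFT wrapper, the adapter can optionally be folded into the base weights using PEFT's merge utility (not required for the snippet above):

# Optional: merge the LoRA adapter weights into the base model for standalone use
merged_model = model.merge_and_unload()
merged_model.eval()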

Training Details

Training Data

The model was trained on a combined dataset of 2,500 examples created specifically for this distillation task. The dataset is a mix of two sources to improve robustness and mathematical formatting capabilities:

  • 2,000 examples from GSM8K: A popular dataset of grade-school math word problems that require multi-step arithmetic reasoning.
  • 500 examples from MetaMathQA: A high-quality dataset covering a broader range of math topics, including algebra and more complex notation.

The answers (the "response" column) were not taken from the original datasets; they were synthetically generated by the Qwen/Qwen2.5-Math-7B-Instruct teacher model to provide high-quality, step-by-step reasoning examples for the student to learn from.
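As a rough illustration of this recipe, the data could be assembled along the following lines. This is a sketch, not the exact generation script: the column names ("question", "query"), the sampling seed, and the generation settings are assumptions.

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sample 2,000 GSM8K problems and 500 MetaMathQA problems (column names assumed)
gsm8k = load_dataset("gsm8k", "main", split="train").shuffle(seed=42).select(range(2000))
metamath = load_dataset("meta-math/MetaMathQA", split="train").shuffle(seed=42).select(range(500))
questions = [row["question"] for row in gsm8k] + [row["query"] for row in metamath]

# Load the teacher model that produces the step-by-step reference answers
teacher_id = "Qwen/Qwen2.5-Math-7B-Instruct"
teacher_tok = AutoTokenizer.from_pretrained(teacher_id)
teacher = AutoModelForCausalLM.from_pretrained(teacher_id, torch_dtype=torch.float16, device_map="auto")

def teacher_answer(question: str) -> str:
    # Ask the teacher for a worked, step-by-step solution
    messages = [{"role": "user", "content": question}]
    prompt_ids = teacher_tok.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to(teacher.device)
    with torch.no_grad():
        out = teacher.generate(prompt_ids, max_new_tokens=512)
    return teacher_tok.decode(out[0][prompt_ids.shape[-1]:], skip_special_tokens=True)

# Pair each question with its teacher-generated response
distill_data = [{"instruction": q, "response": teacher_answer(q)} for q in questions]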

Training Procedure

The model was fine-tuned using the SFTTrainer from the TRL library on a single NVIDIA T4 GPU in a Kaggle Notebook environment.

Preprocessing

Each data sample was formatted into a single string using the following template, which is suitable for the Phi-2 model:

Instruct: {instruction}\nOutput: {response}<|endoftext|>
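In code, the corresponding formatting function would look roughly like this (the column names follow the dataset description above):

def format_example(example):
    # Phi-2 style prompt: instruction followed by the teacher's step-by-step answer
    return f"Instruct: {example['instruction'].strip()}\nOutput: {example['response'].strip()}<|endoftext|>"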

Training Hyperparameters

  • Framework: Transformers, PEFT (QLoRA), TRL
  • Quantization: 4-bit nf4 with float16 compute dtype
  • LoRA r: 16
  • LoRA alpha: 32
  • LoRA target_modules: ["q_proj", "k_proj", "v_proj", "dense"]
  • Batch Size: per_device_train_batch_size = 1, gradient_accumulation_steps = 8 (effective batch size of 8)
  • Optimizer: paged_adamw_8bit
  • Learning Rate: 2e-4 with a constant scheduler
  • Epochs: 3
  • Precision: fp16
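Taken together, these settings correspond to a training setup roughly like the sketch below. Names such as output_dir and train_dataset are placeholders, and the exact SFTTrainer arguments vary slightly between TRL versions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer

# 4-bit NF4 quantization with float16 compute (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

# LoRA configuration matching the hyperparameters listed above
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size of 8
    learning_rate=2e-4,
    lr_scheduler_type="constant",
    num_train_epochs=3,
    optim="paged_adamw_8bit",
    fp16=True,
    output_dir="./phi2-distill",     # placeholder path
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_dataset,     # dataset with a "text" column built via format_example
    dataset_text_field="text",
    peft_config=peft_config,
)
trainer.train()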

Evaluation

Results & Summary

Evaluation was performed qualitatively by comparing the outputs of three models (Base Phi-2, this fine-tuned Student model, and the Teacher Qwen model) on a variety of math problems.

  • Success in Style Transfer: The model successfully adopted the verbose, step-by-step reasoning style of the teacher model. This is a significant improvement over the base Phi-2, which tends to provide short, direct answers without showing its work.
  • Success in LaTeX Generation: A key failure of the V1 model was its inability to correctly generate LaTeX math notation. This V2 model, trained on a more diverse dataset, successfully generates LaTeX for equations, fractions, and exponents, mirroring the teacher's output format.
  • Efficiency Gains: As expected from distillation, the student model delivers its detailed, high-quality answers significantly faster and with fewer generated tokens than the much larger 7B teacher model, demonstrating the core benefit of this project.

Bias, Risks, and Limitations

This is an experimental model trained for a specific purpose and has several limitations:

  • It inherits any biases present in the base microsoft/phi-2 and teacher Qwen/Qwen2.5-Math-7B-Instruct models.
  • Its primary training objective was style imitation, not necessarily improving raw mathematical accuracy. It may produce plausible-sounding but mathematically incorrect reasoning.
  • Its knowledge is limited to the patterns within the 2,500 training examples. It may not generalize well to math problems far outside the scope of GSM8K and MetaMathQA (e.g., advanced calculus).

This model should not be used for production or critical applications. It is intended as a portfolio project to demonstrate the effectiveness of knowledge distillation.

More Information [LinkedIn]

[My LinkedIn]

Developer Contact [Github]

[My Github]
