Model Card for Qwen2.5-Math-Instruct-Distill-Phi2-2.5K-Mixed
This model is a version of microsoft/phi-2 that has been fine-tuned using knowledge distillation. The goal was to teach the compact and efficient Phi-2 "student" model to replicate the step-by-step mathematical reasoning style of the powerful Qwen/Qwen2.5-Math-7B-Instruct "teacher" model.
This is V2 of this project. It was trained on a significantly larger and more diverse dataset than V1, which makes its reasoning more robust and lets it generate correct LaTeX math notation.
Model Details
Model Description
This project explores style distillation, where a smaller model is trained not just on correct answers, but on the process and format of a larger, more capable model's output. The primary objective was to transfer the teacher's verbose, step-by-step reasoning methodology, including its use of LaTeX, to the student model.
The model was fine-tuned using the QLoRA method for high memory efficiency, making it possible to train on consumer-grade hardware.
- Developed by: Dery Ferd
- Model type: Causal Decoder-Only Transformer
- Language(s) (NLP): English
- License: MIT
- Finetuned from model: microsoft/phi-2
How to Get Started with the Model
Use the code below to load the fine-tuned model adapter and run inference.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
# Your repository ID
repo_id = "DeryFerd/Qwen2.5-Math-Instruct-Distill-Phi2-2.5K-Mixed"
base_model_id = "microsoft/phi-2"
# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(repo_id)
# Attach the LoRA adapter to the base model. The adapter weights stay separate;
# call model.merge_and_unload() afterwards if a single merged model is needed.
model = PeftModel.from_pretrained(base_model, repo_id)
model.eval()
# --- Run Inference ---
instruction = "A right-angled triangle has two shorter sides with lengths of 8 cm and 15 cm. What is the length of the longest side (the hypotenuse)? Use the Pythagorean theorem (a^2 + b^2 = c^2) to solve it."
prompt = f"Instruct: {instruction.strip()}\nOutput:"
inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False).to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=300, pad_token_id=tokenizer.eos_token_id)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
final_answer = response.split("Output:")[1].strip()
print(final_answer)
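The `split("Output:")[1]` step above raises an `IndexError` if the marker is missing and cuts the answer short if the marker appears again inside it. A more defensive variant is sketched below in plain Python. `extract_answer` is a hypothetical helper name, not part of the original code, and the truncation at a following `Instruct:` line is an assumption about how a run-on generation might look.

```python
def extract_answer(decoded: str) -> str:
    """Return the text after the first 'Output:' marker.

    Hypothetical helper; assumes the 'Instruct: ... Output:' prompt format.
    """
    # partition() never raises; 'found' is empty when the marker is missing.
    _, found, answer = decoded.partition("Output:")
    if not found:
        return decoded.strip()
    # Assumption: if generation ran on into a new turn, keep only the first answer.
    answer = answer.split("\nInstruct:", 1)[0]
    return answer.strip()


example = "Instruct: What is 2 + 2?\nOutput: 2 + 2 = 4\nInstruct: Next question"
print(extract_answer(example))
```

In the snippet above, this would replace the `final_answer = ...` line with `final_answer = extract_answer(response)`.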
Training Details
Training Data
The model was trained on a combined dataset of 2,500 examples created specifically for this distillation task. The dataset is a mix of two sources to improve robustness and mathematical formatting capabilities:
- 2,000 examples from GSM8K: A popular dataset of grade-school math word problems that require multi-step arithmetic reasoning.
- 500 examples from MetaMathQA: A high-quality dataset covering a broader range of math topics, including algebra and more complex notation.
The answers ("response" column) were not the original dataset answers, but were synthetically generated by the Qwen/Qwen2.5-Math-7B-Instruct teacher model to provide high-quality, step-by-step reasoning examples for the student to learn from.
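The mixing and relabeling described above could look roughly like the sketch below. The card does not describe the actual sampling procedure, so the random sampling, the seed, the shuffle, the `source` and `instruction` column names, and both function names are assumptions; only the `response` column name and the 2,000/500 split come from the card. Toy question lists and a stub teacher stand in for the real datasets and model.

```python
import random


def mix_datasets(gsm8k, metamath, n_gsm=2000, n_meta=500, seed=42):
    """Sample questions from both sources and return one shuffled list."""
    rng = random.Random(seed)
    mixed = [{"source": "gsm8k", "instruction": q} for q in rng.sample(gsm8k, n_gsm)]
    mixed += [{"source": "metamathqa", "instruction": q} for q in rng.sample(metamath, n_meta)]
    rng.shuffle(mixed)
    return mixed


def attach_teacher_responses(examples, teacher_generate):
    """Fill the 'response' column with teacher output, not the original answers."""
    return [{**ex, "response": teacher_generate(ex["instruction"])} for ex in examples]


# Toy stand-ins so the sketch runs without downloading anything.
gsm8k_questions = [f"GSM8K question {i}" for i in range(3000)]
metamath_questions = [f"MetaMathQA question {i}" for i in range(1000)]


def fake_teacher(question):
    # Placeholder for a call to the teacher model.
    return f"[teacher solution for: {question}]"


dataset = attach_teacher_responses(
    mix_datasets(gsm8k_questions, metamath_questions), fake_teacher
)
print(len(dataset))
```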
Training Procedure
The model was fine-tuned using the SFTTrainer from the TRL library on a single NVIDIA T4 GPU in a Kaggle Notebook environment.
Preprocessing
Each data sample was formatted into a single string using the following template, which is suitable for the Phi-2 model: Instruct: {instruction}\nOutput: {response}<|endoftext|>
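As a minimal sketch, the template above can be applied with a small formatting function. `format_example` and `EOS_TOKEN` are hypothetical names, and stripping whitespace from both fields is an assumption (the inference code strips the instruction, but the card does not say what was done during training).

```python
EOS_TOKEN = "<|endoftext|>"  # end-of-text marker used in the template above


def format_example(instruction: str, response: str) -> str:
    """Build one training string in the 'Instruct: ... Output: ...' format."""
    return f"Instruct: {instruction.strip()}\nOutput: {response.strip()}{EOS_TOKEN}"


sample = format_example("What is 3 * 4?", "3 * 4 = 12")
print(repr(sample))
```

Ending each sample with the end-of-text marker gives the model a signal for where an answer stops.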
Training Hyperparameters
- Framework: Transformers, PEFT (QLoRA), TRL
- Quantization: 4-bit nf4 with float16 compute dtype
- LoRA r: 16
- LoRA alpha: 32
- LoRA target_modules: ["q_proj", "k_proj", "v_proj", "dense"]
- Batch Size: per_device_train_batch_size = 1, gradient_accumulation_steps = 8 (effective batch size of 8)
- Optimizer: paged_adamw_8bit
- Learning Rate: 2e-4 with a constant scheduler
- Epochs: 3
- Precision: fp16
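For orientation, the values above are grouped below by the keyword arguments they would most plausibly map to in `BitsAndBytesConfig`, `LoraConfig`, and `TrainingArguments`. This is a sketch, not the original training script: the dict names are illustrative, the mapping is an assumption, and plain dicts are used so the arithmetic runs without any ML libraries installed.

```python
# Hypothetical grouping of the listed hyperparameters (not the original script).
quantization_kwargs = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "float16",  # torch.float16 in a real config
}
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "target_modules": ["q_proj", "k_proj", "v_proj", "dense"],
}
training_kwargs = {
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "optim": "paged_adamw_8bit",
    "learning_rate": 2e-4,
    "lr_scheduler_type": "constant",
    "num_train_epochs": 3,
    "fp16": True,
}

num_gpus = 1  # single T4, per the Training Procedure section
effective_batch_size = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
    * num_gpus
)
# Standard LoRA scales the low-rank update by alpha / r.
lora_scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
# Rough estimate only; the exact step count depends on trainer rounding.
approx_optimizer_steps = 2500 * training_kwargs["num_train_epochs"] / effective_batch_size

print(effective_batch_size, lora_scaling, approx_optimizer_steps)
```

With one sample per forward pass and eight accumulation steps, each optimizer update sees eight examples, which keeps peak memory low on a 16 GB T4 at the cost of more forward passes per update.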
Evaluation
Results & Summary
Evaluation was performed qualitatively by comparing the outputs of three models (Base Phi-2, this fine-tuned Student model, and the Teacher Qwen model) on a variety of math problems.
- Success in Style Transfer: The model successfully adopted the verbose, step-by-step reasoning style of the teacher model. This is a significant improvement over the base Phi-2, which tends to provide short, direct answers without showing its work.
- Success in LaTeX Generation: A key failure of the V1 model was its inability to correctly generate LaTeX math notation. This V2 model, trained on a more diverse dataset, successfully generates LaTeX for equations, fractions, and exponents, mirroring the teacher's output format.
- Efficiency Gains: In these qualitative comparisons, the student model produced its detailed, step-by-step answers noticeably faster and with fewer generated tokens than the much larger 7B teacher model, which is the core benefit expected from distillation.
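A side-by-side comparison like the one described above can be organized with a small harness. The sketch below is an assumption about how such a comparison might be structured, not the procedure actually used: `compare_models` is a hypothetical name, the generators are stubs that return placeholder strings rather than real model output, and the whitespace word count is only a rough proxy for token count.

```python
import time


def compare_models(generators, problems):
    """Run every problem through every model; collect outputs and timings.

    generators: dict mapping a label to a callable(prompt) -> str.
    """
    rows = []
    for problem in problems:
        for label, generate in generators.items():
            start = time.perf_counter()
            text = generate(problem)
            elapsed = time.perf_counter() - start
            rows.append({
                "problem": problem,
                "model": label,
                "output": text,
                "seconds": elapsed,
                "approx_words": len(text.split()),  # rough proxy for tokens
            })
    return rows


def make_stub(label):
    # Placeholder generator; a real one would wrap model.generate().
    return lambda prompt: f"[{label} output for: {prompt}]"


stub_generators = {name: make_stub(name) for name in ("base", "student", "teacher")}
results = compare_models(stub_generators, ["Find the hypotenuse for legs 8 and 15."])
for row in results:
    print(row["model"], row["approx_words"], f"{row['seconds']:.6f}s")
```

Keeping the outputs in one list of rows makes it easy to read the three answers for each problem next to each other, which is what a qualitative review needs.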
Bias, Risks, and Limitations
This is an experimental model trained for a specific purpose and has several limitations:
- It inherits any biases present in the base microsoft/phi-2 and teacher Qwen/Qwen2.5-Math-7B-Instruct models.
- Its primary training objective was style imitation, not necessarily improving raw mathematical accuracy. It may produce plausible-sounding but mathematically incorrect reasoning.
- Its knowledge is limited to the patterns within the 2,500 training examples. It may not generalize well to math problems far outside the scope of GSM8K and MetaMathQA (e.g., advanced calculus).
This model should not be used for production or critical applications. It is intended as a portfolio project to demonstrate the effectiveness of knowledge distillation.
More Information [LinkedIn]
Developer Contact [GitHub]