Model Card for Phi3-Lab-Report-Coder (LoRA on Phi-3 Mini 4k Instruct)
A lightweight LoRA-adapter fine-tune of microsoft/Phi-3-mini-4k-instruct that turns structured lab contexts and observations into executable Python code for the target calculations (e.g., mechanics, fluids, vibrations, basic circuits, titrations). Trained with QLoRA in 4-bit, this model is intended as an assistive code generator for STEM lab writeups and teaching demos, not as a certified calculator for safety-critical engineering.
Model Details
Model Description
- Developed by: Barghav777
- Model type: Causal decoder LM (instruction-tuned) + LoRA adapter
- Languages: English
- License: MIT
- Finetuned from: microsoft/Phi-3-mini-4k-instruct
- Intended input format: A structured prompt with:
  - ### CONTEXT: (natural-language description of the experiment)
  - ### OBSERVATIONS: (JSON-like dict with units, readings)
  - ### CODE: (the model is trained to generate the Python solution after this tag)
Model Sources
- Base model: microsoft/Phi-3-mini-4k-instruct
- Training data files: train.jsonl (37 items), eval.jsonl (6 items)
- Demo/Colab basis: Training notebook available at: https://github.com/Barghav777/AI-Lab-Report-Agent
Uses
Direct Use
- Generate readable Python code to compute derived quantities from lab observations (e.g., average g via pendulum, Coriolis acceleration, Ohm’s law resistances, radius of gyration, Reynolds number).
- Produce calculation pipelines with minimal plotting/printing that are easy to copy-paste and run in a notebook.
Downstream Use
- Course assistants or lab-prep tools that auto-draft calculation code for intro undergrad physics/mech/fluids/EE labs.
- Auto-checkers that compare student code vs. a reference implementation (with appropriate guardrails).
Out-of-Scope Use
- Any safety-critical design decisions (structural, medical, chemical process control).
- High-stakes computation without human verification.
- Domains far outside the training distribution (e.g., NLP preprocessing pipelines, advanced control systems, large-scale simulation frameworks).
Bias, Risks, and Limitations
- Small dataset (37 train / 6 eval) → plausible overfitting; brittle generalization to unseen experiment formats.
- Formula misuse risk: The model may pick incorrect constants/units or silently use wrong equations.
- Overconfidence: Generated code may “look right” while being numerically off or unit-inconsistent.
- JSON brittleness: If OBSERVATIONS keys/units differ from training patterns, the generated code may break.
Recommendations
- Always review formulas and units; add assertions/unit conversions in downstream systems.
- Run generated code with test observations and compare against hand calculations.
- For deployment, wrap outputs with explanations and references to the formulas used.
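As one way to act on the second recommendation, the sketch below hand-computes g for the two pendulum readings that appear in this card's example prompt, using the small-angle relation g = 4π²L/T², and checks the mean against a loose plausibility range. The function name `pendulum_g` and the 9.5 to 10.1 m/s² bounds are illustrative assumptions, not part of the training notebook.

```python
import math

def pendulum_g(length_m: float, period_s: float) -> float:
    """Small-angle simple pendulum: T = 2*pi*sqrt(L/g)  =>  g = 4*pi^2*L / T^2."""
    if length_m <= 0 or period_s <= 0:
        raise ValueError("length and period must be positive")
    return 4 * math.pi ** 2 * length_m / period_s ** 2

# Readings copied from the example prompt (L in metres, T in seconds).
readings = [{"L": 0.50, "T": 1.42}, {"L": 0.60, "T": 1.55}]
g_values = [pendulum_g(r["L"], r["T"]) for r in readings]
g_mean = sum(g_values) / len(g_values)

# Loose plausibility bounds (assumed); tighten them for a real lab.
assert 9.5 < g_mean < 10.1, f"suspicious g: {g_mean:.3f} m/s^2"
print(f"hand-calculated mean g = {g_mean:.2f} m/s^2")
```

A reference value like `g_mean` can then be compared with whatever the generated code prints for the same observations.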
How to Get Started
Prompt template used in training
### CONTEXT:
{context}
### OBSERVATIONS:
{observations}
### CODE:
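A minimal sketch of filling this template programmatically. The helper name `build_prompt` is hypothetical, and serializing the observations with Python's `repr` is an assumption chosen to resemble the single-quoted dict in the example prompt; the card does not state how the notebook serialized them.

```python
def build_prompt(context: str, observations: dict) -> str:
    """Fill the three-section template; generation continues after '### CODE:'."""
    return (
        "### CONTEXT:\n"
        f"{context.strip()}\n"
        "### OBSERVATIONS:\n"
        f"{observations!r}\n"  # assumption: Python-literal (JSON-like) rendering
        "### CODE:\n"
    )

example = build_prompt(
    "Experiment to determine acceleration due to gravity using a simple pendulum.",
    {"readings": [{"L": 0.50, "T": 1.42}], "unit_L": "m", "unit_T": "s"},
)
print(example)
```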
Load base + LoRA adapter (recommended)
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TextStreamer
from peft import PeftModel
import torch
base_id = "microsoft/Phi-3-mini-4k-instruct"
adapter_id = "YOUR_ADAPTER_REPO_OR_LOCAL_PATH" # e.g., ./phi3-lab-report-coder-final
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=False)
tok = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
tok.pad_token = tok.eos_token
base = AutoModelForCausalLM.from_pretrained(base_id, quantization_config=bnb,
                                            trust_remote_code=True, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)
model.eval()
prompt = """### CONTEXT:
Experiment to determine acceleration due to gravity using a simple pendulum...
### OBSERVATIONS:
{'readings': [{'L':0.50,'T':1.42}, {'L':0.60,'T':1.55}], 'unit_L':'m', 'unit_T':'s'}
### CODE:
"""
inputs = tok(prompt, return_tensors="pt").to(model.device)
streamer = TextStreamer(tok, skip_prompt=True, skip_special_tokens=True)
# Greedy decoding: temperature has no effect when do_sample=False, so it is omitted.
_ = model.generate(**inputs, max_new_tokens=400, do_sample=False, streamer=streamer)
Training Details
Data
- Files: train.jsonl (list of objects), eval.jsonl (list of objects)
- Schema per example:
  - context (str): experiment description
  - observations (dict): units + numeric readings (lists of dicts)
  - code (str): reference Python solution
- Topical spread (non-exhaustive): pendulum (g), Ohm’s law, titration, density via displacement, Coriolis accel., gyroscopic effect, Hartnell governor, rotating mass balancing, helical spring vibration, bi-filar suspension, etc.
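The schema above could be checked at load time with something like the following sketch. It assumes the files are JSON Lines (one JSON object per non-empty line) and that `observations` is stored as a JSON object rather than a string; `parse_jsonl` and `REQUIRED_FIELDS` are hypothetical names, not taken from the notebook.

```python
import json

# Field names and types as listed in the schema above.
REQUIRED_FIELDS = {"context": str, "observations": dict, "code": str}

def parse_jsonl(text: str) -> list[dict]:
    """Parse JSON Lines text and verify each record against REQUIRED_FIELDS."""
    examples = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        for field, expected_type in REQUIRED_FIELDS.items():
            if not isinstance(record.get(field), expected_type):
                raise ValueError(
                    f"line {line_no}: '{field}' missing or not {expected_type.__name__}"
                )
        examples.append(record)
    return examples

# Usage with a file on disk (the path is an assumption):
# with open("train.jsonl", encoding="utf-8") as f:
#     train = parse_jsonl(f.read())
```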
Size & basic stats
- Train: 37 items; Eval: 6 items
- Formatted prompt (context+observations+code) length (train):
  - mean ≈ 222 words (≈ 1,739 chars); 95th pct ≈ 311 words
- Reference code length (train):
  - mean ≈ 34 lines (min 9, max 71)
Training Procedure (from notebook)
- Approach: QLoRA (4-bit) SFT using trl.SFTTrainer
- Quantization: bitsandbytes 4-bit nf4, compute dtype bfloat16
- LoRA config: r=16, alpha=32, dropout=0.05, bias="none", targets = q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Tokenizer: right padding; eos_token as pad_token
- Hyperparameters (TrainingArguments):
  - epochs: 10
  - per-device train batch size: 1
  - gradient_accumulation_steps: 4
  - optimizer: paged_adamw_32bit
  - learning rate: 2e-4, weight decay: 1e-3
  - warmup_ratio: 0.03, scheduler: constant
  - bf16: True (fp16: False), group_by_length: True
  - logging_steps: 10, save/eval every 50 steps
  - report_to: tensorboard
- Saving: trainer.save_model("./phi3-lab-report-coder-final") (adapter folder)
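For orientation, the hyperparameters above imply a small number of optimizer steps. The back-of-envelope sketch below assumes a single GPU and reports a range, because whether a leftover partial accumulation window counts as a step depends on the library version; the figures are derived arithmetic, not values logged by the notebook.

```python
import math

# Values taken from the hyperparameter list above.
train_examples = 37
per_device_batch = 1
grad_accum = 4
epochs = 10
eval_every = 50

# Examples consumed per optimizer step (single GPU assumed).
effective_batch = per_device_batch * grad_accum
batches_per_epoch = math.ceil(train_examples / per_device_batch)

# Lower bound drops the partial accumulation window, upper bound counts it.
steps_low = (batches_per_epoch // grad_accum) * epochs
steps_high = math.ceil(batches_per_epoch / grad_accum) * epochs

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: between {steps_low} and {steps_high}")
print(f"evaluations at a {eval_every}-step interval: "
      f"{steps_low // eval_every} to {steps_high // eval_every}")
```

Under these assumptions the 50-step save/eval interval fires only once or twice during the whole run, which may be worth keeping in mind when reading the training curves.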
Speeds, Sizes, Times
- Hardware: Google Colab T4 GPU (per notebook metadata)
- Adapter artifact: LoRA weights only (load with the base model).
- Wall-clock time: not logged in the notebook.
Evaluation
Testing Data, Factors & Metrics
- Eval set: eval.jsonl (6 items) with same schema.
- Primary metric (planned): ROUGE-L / ROUGE-1 against reference code (proxy for surface similarity).
- Recommended additional checks: unit tests on numeric outputs; pyflakes/ruff for syntax; run-time assertions.
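To make the planned surface-similarity metric concrete, here is a dependency-free sketch of ROUGE-L (F1 over the longest common subsequence). Whitespace tokenization and the function names are assumptions; an established ROUGE package would normally be preferred for reported numbers, and ROUGE-1 is not covered here.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 on whitespace tokens (tokenization is an assumption)."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("g = 4 * pi ** 2 * L / T ** 2", "g = 4 * pi**2 * L / T**2"))
```

Because the score only measures token overlap, code that is numerically wrong can still score highly, which is why the numeric checks listed above matter.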
Results
- No automated score recorded in the notebook.
- Suggested protocol:
  - Generate code for each eval item using the same prompt template.
  - Execute safely in a sandbox with provided observations.
  - Compare computed scalars (e.g., average g, R, Reynolds number) to ground truth within tolerances.
  - Report pass rate and ROUGE for readability/similarity.
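The protocol above could be sketched roughly as follows. This version assumes the generated code prints its target scalar as a bare number on the last line of stdout, which is a convention introduced here for illustration. Note that a child process with a timeout is not a security sandbox; real deployments would need proper isolation (container, restricted user, no network). The names `check_generated_code` and `pass_rate` are hypothetical.

```python
import ast
import math
import subprocess
import sys

def check_generated_code(code: str, expected: float, rel_tol: float = 0.02,
                         timeout_s: float = 10.0) -> bool:
    """True if `code` parses, runs cleanly, and its last printed line is within rel_tol of `expected`."""
    try:
        ast.parse(code)  # cheap syntax check before executing anything
    except SyntaxError:
        return False
    try:
        # NOTE: process isolation only; this is NOT a real sandbox.
        result = subprocess.run([sys.executable, "-c", code],
                                capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False
    if result.returncode != 0:
        return False
    lines = result.stdout.strip().splitlines()
    if not lines:
        return False
    try:
        value = float(lines[-1])
    except ValueError:
        return False
    return math.isclose(value, expected, rel_tol=rel_tol)

def pass_rate(outcomes: list[bool]) -> float:
    """Fraction of eval items that passed (0.0 for an empty list)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

For example, `pass_rate([check_generated_code(code, truth) for code, truth in items])` would give the pass rate over a list of `(generated_code, ground_truth)` pairs.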
Model Examination (optional)
- Inspect token-by-token attention to OBSERVATIONS keys (ablation: shuffle keys to test robustness).
- Add unit-check helpers (e.g., pint) in prompts to encourage explicit conversions.
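The key-shuffling ablation mentioned above might be set up as in this sketch: it rebuilds the observations dict with keys in a random order while leaving values and reading order untouched, so any change in the generated code can be attributed to key order alone. The helper name `shuffle_keys` is hypothetical.

```python
import random

def shuffle_keys(obj, rng: random.Random):
    """Recursively rebuild dicts with keys in random order; values and list order are unchanged."""
    if isinstance(obj, dict):
        keys = list(obj)
        rng.shuffle(keys)
        return {k: shuffle_keys(obj[k], rng) for k in keys}
    if isinstance(obj, list):
        return [shuffle_keys(item, rng) for item in obj]
    return obj

observations = {"readings": [{"L": 0.50, "T": 1.42}], "unit_L": "m", "unit_T": "s"}
variant = shuffle_keys(observations, random.Random(0))  # fixed seed for reproducibility
print(variant)
```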
Environmental Impact
- Hardware Type: NVIDIA T4 (Colab)
- Precision: 4-bit QLoRA with bfloat16 compute
- Hours used: Not recorded (dataset is small; expected low)
- Cloud Provider/Region: Colab (unspecified)
- Carbon Emitted: Not estimated (see ML CO2 Impact calculator)
Technical Specifications
Architecture & Objective
- Backbone: Phi-3-mini-4k-instruct (decoder-only causal LM)
- Objective: Supervised fine-tuning to continue from ### CODE: with correct, executable Python.
Compute Infrastructure
- Hardware: Colab GPU (T4) + CPU RAM
- Software: transformers, trl, peft, bitsandbytes, datasets, accelerate, torch
Citation
@article{abdin2024phi3,
  title   = {Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone},
  author  = {Abdin, Marah and others},
  journal = {arXiv preprint arXiv:2404.14219},
  year    = {2024},
  doi     = {10.48550/arXiv.2404.14219},
  url     = {https://arxiv.org/abs/2404.14219}
}
Glossary
- QLoRA: Fine-tuning with low-rank adapters on a quantized base model (saves memory/compute).
- LoRA (r, α): Rank and scaling of low-rank update matrices.
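In symbols, and assuming the standard (non-rank-stabilized) LoRA scaling, the adapted weight for a frozen base matrix can be written as below. The notation W_0, d, and k is introduced here for illustration.

```latex
W' = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

With the values used for this adapter (r = 16, α = 32), the scaling factor α/r equals 2; only A and B are trained, while W_0 stays frozen in its quantized form.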
More Information
- For better robustness, consider augmenting data with unit-perturbation and noise-in-readings variants, and add examples across more domains (materials, thermo, optics).
- Add eval harness with numeric tolerances and syntax checks.
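The two augmentation ideas above could look like this sketch, which assumes observations shaped like the pendulum example in this card (a `readings` list of dicts plus `unit_*` tags). The function names, the Gaussian noise model, and the metre-to-centimetre conversion are illustrative assumptions.

```python
import copy
import random

def add_reading_noise(observations: dict, rel_sigma: float, rng: random.Random) -> dict:
    """Return a copy with Gaussian relative noise on every numeric value inside 'readings'."""
    noisy = copy.deepcopy(observations)
    for reading in noisy.get("readings", []):
        for key, value in reading.items():
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                reading[key] = value * (1 + rng.gauss(0, rel_sigma))
    return noisy

def metres_to_centimetres(observations: dict, key: str = "L",
                          unit_key: str = "unit_L") -> dict:
    """Unit-perturbation variant: express a length column in cm and update its unit tag."""
    if observations.get(unit_key) != "m":
        raise ValueError("expected lengths in metres")
    converted = copy.deepcopy(observations)
    for reading in converted.get("readings", []):
        reading[key] = reading[key] * 100
    converted[unit_key] = "cm"
    return converted

base = {"readings": [{"L": 0.50, "T": 1.42}], "unit_L": "m", "unit_T": "s"}
print(add_reading_noise(base, rel_sigma=0.01, rng=random.Random(42)))
print(metres_to_centimetres(base))
```

For a unit-perturbed variant to be a valid training example, its reference code must also handle the new unit (for instance by converting cm back to m); otherwise the label no longer matches the input.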
Model Card Authors
- Barghav777