Model Card for Phi3-Lab-Report-Coder (LoRA on Phi-3 Mini 4k Instruct)

A lightweight LoRA adapter fine-tuned from microsoft/Phi-3-mini-4k-instruct that turns structured lab contexts and observations into executable Python code for the target calculations (e.g., mechanics, fluids, vibrations, basic circuits, titrations). Trained with QLoRA in 4-bit, the model is intended as an assistive code generator for STEM lab writeups and teaching demos, not as a certified calculator for safety-critical engineering.

Model Details

Model Description

Developed by: Barghav777
Model type: Causal decoder LM (instruction-tuned) + LoRA adapter
Languages: English
License: MIT
Finetuned from: microsoft/Phi-3-mini-4k-instruct
Intended input format: A structured prompt with:
- ### CONTEXT: (natural-language description of the experiment)
- ### OBSERVATIONS: (JSON-like dict with units, readings)
- ### CODE: (the model is trained to generate the Python solution after this tag)

Model Sources

Base model: microsoft/Phi-3-mini-4k-instruct
Training data files: train.jsonl (37 items), eval.jsonl (6 items)
Demo/Colab basis: Training notebook available at: https://github.com/Barghav777/AI-Lab-Report-Agent

Uses

Direct Use

Generate readable Python code to compute derived quantities from lab observations (e.g., average g via pendulum, Coriolis acceleration, Ohm’s law resistances, radius of gyration, Reynolds number).
Produce calculation pipelines with minimal plotting/printing that are easy to copy-paste and run in a notebook.

Downstream Use

Course assistants or lab-prep tools that auto-draft calculation code for intro undergrad physics/mech/fluids/EE labs.
Auto-checkers that compare student code vs. a reference implementation (with appropriate guardrails).

Out-of-Scope Use

Any safety-critical design decisions (structural, medical, chemical process control).
High-stakes computation without human verification.
Domains far outside the training distribution (e.g., NLP preprocessing pipelines, advanced control systems, large-scale simulation frameworks).

Bias, Risks, and Limitations

Small dataset (37 train / 6 eval): overfitting is plausible, and generalization to unseen experiment formats may be brittle.
Formula misuse risk: The model may pick incorrect constants/units or silently use wrong equations.
Overconfidence: Generated code may “look right” while being numerically off or unit-inconsistent.
JSON brittleness: If OBSERVATIONS keys/units differ from training patterns, the code may break.

Recommendations

Always review formulas and units; add assertions/unit conversions in downstream systems.
Run generated code with test observations and compare against hand calculations.
For deployment, wrap outputs with explanations and references to the formulas used.
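
As one way to follow the second recommendation, the sketch below hand-calculates g for the pendulum readings used in this card's example prompt, so that generated code can be compared against it. The helper name `pendulum_g` and the plausibility band are illustrative assumptions, not part of the training notebook.

```python
import math

def pendulum_g(length_m: float, period_s: float) -> float:
    """Return g from the simple-pendulum relation T = 2*pi*sqrt(L/g)."""
    if length_m <= 0 or period_s <= 0:
        raise ValueError("length and period must be positive")
    return 4 * math.pi ** 2 * length_m / period_s ** 2

# Readings copied from the example prompt in this card (L in m, T in s).
observations = {
    "readings": [{"L": 0.50, "T": 1.42}, {"L": 0.60, "T": 1.55}],
    "unit_L": "m",
    "unit_T": "s",
}

# Unit guard: fail loudly instead of silently mixing units.
assert observations["unit_L"] == "m" and observations["unit_T"] == "s"

g_values = [pendulum_g(r["L"], r["T"]) for r in observations["readings"]]
g_mean = sum(g_values) / len(g_values)

# Plausibility band (an assumed, deliberately wide range around 9.81 m/s^2).
assert 9.0 < g_mean < 10.5, f"implausible g: {g_mean:.3f} m/s^2"
print(f"hand-calculated mean g = {g_mean:.3f} m/s^2")
```

The same pattern (explicit unit guard, independent formula, plausibility band) can be repeated for other experiments before trusting the model's output.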

How to Get Started

Prompt template used in training

### CONTEXT:
{context}

### OBSERVATIONS:
{observations}

### CODE:
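
A minimal sketch of filling this template programmatically is shown below. The helper name `build_prompt` is hypothetical, and serializing the observations with Python's `repr` (single-quoted dict style, as in the example prompt in this card) is an assumption about how the training notebook formatted them.

```python
def build_prompt(context: str, observations: dict) -> str:
    """Fill the CONTEXT / OBSERVATIONS / CODE template used in this card."""
    return (
        "### CONTEXT:\n"
        f"{context.strip()}\n\n"
        "### OBSERVATIONS:\n"
        # Assumption: observations are rendered as a Python-style dict literal.
        f"{observations!r}\n\n"
        "### CODE:\n"
    )

example_prompt = build_prompt(
    "Experiment to determine acceleration due to gravity using a simple pendulum.",
    {"readings": [{"L": 0.50, "T": 1.42}], "unit_L": "m", "unit_T": "s"},
)
print(example_prompt)
```

The prompt ends right after the `### CODE:` tag so that the model's continuation is the Python solution.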

Load base + LoRA adapter (recommended)

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TextStreamer
from peft import PeftModel
import torch

base_id = "microsoft/Phi-3-mini-4k-instruct"
adapter_id = "YOUR_ADAPTER_REPO_OR_LOCAL_PATH"  # e.g., ./phi3-lab-report-coder-final

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=False)

tok = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
tok.pad_token = tok.eos_token

base = AutoModelForCausalLM.from_pretrained(base_id, quantization_config=bnb,
                                            trust_remote_code=True, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)
model.eval()

prompt = """### CONTEXT:
Experiment to determine acceleration due to gravity using a simple pendulum...

### OBSERVATIONS:
{'readings': [{'L':0.50,'T':1.42}, {'L':0.60,'T':1.55}], 'unit_L':'m', 'unit_T':'s'}

### CODE:
"""

inputs = tok(prompt, return_tensors="pt").to(model.device)
streamer = TextStreamer(tok, skip_prompt=True, skip_special_tokens=True)
# Greedy decoding: temperature only takes effect when do_sample=True, so it is omitted here.
_ = model.generate(**inputs, max_new_tokens=400, do_sample=False, streamer=streamer)

Training Details

Data

Files: train.jsonl (list of objects), eval.jsonl (list of objects)
Schema per example:
- context (str): experiment description
- observations (dict): units + numeric readings (lists of dicts)
- code (str): reference Python solution
Topical spread (non-exhaustive): pendulum (g), Ohm’s law, titration, density via displacement, Coriolis accel., gyroscopic effect, Hartnell governor, rotating mass balancing, helical spring vibration, bi-filar suspension, etc.
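
The schema above could be checked at load time with a sketch like the following. It assumes the files hold one JSON object per line (the usual JSONL convention); `load_jsonl` and `validate_example` are hypothetical helper names.

```python
import json

# Field names and types taken from the schema listed in this card.
REQUIRED_FIELDS = {"context": str, "observations": dict, "code": str}

def validate_example(example: dict) -> list[str]:
    """Return a list of schema problems (empty if the example is valid)."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in example:
            problems.append(f"missing field: {field}")
        elif not isinstance(example[field], expected):
            problems.append(
                f"{field} should be {expected.__name__}, "
                f"got {type(example[field]).__name__}"
            )
    return problems

def load_jsonl(path: str) -> list[dict]:
    """Read a JSONL file, skipping blank lines and rejecting invalid rows."""
    examples = []
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            if not line.strip():
                continue
            example = json.loads(line)
            problems = validate_example(example)
            if problems:
                raise ValueError(f"{path}:{line_no}: {'; '.join(problems)}")
            examples.append(example)
    return examples
```

Usage would look like `train = load_jsonl("train.jsonl")`, with a `ValueError` pointing at the first malformed line.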

Size & basic stats

Train: 37 items; Eval: 6 items
Formatted prompt (context+observations+code) length (train):
- mean ≈ 222 words (≈ 1,739 chars); 95th pct ≈ 311 words
Reference code length (train):
- mean ≈ 34 lines (min 9, max 71)
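
Statistics of this kind could be reproduced with a sketch along the following lines. The exact formatting and percentile method used for the figures above are not stated in the notebook summary, so `format_example`, the whitespace word count, and the use of `statistics.quantiles` are assumptions; the two toy examples are placeholders, not dataset rows.

```python
import statistics

def format_example(example: dict) -> str:
    """Join context, observations and code in the card's prompt layout."""
    return (
        f"### CONTEXT:\n{example['context']}\n\n"
        f"### OBSERVATIONS:\n{example['observations']}\n\n"
        f"### CODE:\n{example['code']}"
    )

def length_stats(examples: list[dict]) -> dict:
    """Summarize prompt and reference-code lengths (needs >= 2 examples)."""
    texts = [format_example(e) for e in examples]
    words = [len(t.split()) for t in texts]            # whitespace-delimited words
    chars = [len(t) for t in texts]
    code_lines = [len(e["code"].splitlines()) for e in examples]
    return {
        "mean_words": statistics.mean(words),
        "mean_chars": statistics.mean(chars),
        # Last of 19 cut points when splitting into 20 groups = 95th percentile.
        "p95_words": statistics.quantiles(words, n=20)[-1],
        "mean_code_lines": statistics.mean(code_lines),
        "min_code_lines": min(code_lines),
        "max_code_lines": max(code_lines),
    }

# Toy placeholder examples, only to show the call shape.
toy = [
    {"context": "Pendulum lab.", "observations": {"L": 0.5}, "code": "x = 1\nprint(x)"},
    {"context": "Ohm's law lab.", "observations": {"V": 2.0}, "code": "y = 2"},
]
stats = length_stats(toy)
print(stats)
```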

Training Procedure (from notebook)

Approach: QLoRA (4-bit) SFT using trl.SFTTrainer
Quantization: bitsandbytes 4-bit nf4, compute dtype bfloat16
LoRA config: r=16, alpha=32, dropout=0.05, bias="none", targets = q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj
Tokenizer: right padding; eos_token as pad_token
Hyperparameters (TrainingArguments):
- epochs: 10
- per-device train batch size: 1
- gradient_accumulation_steps: 4
- optimizer: paged_adamw_32bit
- learning rate: 2e-4, weight decay: 1e-3
- warmup_ratio: 0.03, scheduler: constant
- bf16: True (fp16: False), group_by_length: True
- logging_steps: 10, save/eval every 50 steps
- report_to: tensorboard
Saving: trainer.save_model("./phi3-lab-report-coder-final") (adapter folder)
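
The hyperparameters listed above can be collected into keyword dictionaries as sketched below. The keys follow the `peft.LoraConfig` and `transformers.TrainingArguments` argument names, which may differ across library versions; `LORA_KWARGS`, `TRAINING_KWARGS`, `build_configs`, the `task_type` value, and the default `output_dir` are assumptions for illustration, not values confirmed by the notebook.

```python
# LoRA settings as listed in this card.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",  # assumption: not stated in the card
    # Copied from the card; worth verifying against the base model's actual
    # module names (e.g., via model.named_modules()) before training.
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

# Trainer settings as listed in this card.
TRAINING_KWARGS = {
    "num_train_epochs": 10,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 4,
    "optim": "paged_adamw_32bit",
    "learning_rate": 2e-4,
    "weight_decay": 1e-3,
    "warmup_ratio": 0.03,
    "lr_scheduler_type": "constant",
    "bf16": True,
    "fp16": False,
    "group_by_length": True,
    "logging_steps": 10,
    "save_steps": 50,
    "eval_steps": 50,  # the step-based eval strategy argument is version-dependent
    "report_to": "tensorboard",
}

def build_configs(output_dir: str = "./results"):
    """Instantiate the config objects (imports deferred so the dicts stay inspectable)."""
    from peft import LoraConfig
    from transformers import TrainingArguments
    return (
        LoraConfig(**LORA_KWARGS),
        TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS),
    )

# Sequences contributing to each optimizer step on a single GPU.
effective_batch_size = (
    TRAINING_KWARGS["per_device_train_batch_size"]
    * TRAINING_KWARGS["gradient_accumulation_steps"]
)
print("effective batch size:", effective_batch_size)
```

With a per-device batch of 1 and 4 accumulation steps, each optimizer update sees 4 examples, which matters for a 37-item training set.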

Speeds, Sizes, Times

Hardware: Google Colab T4 GPU (per notebook metadata)
Adapter artifact: LoRA weights only (load with the base model).
Wall-clock time: not logged in the notebook.

Evaluation

Testing Data, Factors & Metrics

Eval set: eval.jsonl (6 items) with same schema.
Primary metric (planned): ROUGE-L / ROUGE-1 against reference code (proxy for surface similarity).
Recommended additional checks: unit tests on numeric outputs; pyflakes/ruff for syntax; run-time assertions.
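
A dependency-free sketch of two of these checks follows: a syntax check via `ast.parse` and a simplified ROUGE-L F1 based on the longest common subsequence. Whitespace tokenization is an assumption here; established ROUGE packages tokenize differently, so scores will not match them exactly. The function names are hypothetical.

```python
import ast

def syntax_ok(code: str) -> bool:
    """Return True if `code` parses as Python source."""
    try:
        ast.parse(code)
    except SyntaxError:
        return False
    return True

def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L F1 over whitespace tokens."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(syntax_ok("g = 4 * 3.14 ** 2"), syntax_ok("def f(:"))
print(round(rouge_l_f1("g = 4 * pi", "g = 4 * pi ** 2"), 3))
```

Surface similarity does not imply numerical correctness, so a score like this is best read alongside execution-based checks.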

Results

No automated score recorded in the notebook.
Suggested protocol:
1. Generate code for each eval item using the same prompt template.
2. Execute safely in a sandbox with provided observations.
3. Compare computed scalars (e.g., average g, R, Reynolds number) to ground truth within stated tolerances.
4. Report pass rate and ROUGE for readability/similarity.
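
Steps 2 to 4 of this protocol might be sketched as follows. A child interpreter with a timeout is used as a stand-in for a sandbox and is not real isolation; untrusted code should run in a container or similar. The names `run_and_read` and `passes`, the per-item `result_name` field, and the toy eval items are all assumptions for illustration.

```python
import json
import math
import subprocess
import sys

def run_and_read(code: str, result_name: str, timeout_s: float = 10.0) -> float:
    """Run `code` in a child interpreter and return the value of `result_name`."""
    # Append a final line printing the chosen variable as JSON, so parsing does
    # not depend on how the generated code formats its own output.
    script = f"{code}\nimport json as _json\nprint(_json.dumps(float({result_name})))\n"
    proc = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True, text=True, timeout=timeout_s,
    )
    if proc.returncode != 0:
        raise RuntimeError(proc.stderr.strip() or "generated code failed")
    return json.loads(proc.stdout.strip().splitlines()[-1])

def passes(code: str, result_name: str, expected: float, rel_tol: float = 0.01) -> bool:
    """True if the code runs and its result is within `rel_tol` of `expected`."""
    try:
        value = run_and_read(code, result_name)
    except (RuntimeError, subprocess.TimeoutExpired, ValueError, IndexError):
        return False
    return math.isclose(value, expected, rel_tol=rel_tol)

# Toy items standing in for model outputs on eval.jsonl (not real results).
eval_items = [
    {"code": "import math\ng = 4 * math.pi**2 * 0.50 / 1.42**2",
     "result_name": "g", "expected": 9.79},
    {"code": "g = 1 / 0", "result_name": "g", "expected": 9.79},
]
results = [passes(i["code"], i["result_name"], i["expected"]) for i in eval_items]
pass_rate = sum(results) / len(results)
print(f"pass rate: {pass_rate:.0%}")
```

Reading a named variable rather than scraping printed text avoids false matches on digits inside unit strings such as `m/s^2`.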

Model Examination (optional)

Inspect token-by-token attention to OBSERVATIONS keys (ablation: shuffle keys to test robustness).
Add unit-check helpers (e.g., pint) in prompts to encourage explicit conversions.
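
The key-shuffle ablation mentioned above could be prepared with a sketch like this one, which reorders dictionary keys recursively while leaving values and list order untouched. The helper name `shuffle_keys` is hypothetical.

```python
import random

def shuffle_keys(obj, rng: random.Random):
    """Return a copy of `obj` with dict key order shuffled at every nesting level."""
    if isinstance(obj, dict):
        keys = list(obj)
        rng.shuffle(keys)
        return {k: shuffle_keys(obj[k], rng) for k in keys}
    if isinstance(obj, list):
        # Reading order is kept: it may carry meaning (e.g., trial sequence).
        return [shuffle_keys(item, rng) for item in obj]
    return obj

observations = {
    "readings": [{"L": 0.50, "T": 1.42}, {"L": 0.60, "T": 1.55}],
    "unit_L": "m",
    "unit_T": "s",
}
shuffled = shuffle_keys(observations, random.Random(0))
print(shuffled)
```

Because the shuffled dict is semantically identical to the original, any change in the generated code's numeric result points to sensitivity to key order rather than to the data.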

Environmental Impact

Hardware Type: NVIDIA T4 (Colab)
Precision: 4-bit QLoRA with bfloat16 compute
Hours used: Not recorded (dataset is small; expected low)
Cloud Provider/Region: Colab (unspecified)
Carbon Emitted: Not estimated (see ML CO2 Impact calculator)

Technical Specifications

Architecture & Objective

Backbone: Phi-3-mini-4k-instruct (decoder-only causal LM)
Objective: Supervised fine-tuning to continue from ### CODE: with correct, executable Python.

Compute Infrastructure

Hardware: Colab GPU (T4) + CPU RAM
Software:
- transformers, trl, peft, bitsandbytes, datasets, accelerate, torch

Citation

@article{abdin2024phi3,
  title   = {Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone},
  author  = {Abdin, Marah and others},
  journal = {arXiv preprint arXiv:2404.14219},
  year    = {2024},
  doi     = {10.48550/arXiv.2404.14219},
  url     = {https://arxiv.org/abs/2404.14219}
}

Glossary

QLoRA: Fine-tuning with low-rank adapters on a quantized base model (saves memory/compute).
LoRA (r, α): Rank and scaling of low-rank update matrices.
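
For reference, the standard LoRA formulation (assuming the default scaling, not a rank-stabilized variant) keeps a pretrained weight $W_0$ frozen and learns a low-rank update. The symbols $d$, $k$, $A$, and $B$ are introduced here for illustration.

```latex
W = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

With the values used for this adapter ($r = 16$, $\alpha = 32$), the scaling factor is $\alpha / r = 2$.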

More Information

For better robustness, consider augmenting data with unit-perturbation and noise-in-readings variants, and add examples across more domains (materials, thermo, optics).
Add eval harness with numeric tolerances and syntax checks.
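
The augmentation ideas above (noise in readings, unit perturbation) could be prototyped as sketched below. The function names, the 1% relative noise level, and the metre-to-centimetre example are illustrative assumptions. Augmented items only stay consistent with their reference code if that code reads values and units from the observations rather than hard-coding them.

```python
import copy
import random

def add_reading_noise(observations: dict, rel_sigma: float = 0.01, seed: int = 0) -> dict:
    """Return a copy with Gaussian relative noise applied to numeric readings."""
    rng = random.Random(seed)
    noisy = copy.deepcopy(observations)
    for reading in noisy.get("readings", []):
        for key, value in reading.items():
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                reading[key] = value * (1 + rng.gauss(0, rel_sigma))
    return noisy

def metres_to_centimetres(observations: dict, key: str = "L", unit_key: str = "unit_L") -> dict:
    """Return a copy with one length field rescaled from m to cm and relabelled."""
    if observations.get(unit_key) != "m":
        raise ValueError(f"expected {unit_key} == 'm'")
    converted = copy.deepcopy(observations)
    for reading in converted["readings"]:
        reading[key] = reading[key] * 100
    converted[unit_key] = "cm"
    return converted

observations = {
    "readings": [{"L": 0.50, "T": 1.42}, {"L": 0.60, "T": 1.55}],
    "unit_L": "m",
    "unit_T": "s",
}
noisy = add_reading_noise(observations)
in_cm = metres_to_centimetres(observations)
print(noisy)
print(in_cm)
```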

Model Card Authors

Barghav777
