EduDolphin 🐬📚
A fine‑tuned Llama 3.1 8B model specialized for learning analytics and academic insights.
TL;DR — EduDolphin analyzes educational datasets to surface patterns in student performance, engagement (VLE), demographics, and assessment design. Trained on carefully crafted prompts derived from OULAD. Use the Alpaca‑style prompt template below.
Model Summary
- Developer: Matteo Angeloni (@matteoangeloni)
- Base model: meta-llama/Meta-Llama-3.1-8B
- Method: LoRA fine‑tuning with Unsloth + TRL
- Primary artifact: merged FP16 (safetensors)
- Other artifacts: LoRA adapters; optional 4‑bit merged (env‑sensitive)
- Languages: English
- Domain: Educational Data / Learning Analytics
- License: Llama 3 — access requires accepting Meta’s license on the Hub (gated)
Intended Uses
Primary
- Learning Analytics: detect performance patterns, retention risks, intervention windows.
- Assessment Analytics: reason over assessment types (TMA/CMA/exams), timing, grade distributions.
- Demographics & Equity: surface correlations and disparities in outcomes.
- VLE Behavior: interpret clickstream/engagement sequences across weeks and materials.
- Academic Planning: support course design decisions with evidence‑oriented insights.
Limitations / Out‑of‑Scope
- High‑stakes automated decision‑making without human review.
- Any non‑anonymized student data processing (you must anonymize upstream).
- General domain tasks unrelated to education (the model is domain‑biased).
Prompting Format (Alpaca)
Use this template for best results:
```
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
```
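A small helper (a convenience sketch, not part of the repo) that fills the template:

```python
def build_prompt(instruction: str, input_text: str) -> str:
    """Fill the Alpaca-style template EduDolphin was trained on."""
    return (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context. Write a response that appropriately "
        "completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{input_text}\n\n"
        "### Response:\n"
    )
```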
Minimal Example
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "matteoangeloni/EduDolphin"

model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

prompt = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n"
    "Task: Assessment Performance Analysis for Module AAA (Category: Learning Analytics)\n\n"
    "### Input:\n"
    "Analyze the assessment performance data for module AAA. We have 2,847 total submissions "
    "with an average score of 67.3% and a pass rate of 71.2%. What insights can you derive?\n\n"
    "### Response:\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# (optional) override the default generation settings; temperature/top_p
# only take effect with do_sample=True
outputs = model.generate(**inputs, max_new_tokens=256,
                         do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Files & Variants
| Artifact | Purpose | Notes |
|---|---|---|
| FP16 merged (default) | Ready‑to‑use full model with LoRA merged | Recommended for most users; broad backend support |
| LoRA adapters | Combine with base meta-llama/Meta-Llama-3.1-8B | Smaller download; flexible for further fine‑tuning |
| 4‑bit merged (optional) | Lower footprint | Requires bitsandbytes; not supported by all runtimes (e.g., some TGI/TEI deployments) |

Always distribute the tokenizer and a `generation_config.json` alongside the weights to avoid inference mismatches.
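If you prefer the adapter route, here is a minimal loading sketch with PEFT (`pip install peft`); the adapter repo id below is a placeholder, not a confirmed path:

```python
# Sketch: apply the LoRA adapters to the gated base model with PEFT.
# "matteoangeloni/EduDolphin-lora" is a placeholder id; use the actual adapter repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",  # gated: requires accepting Meta's license
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "matteoangeloni/EduDolphin-lora")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
```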
Training Data
Source: Open University Learning Analytics Dataset (OULAD)
Underlying tables (original OULAD):
- ~173,912 student assessment records
- ~10,655,280 VLE interaction logs
- ~32,593 student demographic profiles
- 6,364 learning material records
- 206 assessment configurations
Prompt dataset (derived from OULAD): 6,215 examples total
- Train: 5,593
- Validation: 622
Categories covered (examples):
- Individual Material Analytics (4,781)
- Weekly Engagement Analytics (878)
- Complex Demographic Analytics (353)
- Granular Performance Analytics (64)
- Submission Timing Analytics (38)
- Click Behavior Analytics (35)
- Learning Journey Analytics (33)
- Registration Timing Analytics (33)
Notes: Data were anonymized/aggregated for prompt construction. No raw personal identifiers are included.
Training Procedure
- Framework: Unsloth + Hugging Face TRL
- Base Model: Llama 3.1 8B
- Finetuning: LoRA
- Epochs: 2
- Batch size (per device): 8
- Gradient Accumulation: 8
- Learning Rate: 2e-5
- Max Seq Len: 1024
- Optimizer: AdamW (8‑bit)
- Speed‑ups: Unsloth (~2× faster)
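For orientation, a hedged sketch of what this setup looks like with Unsloth + TRL; the dataset files, LoRA rank/alpha, and target modules below are assumptions, not the exact training script:

```python
# Sketch only: reconstructs the training run from the hyperparameters above.
# Dataset files and LoRA settings (r, alpha, target modules) are assumptions.
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3.1-8B",
    max_seq_length=1024,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,           # assumed rank; not stated in the card
    lora_alpha=16,  # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Placeholder file names for the 5,593/622 train/validation split.
ds = load_dataset("json", data_files={"train": "train.jsonl", "validation": "val.jsonl"})

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,       # `processing_class` in newer TRL releases
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    dataset_text_field="text",  # moved into SFTConfig in newer TRL
    max_seq_length=1024,
    args=TrainingArguments(
        num_train_epochs=2,
        per_device_train_batch_size=8,
        gradient_accumulation_steps=8,
        learning_rate=2e-5,
        optim="adamw_8bit",
        fp16=True,
        output_dir="outputs",
    ),
)
trainer.train()
```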
Export & Publishing
- Publish FP16 merged as the primary artifact.
- Also publish LoRA adapters for flexibility.
- 4‑bit merged is optional and environment‑sensitive.
- Include `tokenizer/` and `generation_config.json` in each artifact folder.
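A sketch of the merge step with PEFT's `merge_and_unload` (repo ids and the adapter path are placeholders; Unsloth's merged-save utilities are an alternative route):

```python
# Sketch: merge the LoRA adapters into FP16 weights and publish to the Hub.
# Paths and repo ids are placeholders, not the exact release script.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B", torch_dtype=torch.float16
)
merged = PeftModel.from_pretrained(base, "path/to/lora-adapters").merge_and_unload()
merged.push_to_hub("matteoangeloni/EduDolphin", safe_serialization=True)

# Ship the tokenizer alongside the weights (see note in Files & Variants).
AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B").push_to_hub("matteoangeloni/EduDolphin")
```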
Evaluation (Current Status)
No standardized benchmark is reported yet. Internal checks focused on:
- Faithfulness of schema‑aware reasoning over OULAD‑like contexts
- Consistency of recommendations given aggregate statistics
- Stability under temperature variations (0.2–0.9)
Community PRs with rigorous evaluation suites are welcome.
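As an illustration of the temperature check (not the project's actual harness), reusing `model`, `tokenizer`, and `inputs` from the Minimal Example above:

```python
# Illustrative stability probe: generate the same prompt at several temperatures
# and compare outputs by eye. Reuses `model`, `tokenizer`, `inputs` from above.
for temp in (0.2, 0.5, 0.7, 0.9):
    out = model.generate(**inputs, max_new_tokens=128,
                         do_sample=True, temperature=temp, top_p=0.9)
    print(f"--- temperature={temp} ---")
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```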
Ethical Considerations
- Privacy: Use only anonymized/aggregated student data. Comply with GDPR/institutional policies.
- Bias & Fairness: OULAD reflects a specific context; validate insights locally before action.
- Human Oversight: Treat outputs as decision support, not decisions.
- Transparency: Disclose AI assistance in analyses/reports.
Security & Access
- Do NOT hard‑code tokens. Use environment variables (e.g., `HF_TOKEN`) and revoke any exposed token immediately.
- License: Llama 3. Users must accept Meta’s license on the Hub. Consider enabling gated access.
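For example, a minimal sketch reading the token from the environment with `huggingface_hub`:

```python
# Read the access token from the environment instead of hard-coding it.
import os
from huggingface_hub import login

login(token=os.environ["HF_TOKEN"])
```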
How to Cite
```bibtex
@misc{angeloni2024edudolphin,
  title        = {EduDolphin: A Fine-tuned Language Model for Educational Data Analysis},
  author       = {Matteo Angeloni},
  year         = {2024},
  howpublished = {Hugging Face Model Hub},
  url          = {https://huggingface.co/matteoangeloni/EduDolphin}
}
```
Acknowledgments
Thanks to Unsloth for efficient fine‑tuning tooling, Hugging Face TRL for training utilities, and OULAD for the public dataset.
Quick Setup
```bash
pip install --upgrade transformers accelerate
# Optional (for 4-bit merges)
pip install bitsandbytes
```
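If you take the 4‑bit route, a minimal runtime-quantization sketch (the settings below are illustrative defaults, not a tested recipe):

```python
# Sketch: load the FP16 merged model in 4-bit at runtime via bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    "matteoangeloni/EduDolphin",
    quantization_config=bnb,
    device_map="auto",
)
```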