EduDolphin 🐬📚
A fine‑tuned Llama 3.1 8B model specialized for learning analytics and academic insights.
TL;DR — EduDolphin analyzes educational datasets to surface patterns in student performance, engagement (VLE), demographics, and assessment design. Trained on carefully crafted prompts derived from OULAD. Use the Alpaca‑style prompt template below.
Model Summary
- Developer: Matteo Angeloni (@matteoangeloni)
- Base model: meta-llama/Meta-Llama-3.1-8B
- Method: LoRA fine‑tuning with Unsloth + TRL
- Primary artifact: merged FP16 (safetensors)
- Other artifacts: LoRA adapters; optional 4‑bit merged (env‑sensitive)
- Languages: English
- Domain: Educational Data / Learning Analytics
- License: Llama 3 — access requires accepting Meta’s license on the Hub (gated)
Intended Uses
Primary
- Learning Analytics: detect performance patterns, retention risks, intervention windows.
- Assessment Analytics: reason over assessment types (TMA/CMA/exams), timing, grade distributions.
- Demographics & Equity: surface correlations and disparities in outcomes.
- VLE Behavior: interpret clickstream/engagement sequences across weeks and materials.
- Academic Planning: support course design decisions with evidence‑oriented insights.
Limitations / Out‑of‑Scope
- High‑stakes automated decision‑making without human review.
- Any non‑anonymized student data processing (you must anonymize upstream).
- General domain tasks unrelated to education (the model is domain‑biased).
Prompting Format (Alpaca)
Use this template for best results:
```
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
```
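A small helper (a convenience sketch, not part of the repo) that fills the template:

```python
def build_prompt(instruction: str, input_text: str) -> str:
    """Fill the Alpaca-style template EduDolphin was trained on."""
    return (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context. Write a response that appropriately "
        "completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{input_text}\n\n"
        "### Response:\n"
    )
```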
Minimal Example
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "matteoangeloni/EduDolphin"

model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

prompt = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n"
    "Task: Assessment Performance Analysis for Module AAA (Category: Learning Analytics)\n\n"
    "### Input:\n"
    "Analyze the assessment performance data for module AAA. We have 2,847 total submissions "
    "with an average score of 67.3% and a pass rate of 71.2%. What insights can you derive?\n\n"
    "### Response:\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# (optional) override the default generation settings; temperature/top_p
# only take effect with do_sample=True
outputs = model.generate(**inputs, max_new_tokens=256,
                         do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Files & Variants
| Artifact | Purpose | Notes |
|---|---|---|
| FP16 merged (default) | Ready‑to‑use full model with LoRA merged | Recommended for most users; broad backend support |
| LoRA adapters | Combine with base meta-llama/Meta-Llama-3.1-8B | Smaller download; flexible for further fine‑tuning |
| 4‑bit merged (optional) | Lower footprint | Requires bitsandbytes; not supported by all runtimes (e.g., some TGI/TEI deployments) |

Always distribute the tokenizer and a `generation_config.json` alongside the weights to avoid inference mismatches.
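If you prefer the adapter route, here is a minimal loading sketch with PEFT (`pip install peft`); the adapter repo id below is a placeholder, not a confirmed path:

```python
# Sketch: apply the LoRA adapters to the gated base model with PEFT.
# "matteoangeloni/EduDolphin-lora" is a placeholder id; use the actual adapter repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",  # gated: requires accepting Meta's license
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "matteoangeloni/EduDolphin-lora")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
```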
Training Data
Source: Open University Learning Analytics Dataset (OULAD)
Underlying tables (original OULAD):
- ~173,912 student assessment records
- ~10,655,280 VLE interaction logs
- ~32,593 student demographic profiles
- 6,364 learning material records
- 206 assessment configurations
Prompt dataset (derived from OULAD): 6,215 examples total
- Train: 5,593
- Validation: 622
Categories covered (examples):
- Individual Material Analytics (4,781)
- Weekly Engagement Analytics (878)
- Complex Demographic Analytics (353)
- Granular Performance Analytics (64)
- Submission Timing Analytics (38)
- Click Behavior Analytics (35)
- Learning Journey Analytics (33)
- Registration Timing Analytics (33)
Notes: Data were anonymized/aggregated for prompt construction. No raw personal identifiers are included.
Training Procedure
- Framework: Unsloth + Hugging Face TRL
- Base Model: Llama 3.1 8B
- Finetuning: LoRA
- Epochs: 2
- Batch size (per device): 8
- Gradient Accumulation: 8
- Learning Rate: 2e-5
- Max Seq Len: 1024
- Optimizer: AdamW (8‑bit)
- Speed‑ups: Unsloth (~2× faster)
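For orientation, a hedged sketch of what this setup looks like with Unsloth + TRL; the dataset files, LoRA rank/alpha, and target modules below are assumptions, not the exact training script:

```python
# Sketch only: reconstructs the training run from the hyperparameters above.
# Dataset files and LoRA settings (r, alpha, target modules) are assumptions.
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3.1-8B",
    max_seq_length=1024,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,           # assumed rank; not stated in the card
    lora_alpha=16,  # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Placeholder file names for the 5,593/622 train/validation split.
ds = load_dataset("json", data_files={"train": "train.jsonl", "validation": "val.jsonl"})

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,       # `processing_class` in newer TRL releases
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    dataset_text_field="text",  # moved into SFTConfig in newer TRL
    max_seq_length=1024,
    args=TrainingArguments(
        num_train_epochs=2,
        per_device_train_batch_size=8,
        gradient_accumulation_steps=8,
        learning_rate=2e-5,
        optim="adamw_8bit",
        fp16=True,
        output_dir="outputs",
    ),
)
trainer.train()
```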
Export & Publishing
- Publish FP16 merged as the primary artifact.
- Also publish LoRA adapters for flexibility.
- 4‑bit merged is optional and environment‑sensitive.
- Include `tokenizer/` and `generation_config.json` in each artifact folder.
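A sketch of the merge step with PEFT's `merge_and_unload` (repo ids and the adapter path are placeholders; Unsloth's merged-save utilities are an alternative route):

```python
# Sketch: merge the LoRA adapters into FP16 weights and publish to the Hub.
# Paths and repo ids are placeholders, not the exact release script.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B", torch_dtype=torch.float16
)
merged = PeftModel.from_pretrained(base, "path/to/lora-adapters").merge_and_unload()
merged.push_to_hub("matteoangeloni/EduDolphin", safe_serialization=True)

# Ship the tokenizer alongside the weights (see note in Files & Variants).
AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B").push_to_hub("matteoangeloni/EduDolphin")
```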
Evaluation (Current Status)
No standardized benchmark is reported yet. Internal checks focused on:
- Faithfulness of schema‑aware reasoning over OULAD‑like contexts
- Consistency of recommendations given aggregate statistics
- Stability under temperature variations (0.2–0.9)
Community PRs with rigorous evaluation suites are welcome.
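As an illustration of the temperature check (not the project's actual harness), reusing `model`, `tokenizer`, and `inputs` from the Minimal Example above:

```python
# Illustrative stability probe: generate the same prompt at several temperatures
# and compare outputs by eye. Reuses `model`, `tokenizer`, `inputs` from above.
for temp in (0.2, 0.5, 0.7, 0.9):
    out = model.generate(**inputs, max_new_tokens=128,
                         do_sample=True, temperature=temp, top_p=0.9)
    print(f"--- temperature={temp} ---")
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```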
Ethical Considerations
- Privacy: Use only anonymized/aggregated student data. Comply with GDPR/institutional policies.
- Bias & Fairness: OULAD reflects a specific context; validate insights locally before action.
- Human Oversight: Treat outputs as decision support, not decisions.
- Transparency: Disclose AI assistance in analyses/reports.
Security & Access
- Do NOT hard‑code tokens. Use environment variables (e.g., `HF_TOKEN`) and revoke any exposed token immediately.
- License: Llama 3. Users must accept Meta’s license on the Hub. Consider enabling gated access.
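For example, a minimal sketch reading the token from the environment with `huggingface_hub`:

```python
# Read the access token from the environment instead of hard-coding it.
import os
from huggingface_hub import login

login(token=os.environ["HF_TOKEN"])
```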
How to Cite
```bibtex
@misc{angeloni2024edudolphin,
  title        = {EduDolphin: A Fine-tuned Language Model for Educational Data Analysis},
  author       = {Matteo Angeloni},
  year         = {2024},
  howpublished = {Hugging Face Model Hub},
  url          = {https://huggingface.co/matteoangeloni/EduDolphin}
}
```
Acknowledgments
Thanks to Unsloth for efficient fine‑tuning tooling, Hugging Face TRL for training utilities, and OULAD for the public dataset.
Quick Setup
```bash
pip install --upgrade transformers accelerate
# Optional (for 4-bit merges)
pip install bitsandbytes
```
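If you take the 4‑bit route, a minimal runtime-quantization sketch (the settings below are illustrative defaults, not a tested recipe):

```python
# Sketch: load the FP16 merged model in 4-bit at runtime via bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    "matteoangeloni/EduDolphin",
    quantization_config=bnb,
    device_map="auto",
)
```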