# phi3-mini-128k-sft-merged (FP16 • Transformers)
## Model summary
A supervised fine-tune of `microsoft/phi-3-mini-128k-instruct` that acts as a game NPC brain, converting player/context events into strict JSON with:
- `dialog`: one NPC utterance (natural, short, varied)
- `intent`: compact semantic label for downstream logic
- `microplan`: 0–5 lightweight animation/pose hints
Training style (latest): English-only; JSON-only outputs; schema checks during evaluation; scenario-imbalanced synthetic dataset with high phrase diversity and soft augmentations.
Artifacts provided in this repo:
- `merged/` – FP16 Transformers weights (LoRA merged)
- `adapter/` – LoRA adapter (PEFT) for continued SFT
- `gguf/` – llama.cpp GGUF (F16 + Q4_K_M + Q3_K_M)
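To continue SFT from the adapter instead of the merged weights, a minimal loading sketch (assuming `adapter/` is a standard PEFT checkpoint inside this repo):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

BASE_ID = "microsoft/phi-3-mini-128k-instruct"

base = AutoModelForCausalLM.from_pretrained(
    BASE_ID, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
# Attach the LoRA adapter; is_trainable=True keeps it open for further SFT.
model = PeftModel.from_pretrained(
    base,
    "AndriLawrence/phi3-mini-128k-sft-merged",
    subfolder="adapter",
    is_trainable=True,
)
```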
## Quick start (Transformers)
```python
import json

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

MODEL_ID = "AndriLawrence/phi3-mini-128k-sft-merged"

tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
mdl = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
# The model is already loaded and placed on devices,
# so don't pass device_map/torch_dtype to the pipeline again.
pipe = pipeline("text-generation", model=mdl, tokenizer=tok)

system = (
    "You are LLM-1, an NPC brain.\n"
    "Output ONLY strict JSON with keys: dialog, intent, microplan.\n"
    "Start with NPC reply; ≤2 sentences; JSON only."
)

payload = {
    "event": "Player_Says",
    "speech_transcript": "Hi.",
    "environment": {"location": "Room", "time_of_day": "Evening"},
    "world_state": {"zones": ["Room"], "objects": ["desk", "lamp", "note"]},
}

msgs = [
    {"role": "system", "content": system},
    {"role": "user", "content": "CONTEXT:\n" + json.dumps(payload, ensure_ascii=False)},
]
prompt = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)

gen = pipe(
    prompt,
    do_sample=True,
    temperature=0.35,
    top_p=0.95,
    repetition_penalty=1.15,
    max_new_tokens=192,
)[0]["generated_text"]

# Strip the prompt/assistant prefix if present, then parse.
out = gen.split("<|assistant|>")[-1].strip()
print(json.loads(out))  # raises if the output is not valid JSON (by design)
```
## Output schema
```json
{
  "dialog": [
    {"speaker": "npc", "text": "Hey there—good to see you."}
  ],
  "intent": "social_greeting",
  "microplan": ["Smile (0.7)", "Look at player (1.5s)"]
}
```
- `dialog`: exactly one NPC line (3–140 chars); no echoing the player.
- `intent`: one of: `social_greeting`, `light_acknowledge_and_offer_help`, `acknowledge_touch`, `acknowledge_compliment`, `apologize_and_offer_fix`, `calm_reassure`, `encourage_explain`, `invite_follow`, `invite_practice`, `respect_distance`, `small_talk`, `end_conversation_politely`, `react_to_player_action`, `idle_initiative`.
- `microplan`: list of short, engine-agnostic motion cues; can be empty (`[]`).
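Since evaluation used schema checks, the same constraints can gate outputs at runtime. A minimal validator sketch (`validate_npc_output` is a hypothetical helper mirroring the rules above):

```python
ALLOWED_INTENTS = {
    "social_greeting", "light_acknowledge_and_offer_help", "acknowledge_touch",
    "acknowledge_compliment", "apologize_and_offer_fix", "calm_reassure",
    "encourage_explain", "invite_follow", "invite_practice", "respect_distance",
    "small_talk", "end_conversation_politely", "react_to_player_action", "idle_initiative",
}

def validate_npc_output(obj: dict) -> bool:
    """Check dialog/intent/microplan against the schema above."""
    dialog = obj.get("dialog")
    if not (isinstance(dialog, list) and len(dialog) == 1):
        return False
    line = dialog[0]
    if not isinstance(line, dict) or line.get("speaker") != "npc":
        return False
    text = line.get("text", "")
    if not isinstance(text, str) or not (3 <= len(text) <= 140):
        return False
    if obj.get("intent") not in ALLOWED_INTENTS:
        return False
    plan = obj.get("microplan")
    return isinstance(plan, list) and len(plan) <= 5
```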
## Best-Use Guide (Prompts • Parameters • Runtime)

### 1) Best-Use Prompts (ready to use)

#### A. Minimal (strict JSON, lightweight, production)
```text
You are LLM-1 (NPC brain).
Return ONE object of STRICT JSON ONLY with keys:
- "dialog": array of { "speaker": "npc", "text": string } (1–2 items, 3–140 chars each)
- "intent": one of [social_greeting, light_acknowledge_and_offer_help, acknowledge_touch, acknowledge_compliment, apologize_and_offer_fix, calm_reassure, encourage_explain, invite_follow, invite_practice, respect_distance, small_talk, end_conversation_politely, react_to_player_action, idle_initiative]
- "microplan": REQUIRED array (0–5 short steps). Use [] if no action.
Rules:
- Start with NPC reply (no quoting player). Avoid starting with "I'm"/"I am".
- ≤2 sentences; concise & warm.
NOW RESPOND TO THIS CONTEXT:
{CONTEXT_JSON}
OUTPUT:
```
#### B. Hardened (adds hard-mapping + brief few-shots)
```text
You are LLM-1 (creative social responder).
Return ONE object of STRICT JSON ONLY with keys:
- dialog: [{ "speaker": "npc", "text": string }] (1–2 items)
- intent: (allowed set above)
- microplan: REQUIRED array (0–5 steps)
Hard rules:
- If event == "Player_Touches" → intent MUST be "acknowledge_touch".
- If event == "Player_Action" → intent MUST be "react_to_player_action".
- If player's text contains (nice|great|love|beautiful|cool) → intent MUST be "acknowledge_compliment".
- Start text NOT with "I'm" or "I am". No helper clichés ("I'm here to help", etc). JSON only.
FEW-SHOTS
CONTEXT:
{"event":"Player_Says","speech_transcript":"Hi.","environment":{"location":"Room","time_of_day":"Evening"}}
OUTPUT:
{"dialog":[{"speaker":"npc","text":"Hey there—good to see you."}],"intent":"social_greeting","microplan":["Smile (0.7)"]}
CONTEXT:
{"event":"Player_Touches","player_touch":{"type":"Tap","bone":"Shoulder"}}
OUTPUT:
{"dialog":[{"speaker":"npc","text":"Oh—hi there. Did you need something?"}],"intent":"acknowledge_touch","microplan":["Small startle","Recover smile"]}
CONTEXT:
{"event":"Player_Action","action":"pick_up","target":"note"}
OUTPUT:
{"dialog":[{"speaker":"npc","text":"That could be useful—tell me if it’s unclear."}],"intent":"react_to_player_action","microplan":["Glance at item (1s)"]}
NOW RESPOND TO THIS CONTEXT:
{CONTEXT_JSON}
OUTPUT:
```
Replace `{CONTEXT_JSON}` with your game payload (`event`, `speech_transcript`, `environment`, `world_state`, etc.), as in the sketch below.
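For instance, assuming the Minimal prompt above is stored in a Python string `PROMPT_A` (a hypothetical name), injecting the context could look like this:

```python
import json

payload = {
    "event": "Player_Says",
    "speech_transcript": "Nice room.",
    "environment": {"location": "Room", "time_of_day": "Evening"},
    "world_state": {"zones": ["Room"], "objects": ["desk", "lamp", "note"]},
}

# .replace() rather than .format(): the templates contain literal JSON braces.
prompt_with_context = PROMPT_A.replace(
    "{CONTEXT_JSON}", json.dumps(payload, ensure_ascii=False)
)
```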
### 2) Best-Use Parameters (three presets)
| Preset | Purpose | Params (Ollama/Transformers) |
|---|---|---|
| STRICT | Maximum JSON compliance & intent mapping | temperature=0.0, top_p=0.9, repetition_penalty=1.05–1.15, num_ctx=2048 |
| BALANCED | Small style variation, still stable | temperature=0.35, top_p=0.85, repetition_penalty=1.05, num_ctx=2048 |
| CREATIVE | More expressive (use a fallback) | temperature=0.7, top_p=0.9, repetition_penalty=1.15, num_ctx=2048 |
Practical advice:
- Production → start with STRICT, then allow BALANCED for safe scenes.
- If high variation is needed, use CREATIVE with a retry (see guards below).
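In code, the presets can live in one lookup table. A sketch; the sampler values mirror the table above, while `num_ctx` applies on the Ollama side rather than as a Transformers kwarg:

```python
PRESETS = {
    "STRICT":   dict(temperature=0.0,  top_p=0.9,  repetition_penalty=1.05),
    "BALANCED": dict(temperature=0.35, top_p=0.85, repetition_penalty=1.05),
    "CREATIVE": dict(temperature=0.7,  top_p=0.9,  repetition_penalty=1.15),
}
```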
### 3) Best-Use Runtime Guards (production-safe)

#### A. Retry policy
```python
import json

def try_parse_json(text: str):
    """Return the parsed object, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except (TypeError, json.JSONDecodeError):
        return None

def gen_with_retry(call_fn, prompt):
    # Try BALANCED first, then fall back to STRICT decoding.
    cfgs = [
        dict(temperature=0.35, top_p=0.85, repetition_penalty=1.05),  # BALANCED
        dict(temperature=0.0, top_p=0.9, repetition_penalty=1.05),    # STRICT (fallback)
    ]
    for opt in cfgs:
        out = call_fn(prompt, **opt)
        obj = try_parse_json(out)
        if obj:
            return obj, opt
    raise RuntimeError("Model did not return valid JSON after retries.")
```
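`call_fn` is left abstract above; with the Transformers pipeline from the quick start, one possible wrapper (a sketch) is:

```python
def call_fn(prompt, temperature, top_p, repetition_penalty):
    """Drive the quick-start pipeline; temperature == 0.0 maps to greedy decoding."""
    kwargs = dict(
        max_new_tokens=192,
        repetition_penalty=repetition_penalty,
        return_full_text=False,  # return only the completion, not the prompt
    )
    if temperature > 0:
        kwargs.update(do_sample=True, temperature=temperature, top_p=top_p)
    else:
        kwargs["do_sample"] = False  # STRICT preset: greedy
    return pipe(prompt, **kwargs)[0]["generated_text"].strip()
```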
#### B. Intent router (post-decode cleanup for critical mappings)
```python
def intent_router(ctx, obj):
    """Force hard-mapped intents from the event type, then enforce minimal shape."""
    ev = ctx.get("event")
    text = (ctx.get("speech_transcript") or "").lower()
    if ev == "Player_Touches":
        obj["intent"] = "acknowledge_touch"
    elif ev == "Player_Action":
        obj["intent"] = "react_to_player_action"
    elif any(w in text for w in ["nice", "great", "love", "beautiful", "cool"]):
        obj["intent"] = "acknowledge_compliment"
    # Enforce minimal format.
    if not isinstance(obj.get("microplan"), list):
        obj["microplan"] = []
    if not obj.get("dialog") or obj["dialog"][0].get("speaker") != "npc":
        obj["dialog"] = [{"speaker": "npc", "text": "Noted."}]
    return obj
```
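Per turn, the two guards compose naturally (reusing `call_fn`, `prompt`, and the context `payload` from earlier):

```python
obj, used_cfg = gen_with_retry(call_fn, prompt)
obj = intent_router(payload, obj)
```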
### 4) Example Call (Ollama /api/generate)
```bash
# jq safely escapes the multi-line prompt into the JSON body; naive shell
# interpolation breaks as soon as the prompt contains quotes or newlines.
curl -s http://localhost:11434/api/generate \
  -d "$(jq -n --arg p "$PROMPT_WITH_CONTEXT" '{
    model: "phi3sft:latest",
    prompt: $p,
    stream: false,
    format: "json",
    options: { temperature: 0.35, top_p: 0.85, repeat_penalty: 1.05, num_ctx: 2048, stop: ["<|end|>"] }
  }')"
```
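Note that `"format": "json"` asks Ollama to constrain decoding to valid JSON, which complements the retry guard above; the intent router is still worth running, since the format constraint does not enforce the hard intent mappings.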
### 5) Quick-Choice Summary
- Need maximum safety? → STRICT.
- Need to stay natural? → BALANCED.
- Need extra flavor? → CREATIVE + retry + router.
With the presets above, our tests show 100% valid JSON and stable latency across all three; the main difference is the rate of policy mismatches (wrong intent mapping). BALANCED is the sweet spot between sentence variation and intent compliance.
## Dataset
- Language: English-only
- Format: ChatML JSONL with `messages: [system, user, assistant]`
- Assistant targets: strict JSON (`dialog`/`intent`/`microplan`)
- Scenario mix (default, normalized): greet 0.18, explore 0.18, compliment 0.12, touch 0.10, action_pickup 0.20, proximity 0.12, idle 0.10
- Diversity: rich phrase banks per intent, soft augmentations (punctuation, optional truncation), banned starts/phrases to avoid clichés, varied microplan length
Example training row:
{"messages":[
{"role":"system","content":"You are LLM-1... JSON only ..."},
{"role":"user","content":"CONTEXT:\n{\"event\":\"Player_Says\",\"speech_transcript\":\"Nice room.\",\"environment\":{\"location\":\"Room\"}}"},
{"role":"assistant","content":"{\"dialog\":[{\"speaker\":\"npc\",\"text\":\"Thanks—glad it feels that way.\"}],\"intent\":\"acknowledge_compliment\",\"microplan\":[\"Smile (0.7)\"]}"}
]}
## Training (SFT + LoRA)
- Base: `microsoft/phi-3-mini-128k-instruct`
- PEFT: LoRA (attention-focused, extendable to MLP)
- Precision: FP16
- Typical config: `r ∈ {16…96}`, `lora_alpha ∈ {32…192}`, `lora_dropout = 0.05`
- SFT epochs: 2–4; effective batch ≥ 4 with gradient accumulation
- Eval decoding: `temperature=0.35, top_p=0.95, repetition_penalty=1.15`
- Artifacts: both merged FP16 weights and the LoRA adapter are provided, supporting either direct use or continued fine-tuning. A representative LoRA configuration is sketched below.
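A PEFT configuration within those ranges might look like this sketch; the exact `target_modules` used for this checkpoint are an assumption based on the attention-focused note (Phi-3 uses fused `qkv_proj`/`gate_up_proj` projections):

```python
from peft import LoraConfig

lora_cfg = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Attention-focused; extend with ["gate_up_proj", "down_proj"] to cover the MLP.
    target_modules=["qkv_proj", "o_proj"],
)
```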
## llama.cpp / GGUF
`gguf/` includes:
- `model-f16.gguf` (converted from merged FP16)
- `model-Q4_K_M.gguf` (quantized)
- `model-Q3_K_M.gguf` (quantized)
Example Modelfile (Ollama-style):

```text
FROM <your_gguf_path>
TEMPLATE """<|system|>
{{ .System }}
<|end|>
<|user|>
{{ .Prompt }}
<|end|>
<|assistant|>
"""
PARAMETER num_ctx 8192
PARAMETER temperature 0.35
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1.15
PARAMETER stop "<|end|>"
```
Structure your system/user prompts exactly as in the Transformers example and request JSON-only output.
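With the file saved as `Modelfile`, register and run the model via `ollama create phi3sft -f Modelfile` followed by `ollama run phi3sft`.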
## License
Derivative of microsoft/phi-3-mini-128k-instruct. Please review Microsoft’s license and usage policy. Your fine-tuned weights, adapters, and GGUF exports remain subject to upstream terms.