phi3-mini-128k-sft-merged (FP16 • Transformers)

Model summary

A supervised fine-tune of microsoft/phi-3-mini-128k-instruct that acts as a game NPC brain, converting player/context events into a strict JSON object with:

  • dialog: one NPC utterance (natural, short, varied)
  • intent: compact semantic label for downstream logic
  • microplan: 0–5 lightweight animation/pose hints

Training style (latest): English-only; JSON-only outputs; schema checks during evaluation; scenario-imbalanced synthetic dataset with high phrase diversity and soft augmentations.

Artifacts provided in this repo:

  • merged/ – FP16 Transformers weights (LoRA merged)
  • adapter/ – LoRA adapter (PEFT) for continued SFT
  • gguf/ – llama.cpp GGUF (F16 + Q4_K_M + Q3_K_M)

Quick start (Transformers)

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import json

MODEL_ID = "AndriLawrence/phi3-mini-128k-sft-merged"

tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
mdl = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True
)
pipe = pipeline("text-generation", model=mdl, tokenizer=tok)  # device/dtype already set on mdl

system = (
  "You are LLM-1, an NPC brain.\n"
  "Output ONLY strict JSON with keys: dialog, intent, microplan.\n"
  "Start with NPC reply; ≤2 sentences; JSON only."
)

payload = {
  "event": "Player_Says",
  "speech_transcript": "Hi.",
  "environment": {"location": "Room", "time_of_day": "Evening"},
  "world_state": {"zones": ["Room"], "objects": ["desk", "lamp", "note"]}
}

msgs = [
  {"role":"system","content": system},
  {"role":"user","content": "CONTEXT:\n" + json.dumps(payload, ensure_ascii=False)}
]
prompt = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)

gen = pipe(
  prompt,
  do_sample=True,
  temperature=0.35,
  top_p=0.95,
  repetition_penalty=1.15,
  max_new_tokens=192
)[0]["generated_text"]

# strip assistant prefix if present and parse
out = gen.split("<|assistant|>")[-1].strip()
print(json.loads(out))  # will raise if not valid JSON (by design)

Output schema

{
  "dialog": [
    {"speaker": "npc", "text": "Hey there—good to see you."}
  ],
  "intent": "social_greeting",
  "microplan": ["Smile (0.7)", "Look at player (1.5s)"]
}
  • dialog: exactly one NPC line (3–140 chars); no echoing the player.
  • intent: one of: social_greeting, light_acknowledge_and_offer_help, acknowledge_touch, acknowledge_compliment, apologize_and_offer_fix, calm_reassure, encourage_explain, invite_follow, invite_practice, respect_distance, small_talk, end_conversation_politely, react_to_player_action, idle_initiative
  • microplan: list of short, engine-agnostic motion cues; can be empty ([]).
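
Before handing the object to game logic, it is worth validating it against the schema above. A minimal validator sketch (the helper name is hypothetical; the bounds mirror the schema description):

ALLOWED_INTENTS = {
    "social_greeting", "light_acknowledge_and_offer_help", "acknowledge_touch",
    "acknowledge_compliment", "apologize_and_offer_fix", "calm_reassure",
    "encourage_explain", "invite_follow", "invite_practice", "respect_distance",
    "small_talk", "end_conversation_politely", "react_to_player_action", "idle_initiative",
}

def validate_npc_output(obj):
    # Check dialog / intent / microplan against the schema described above.
    dialog = obj.get("dialog")
    if not isinstance(dialog, list) or not dialog:
        return False
    for line in dialog:
        if not isinstance(line, dict) or line.get("speaker") != "npc":
            return False
        text = line.get("text")
        if not isinstance(text, str) or not 3 <= len(text) <= 140:
            return False
    if obj.get("intent") not in ALLOWED_INTENTS:
        return False
    microplan = obj.get("microplan")
    return isinstance(microplan, list) and len(microplan) <= 5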

Best-Use Guide (Prompts • Parameters • Runtime)

1) Best-Use Prompts (ready to use)

A. Minimal (strict JSON, lightweight, production)

You are LLM-1 (NPC brain).
Return ONE object of STRICT JSON ONLY with keys:
- "dialog": array of { "speaker": "npc", "text": string } (1–2 items, 3–140 chars each)
- "intent": one of [social_greeting, light_acknowledge_and_offer_help, acknowledge_touch, acknowledge_compliment, apologize_and_offer_fix, calm_reassure, encourage_explain, invite_follow, invite_practice, respect_distance, small_talk, end_conversation_politely, react_to_player_action, idle_initiative]
- "microplan": REQUIRED array (0–5 short steps). Use [] if no action.

Rules:
- Start with NPC reply (no quoting player). Avoid starting with "I'm"/"I am".
- ≤2 sentences; concise & warm.

NOW RESPOND TO THIS CONTEXT:
{CONTEXT_JSON}
OUTPUT:

B. Hardened (adds hard-mapping + brief few-shots)

You are LLM-1 (creative social responder).
Return ONE object of STRICT JSON ONLY with keys:
- dialog: [{ "speaker": "npc", "text": string }] (1–2 items)
- intent: (allowed set above)
- microplan: REQUIRED array (0–5 steps)

Hard rules:
- If event == "Player_Touches"  → intent MUST be "acknowledge_touch".
- If event == "Player_Action"   → intent MUST be "react_to_player_action".
- If player's text contains (nice|great|love|beautiful|cool) → intent MUST be "acknowledge_compliment".
- Start text NOT with "I'm" or "I am". No helper clichés ("I'm here to help", etc). JSON only.

FEW-SHOTS
CONTEXT:
{"event":"Player_Says","speech_transcript":"Hi.","environment":{"location":"Room","time_of_day":"Evening"}}
OUTPUT:
{"dialog":[{"speaker":"npc","text":"Hey there—good to see you."}],"intent":"social_greeting","microplan":["Smile (0.7)"]}

CONTEXT:
{"event":"Player_Touches","player_touch":{"type":"Tap","bone":"Shoulder"}}
OUTPUT:
{"dialog":[{"speaker":"npc","text":"Oh—hi there. Did you need something?"}],"intent":"acknowledge_touch","microplan":["Small startle","Recover smile"]}

CONTEXT:
{"event":"Player_Action","action":"pick_up","target":"note"}
OUTPUT:
{"dialog":[{"speaker":"npc","text":"That could be useful—tell me if it’s unclear."}],"intent":"react_to_player_action","microplan":["Glance at item (1s)"]}

NOW RESPOND TO THIS CONTEXT:
{CONTEXT_JSON}
OUTPUT:

Replace {CONTEXT_JSON} with your game payload (event, speech_transcript, environment, world_state, etc.).
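
One way to do the substitution from Python (a sketch; npc_prompt_minimal.txt is a hypothetical file holding prompt A or B verbatim). Note that str.format would trip over the literal braces in the JSON few-shots, so a plain replace on the placeholder is safer:

import json

with open("npc_prompt_minimal.txt") as f:
    template = f.read()

payload = {
    "event": "Player_Says",
    "speech_transcript": "Hi.",
    "environment": {"location": "Room", "time_of_day": "Evening"},
}

# Splice the game payload into the {CONTEXT_JSON} placeholder.
prompt_with_context = template.replace("{CONTEXT_JSON}", json.dumps(payload, ensure_ascii=False))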

2) Best-Use Parameters (three presets)

Settings apply to both Ollama and Transformers:

  • STRICT – maximum JSON compliance & intent mapping: temperature=0.0, top_p=0.9, repetition_penalty=1.05–1.15, num_ctx=2048
  • BALANCED – small style variation, still stable: temperature=0.35, top_p=0.85, repetition_penalty=1.05, num_ctx=2048
  • CREATIVE – more expressive (pair with a fallback): temperature=0.2, top_p=0.9, repetition_penalty=1.15, num_ctx=2048

Practical advice:

  • Production → start with STRICT, then allow BALANCED for safe scenes.
  • If high variation is needed, use CREATIVE with a retry (see guards below).
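
To keep the presets switchable at runtime, a simple mapping works (a sketch; values come from the table above, with temperature=0.0 expressed as greedy decoding since Transformers rejects sampling at temperature zero):

PRESETS = {
    "STRICT":   dict(do_sample=False, repetition_penalty=1.05),  # temperature=0.0 == greedy
    "BALANCED": dict(do_sample=True, temperature=0.35, top_p=0.85, repetition_penalty=1.05),
    "CREATIVE": dict(do_sample=True, temperature=0.2,  top_p=0.9,  repetition_penalty=1.15),
}

out = pipe(prompt, max_new_tokens=192, **PRESETS["BALANCED"])[0]["generated_text"]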

3) Best-Use Runtime Guards (production-safe)

A. Retry policy

import json

def try_parse_json(text):
    # Return the parsed object, or None if the text is not valid JSON.
    try:
        return json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return None

def gen_with_retry(call_fn, prompt):
    # Try BALANCED first; fall back to STRICT if the output does not parse.
    cfgs = [
        dict(temperature=0.35, top_p=0.85, repetition_penalty=1.05),  # BALANCED
        dict(temperature=0.0,  top_p=0.9,  repetition_penalty=1.05),  # STRICT (fallback)
    ]
    for opt in cfgs:
        out = call_fn(prompt, **opt)
        obj = try_parse_json(out)
        if obj is not None:
            return obj, opt
    raise RuntimeError("Model did not return valid JSON after retries.")

B. Intent router (post-decode cleanup for critical mappings)

def intent_router(ctx, obj):
    # Post-decode cleanup: hard intent mappings must hold regardless of model output.
    ev   = ctx.get("event")
    text = (ctx.get("speech_transcript") or "").lower()
    if ev == "Player_Touches":
        obj["intent"] = "acknowledge_touch"
    elif ev == "Player_Action":
        obj["intent"] = "react_to_player_action"
    elif any(w in text for w in ["nice", "great", "love", "beautiful", "cool"]):
        obj["intent"] = "acknowledge_compliment"

    # Enforce the minimal output format.
    if not isinstance(obj.get("microplan"), list):
        obj["microplan"] = []
    if not obj.get("dialog") or obj["dialog"][0].get("speaker") != "npc":
        obj["dialog"] = [{"speaker": "npc", "text": "Noted."}]
    return obj
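
Wiring the guards together might look like this (a sketch: call_fn wraps the pipe object from the quick start; in Transformers, temperature=0.0 corresponds to greedy decoding, i.e. do_sample=False):

def call_fn(prompt, temperature, top_p, repetition_penalty):
    # Thin wrapper so gen_with_retry can swap sampling configs per attempt.
    sample = temperature > 0
    kwargs = dict(do_sample=sample, repetition_penalty=repetition_penalty, max_new_tokens=192)
    if sample:
        kwargs.update(temperature=temperature, top_p=top_p)
    out = pipe(prompt, **kwargs)[0]["generated_text"]
    return out.split("<|assistant|>")[-1].strip()

obj, used_cfg = gen_with_retry(call_fn, prompt)
obj = intent_router(payload, obj)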

4) Example Call (Ollama /api/generate)

curl -s http://localhost:11434/api/generate -d '{
  "model": "phi3sft:latest",
  "prompt": "'"$(printf "%s" "$PROMPT_WITH_CONTEXT")"'",
  "stream": false,
  "format": "json",
  "options": { "temperature": 0.35, "top_p": 0.85, "repeat_penalty": 1.05, "num_ctx": 2048, "stop": ["<|end|>"] }
}'
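
The same call from Python with requests (a sketch; when stream is false, Ollama returns the generated text in the response field, and prompt_with_context is the filled template from section 1):

import json
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "phi3sft:latest",
        "prompt": prompt_with_context,
        "stream": False,
        "format": "json",
        "options": {"temperature": 0.35, "top_p": 0.85, "repeat_penalty": 1.05,
                    "num_ctx": 2048, "stop": ["<|end|>"]},
    },
    timeout=60,
)
obj = json.loads(resp.json()["response"])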

5) Quick-Choice Summary

  • Need maximum safety? → STRICT.
  • Need to stay natural? → BALANCED.
  • Need extra flavor? → CREATIVE + retry + router.

With the presets above, our tests show 100% valid JSON and stable latency across all three; the main difference is the rate of policy mismatches (intents that break the hard-mapping rules). BALANCED is the sweet spot between sentence variation and intent compliance.


Dataset

  • Language: English-only
  • Format: ChatML JSONL with messages: [system, user, assistant]
  • Assistant targets: strict JSON (dialog/intent/microplan)
  • Scenario mix (default) (normalized):
    • greet 0.18, explore 0.18, compliment 0.12, touch 0.10, action_pickup 0.20, proximity 0.12, idle 0.10
  • Diversity:
    • Rich phrase banks per intent, soft augmentations (punctuation, optional truncation), banned starts/phrases to avoid clichés, varied microplan length.

Example training row:

{"messages":[
  {"role":"system","content":"You are LLM-1... JSON only ..."},
  {"role":"user","content":"CONTEXT:\n{\"event\":\"Player_Says\",\"speech_transcript\":\"Nice room.\",\"environment\":{\"location\":\"Room\"}}"},
  {"role":"assistant","content":"{\"dialog\":[{\"speaker\":\"npc\",\"text\":\"Thanks—glad it feels that way.\"}],\"intent\":\"acknowledge_compliment\",\"microplan\":[\"Smile (0.7)\"]}"}
]}
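
To sanity-check rows before training, render them with the tokenizer's chat template (a sketch, assuming a local train.jsonl in the format above):

import json
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/phi-3-mini-128k-instruct", trust_remote_code=True)

with open("train.jsonl") as f:
    row = json.loads(next(f))

# Render the full conversation (including the assistant target) as one training string.
print(tok.apply_chat_template(row["messages"], tokenize=False))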

Training (SFT + LoRA)

  • Base: microsoft/phi-3-mini-128k-instruct
  • PEFT: LoRA (attention-focused, extendable to MLP)
  • Precision: FP16
  • Typical config:
    • r∈{16…96}, lora_alpha∈{32…192}, lora_dropout=0.05
    • SFT epochs: 2–4, effective batch ≥ 4 with grad accumulation
    • temperature=0.35, top_p=0.95, repetition_penalty=1.15 (eval)
  • Artifacts: we provide both merged FP16 and adapter to support either direct use or continued finetuning.
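
To continue SFT from the adapter instead of the merged weights, loading it with PEFT should look roughly like this (a sketch; the subfolder argument assumes the adapter lives under adapter/ in this repo, as listed above):

from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-3-mini-128k-instruct",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
# is_trainable=True keeps the LoRA weights trainable for further SFT.
model = PeftModel.from_pretrained(
    base,
    "AndriLawrence/phi3-mini-128k-sft-merged",
    subfolder="adapter",
    is_trainable=True,
)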

llama.cpp / GGUF

gguf/ includes:

  • model-f16.gguf (converted from merged FP16)
  • model-Q4_K_M.gguf (quantized)
  • model-Q3_K_M.gguf (quantized)

Example Modelfile (Ollama-style):

FROM <your_gguf_path>

TEMPLATE """<|system|>
{{ .System }}
<|end|>
<|user|>
{{ .Prompt }}
<|end|>
<|assistant|>
"""

PARAMETER num_ctx 8192
PARAMETER temperature 0.35
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1.15
PARAMETER stop "<|end|>"

Format the system/user prompts exactly as in the Transformers example and request JSON-only output.
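
To register the Modelfile, run ollama create phi3sft -f Modelfile; the model name then matches the /api/generate example in section 4.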


License

Derivative of microsoft/phi-3-mini-128k-instruct. Please review Microsoft’s license and usage policy. Your fine-tuned weights, adapters, and GGUF exports remain subject to upstream terms.
