Update README.md
README.md
CHANGED
|
@@ -47,7 +47,87 @@ Instruction-tuned on a compact policy spec + ~20 curated examples emphasizing **
|
|
| 47 |
Judged by a frontier LLM using a deterministic rubric: JSON-only, schema validity, **redacted_text exact match**, and **set-equality** of `(value, replacement_token)` pairs (reason/order ignored). Score: **0.25 +/- 0.05**.
|
| 48 |
|
| 49 |
## How to Use
|
| 50 |
-
Details of deployment can be found in https://docs.distillabs.ai/how-to/model-deployment
|
| 51 |
|
| 52 |
|
| 53 |
## Risks & Mitigations
|
|
|
|
| 47 |
Judged by a frontier LLM using a deterministic rubric: JSON-only, schema validity, **redacted_text exact match**, and **set-equality** of `(value, replacement_token)` pairs (reason/order ignored). Score: **0.25 +/- 0.05**.
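The four rubric checks can also be approximated in plain code. The sketch below is an assumption about how such a check might look, not the judge actually used for the reported score; the function name `rubric_pass` and the shape of the `reference` argument are hypothetical.

```python
import json


def rubric_pass(prediction: str, reference: dict) -> bool:
    """Return True if a raw model output passes all four rubric checks.

    `reference` is assumed to be a dict with the same schema as the
    expected model output (hypothetical shape, for illustration only).
    """
    # 1. JSON-only: the whole output must parse as JSON.
    try:
        pred = json.loads(prediction)
    except json.JSONDecodeError:
        return False

    # 2. Schema validity: exactly the two top-level fields, and exactly
    #    three string fields per entity.
    if not isinstance(pred, dict) or set(pred) != {"redacted_text", "entities"}:
        return False
    if not isinstance(pred["entities"], list):
        return False
    required = {"value", "replacement_token", "reason"}
    for entity in pred["entities"]:
        if not isinstance(entity, dict) or set(entity) != required:
            return False
        if not all(isinstance(entity[key], str) for key in required):
            return False

    # 3. redacted_text must match the reference exactly.
    if pred["redacted_text"] != reference["redacted_text"]:
        return False

    # 4. Set-equality of (value, replacement_token) pairs;
    #    reason and order are ignored.
    def pairs(entities):
        return {(e["value"], e["replacement_token"]) for e in entities}

    return pairs(pred["entities"]) == pairs(reference["entities"])
```

Because the comparison uses sets, two outputs that list the same entities in a different order, or with different `reason` strings, are treated as equal.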
|
| 48 |
|
| 49 |
## How to Use
|
| 50 |
+
Deployment details are in the [docs](https://docs.distillabs.ai/how-to/model-deployment). Deploy the model with vLLM or Ollama (a `-gguf` version is available in this collection), then use the following snippet to get results:
|
| 51 |
+
```python
|
| 52 |
+
SYSTEM_PROMPT = """
|
| 53 |
+
You are a problem solving model working on task_description XML block:
|
| 54 |
+
<task_description>
|
| 55 |
+
Produce a redacted version of texts, removing sensitive personal data while preserving operational signals. The model must return a single json blob with:
|
| 56 |
+
|
| 57 |
+
* **redacted_text** is the input with minimal, in-place replacements of redacted entities.
|
| 58 |
+
* **entities** as an array of objects with exactly three fields {value: original_value, replacement_token: replacement, reason: reasoning}.
|
| 59 |
+
|
| 60 |
+
## What to redact (→ replacement token)
|
| 61 |
+
|
| 62 |
+
* **PERSON**: customer/patient/person names (first/last/full; identifying initials) → `[PERSON]`
|
| 63 |
+
* **EMAIL**: any email, including obfuscated `name(at)domain(dot)com` → `[EMAIL]`
|
| 64 |
+
* **PHONE**: any international/national format (separators/emoji bullets allowed) → `[PHONE]`
|
| 65 |
+
* **ADDRESS**: street + number; full postal lines; apartment/unit numbers → `[ADDRESS]`
|
| 66 |
+
* **SSN**: US Social Security numbers → `[SSN]`
|
| 67 |
+
* **ID**: national IDs (PESEL, NIN, Aadhaar, DNI, etc.) when personal → `[ID]`
|
| 68 |
+
* **UUID**: person-scoped system identifiers (e.g., MRN/NHS/patient IDs/customer UUIDs) → `[UUID]`
|
| 69 |
+
* **CREDIT_CARD**: 13–19 digits (spaces/hyphens allowed) → `[CARD_LAST4:####]` (keep last-4 only)
|
| 70 |
+
* **IBAN**: IBAN/bank account numbers → `[IBAN_LAST4:####]` (keep last-4 only)
|
| 71 |
+
* **GENDER**: self-identification (male/female/non-binary/etc.) → `[GENDER]`
|
| 72 |
+
* **AGE**: stated ages (“I’m 29”, “age: 47”, “29 y/o”) → `[AGE_YEARS:##]`
|
| 73 |
+
* **RACE**: race/ethnicity self-identification → `[RACE]`
|
| 74 |
+
* **MARITAL_STATUS**: married/single/divorced/widowed/partnered → `[MARITAL_STATUS]`
|
| 75 |
+
|
| 76 |
+
|
| 77 |
+
## Keep (do not redact)
|
| 78 |
+
|
| 79 |
+
* Card **last-4** when only last-4 is present (e.g., “ending 9021”, “•••• 9021”).
|
| 80 |
+
* Operational IDs: order/ticket/invoice numbers, shipment tracking, device serials, case IDs.
|
| 81 |
+
* Non-personal org info: company names, product names, team names.
|
| 82 |
+
* Cities/countries alone (redact full street+number, not plain city/country mentions).
|
| 83 |
+
|
| 84 |
+
## Output schema (exactly these fields)
|
| 85 |
+
* **redacted_text** The original text with all the sensitive information replaced with redacted tokens
|
| 86 |
+
* **entities** Array with all the replaced elements, each element represented by following fields
|
| 87 |
+
* **replacement_token**: one of `[PERSON] | [EMAIL] | [PHONE] | [ADDRESS] | [SSN] | [ID] | [UUID] | [CREDIT_CARD] | [IBAN] | [GENDER] | [AGE] | [RACE] | [MARITAL_STATUS]`
|
| 88 |
+
* **value**: original text that was redacted
|
| 89 |
+
* **reason**: brief string explaining the rule/rationale
|
| 90 |
+
|
| 91 |
+
for example
|
| 92 |
+
{
|
| 93 |
+
"redacted_text": "Hi, I'm [PERSON] and my email is [EMAIL].",
|
| 94 |
+
"entities": [
|
| 95 |
+
{ "type": "PERSON", "value": "John Smith", "reason": "person name"},
|
| 96 |
+
{ "type": "EMAIL", "value": "[email protected]", "reason": "email"},
|
| 97 |
+
]
|
| 98 |
+
}
|
| 99 |
+
</task_description>
|
| 100 |
+
You will be given a single task with context in the context XML block and the task in the question XML block
|
| 101 |
+
Solve the task in question block based on the context in context block.
|
| 102 |
+
Generate only the answer, do not generate anything else
|
| 103 |
+
"""
|
| 104 |
+
|
| 105 |
+
PROMPT_TEMPLATE = """
|
| 106 |
+
|
| 107 |
+
Now for the real task, solve the task in question block based on the context in context block.
|
| 108 |
+
Generate only the solution, do not generate anything else
|
| 109 |
+
<context>
|
| 110 |
+
{context}
|
| 111 |
+
</context>
|
| 112 |
+
<question>Redact provided text according to the task description and return redacted elements.</question>
|
| 113 |
+
"""
|
| 114 |
+
|
| 115 |
+
from openai import OpenAI
|
| 116 |
+
|
| 117 |
+
PORT = "PORT GOES HERE" # 8000 for vllm, 11434 for ollama
|
| 118 |
+
MODEL_NAME = "NAME USED FOR SETTING UP THE CLIENT"
|
| 119 |
+
TEXT_TO_REDACT = "NI number AB123456C confirmed."
|
| 120 |
+
|
| 121 |
+
client = OpenAI(base_url=f"http://127.0.0.1:{PORT}/v1", api_key="EMPTY")
|
| 122 |
+
chat_response = client.chat.completions.create(
|
| 123 |
+
model=MODEL_NAME,
|
| 124 |
+
messages=[
|
| 125 |
+
{"role": "system", "content": SYSTEM_PROMPT},
|
| 126 |
+
{"role": "user", "content": PROMPT_TEMPLATE.format(context=TEXT_TO_REDACT)},
|
| 127 |
+
],
|
| 128 |
+
temperature=0,
|
| 129 |
+
)
|
| 130 |
+
```
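The snippet above stops at the API call. With the OpenAI client, the generated text is available as `chat_response.choices[0].message.content`. A minimal sketch of reading that string and doing a basic sanity check follows; the helper names `parse_redaction` and `leaked_values` are hypothetical, and the `sample` string is a hand-written illustration of the expected shape, not real model output.

```python
import json


def parse_redaction(raw: str) -> dict:
    """Parse the model's JSON answer and check the two top-level fields."""
    result = json.loads(raw)
    if not isinstance(result, dict):
        raise ValueError("expected a JSON object")
    missing = {"redacted_text", "entities"} - result.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return result


def leaked_values(result: dict) -> list:
    """Return entity values that still appear verbatim in redacted_text."""
    values = [e.get("value") for e in result["entities"] if isinstance(e, dict)]
    return [
        v for v in values
        if isinstance(v, str) and v and v in result["redacted_text"]
    ]


# Hand-written sample in the documented schema (assumption, not model output).
sample = (
    '{"redacted_text": "NI number [ID] confirmed.", '
    '"entities": [{"value": "AB123456C", "replacement_token": "[ID]", '
    '"reason": "national ID"}]}'
)
result = parse_redaction(sample)
print(result["redacted_text"])
print(leaked_values(result))
```

In real use, pass `chat_response.choices[0].message.content` to `parse_redaction` instead of `sample`. A non-empty list from `leaked_values` suggests the redaction missed an occurrence and the output should be reviewed before it is stored or forwarded.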
|
| 131 |
|
| 132 |
|
| 133 |
## Risks & Mitigations
|