distillabs committed · Commit 9fca2f6 · verified · 1 Parent(s): 2631b02

Update README.md

  1. README.md +81 -1
README.md CHANGED
@@ -47,7 +47,87 @@ Instruction-tuned on a compact policy spec + ~20 curated examples emphasizing **
  Judged by a frontier LLM using a deterministic rubric: JSON-only, schema validity, **redacted_text exact match**, and **set-equality** of `(value, replacement_token)` pairs (reason/order ignored). Score: **0.25 +/- 0.05**.
 
  ## How to Use
- Details of deployment can be found in https://docs.distillabs.ai/how-to/model-deployment
+ Deployment details are in the [docs](https://docs.distillabs.ai/how-to/model-deployment). Deploy the model with vLLM or Ollama (a `-gguf` version is available in this collection), then use the following snippet to get results:
+ ```python
+ SYSTEM_PROMPT = """
+ You are a problem solving model working on task_description XML block:
+ <task_description>
+ Produce a redacted version of texts, removing sensitive personal data while preserving operational signals. The model must return a single json blob with:
+
+ * **redacted_text** is the input with minimal, in-place replacements of redacted entities.
+ * **entities** as an array of objects with exactly three fields {value: original_value, replacement_token: replacement, reason: reasoning}.
+
+ ## What to redact (→ replacement token)
+
+ * **PERSON** — customer/patient/person names (first/last/full; identifying initials) → `[PERSON]`
+ * **EMAIL** — any email, including obfuscated `name(at)domain(dot)com` → `[EMAIL]`
+ * **PHONE** — any international/national format (separators/emoji bullets allowed) → `[PHONE]`
+ * **ADDRESS** — street + number; full postal lines; apartment/unit numbers → `[ADDRESS]`
+ * **SSN** — US Social Security numbers → `[SSN]`
+ * **ID** — national IDs (PESEL, NIN, Aadhaar, DNI, etc.) when personal → `[ID]`
+ * **UUID** — person-scoped system identifiers (e.g., MRN/NHS/patient IDs/customer UUIDs) → `[UUID]`
+ * **CREDIT_CARD** — 13–19 digits (spaces/hyphens allowed) → `[CARD_LAST4:####]` (keep last-4 only)
+ * **IBAN** — IBAN/bank account numbers → `[IBAN_LAST4:####]` (keep last-4 only)
+ * **GENDER** — self-identification (male/female/non-binary/etc.) → `[GENDER]`
+ * **AGE** — stated ages (“I’m 29”, “age: 47”, “29 y/o”) → `[AGE_YEARS:##]`
+ * **RACE** — race/ethnicity self-identification → `[RACE]`
+ * **MARITAL_STATUS** — married/single/divorced/widowed/partnered → `[MARITAL_STATUS]`
+
+
+ ## Keep (do not redact)
+
+ * Card **last-4** when only last-4 is present (e.g., “ending 9021”, “•••• 9021”).
+ * Operational IDs: order/ticket/invoice numbers, shipment tracking, device serials, case IDs.
+ * Non-personal org info: company names, product names, team names.
+ * Cities/countries alone (redact full street+number, not plain city/country mentions).
+
+ ## Output schema (exactly these fields)
+ * **redacted_text** The original text with all the sensitive information replaced with redacted tokens
+ * **entities** Array with all the replaced elements, each element represented by following fields
+ * **replacement_token**: one of `[PERSON] | [EMAIL] | [PHONE] | [ADDRESS] | [SSN] | [ID] | [UUID] | [CREDIT_CARD] | [IBAN] | [GENDER] | [AGE] | [RACE] | [MARITAL_STATUS]`
+ * **value**: original text that was redacted
+ * **reason**: brief string explaining the rule/rationale
+
+ for example
+ {
+ "redacted_text": "Hi, I'm [PERSON] and my email is [EMAIL].",
+ "entities": [
+ { "type": "PERSON", "value": "John Smith", "reason": "person name"},
+ { "type": "EMAIL", "value": "[email protected]", "reason": "email"},
+ ]
+ }
+ </task_description>
+ You will be given a single task with context in the context XML block and the task in the question XML block
+ Solve the task in question block based on the context in context block.
+ Generate only the answer, do not generate anything else
+ """
+
+ PROMPT_TEMPLATE = """
+
+ Now for the real task, solve the task in question block based on the context in context block.
+ Generate only the solution, do not generate anything else
+ <context>
+ {context}
+ </context>
+ <question>Redact provided text according to the task description and return redacted elements.</question>
+ """
+
+ from openai import OpenAI
+
+ PORT = "PORT GOES HERE" # 8000 for vllm, 11434 for ollama
+ MODEL_NAME = "NAME USED FOR SETTING UP THE CLIENT"
+ TEXT_TO_REDACT = "NI number AB123456C confirmed."
+
+ client = OpenAI(base_url=f"http://127.0.0.1:{PORT}/v1", api_key="EMPTY")
+ chat_response = client.chat.completions.create(
+     model=MODEL_NAME,
+     messages=[
+         {"role": "system", "content": SYSTEM_PROMPT},
+         {"role": "user", "content": PROMPT_TEMPLATE.format(context=TEXT_TO_REDACT)},
+     ],
+     temperature=0,
+ )
+ ```
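The snippet above stops at the API call. Below is a minimal sketch of how the reply might be parsed and compared against a reference, loosely following the rubric quoted earlier (JSON-only, schema fields, exact `redacted_text` match, set-equality of `(value, replacement_token)` pairs). The helper names `parse_redaction` and `matches_reference` are hypothetical, and `sample_reply` is a hand-written illustration, not real model output. This is not the LLM judge used for the reported score.

```python
import json

# Field names taken from the "Output schema" section of the system prompt.
REQUIRED_ENTITY_FIELDS = {"value", "replacement_token", "reason"}


def parse_redaction(raw: str) -> dict:
    """Parse a reply and check it has the fields the schema names (hypothetical helper)."""
    result = json.loads(raw)  # raises json.JSONDecodeError if the reply is not pure JSON
    if not isinstance(result, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(result.get("redacted_text"), str):
        raise ValueError("missing or non-string 'redacted_text'")
    entities = result.get("entities")
    if not isinstance(entities, list):
        raise ValueError("missing or non-list 'entities'")
    for entity in entities:
        if not isinstance(entity, dict):
            raise ValueError("each entity must be a JSON object")
        missing = REQUIRED_ENTITY_FIELDS - entity.keys()
        if missing:
            raise ValueError(f"entity is missing fields: {sorted(missing)}")
    return result


def matches_reference(predicted: dict, reference: dict) -> bool:
    """Exact redacted_text match plus set-equality of (value, replacement_token) pairs.

    The reason field and the order of entities are ignored, as in the rubric.
    """
    def pairs(result: dict) -> set:
        return {(e["value"], e["replacement_token"]) for e in result["entities"]}

    return (
        predicted["redacted_text"] == reference["redacted_text"]
        and pairs(predicted) == pairs(reference)
    )


# Hand-written illustrative reply (an assumption, not actual model output).
sample_reply = (
    '{"redacted_text": "NI number [ID] confirmed.", '
    '"entities": [{"value": "AB123456C", "replacement_token": "[ID]", '
    '"reason": "national ID"}]}'
)
parsed = parse_redaction(sample_reply)
print(parsed["redacted_text"])
```

With the client snippet above, the string to pass in would be `chat_response.choices[0].message.content`. The check is deliberately strict about the three entity fields from the schema list. A reply that uses other keys, such as the `type` key in the prompt's inline example, would raise `ValueError` and may need a more lenient check.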
 
 
  ## Risks & Mitigations