1Teng committed on
Commit
e635e9d
·
verified ·
1 Parent(s): 7194898

Update README.md

  1. README.md +161 -216
README.md CHANGED
@@ -1,6 +1,5 @@
1
  ---
2
  language:
3
- - en
4
  - zh
5
  license: other
6
  pipeline_tag: text-generation
@@ -12,275 +11,221 @@ tags:
12
  - fp16
13
  - context-4096
14
  ---
 
15
 
16
- # HaS_4.0_0.6B_q0f16-MLC
17
 
18
- HaS is a data de-identification model developed by Tencent Xuanwu Lab. Its core goal is to identify sensitive entities in text (such as names, addresses, and phone numbers) and securely replace them with standardized anonymization tags, preserving privacy while maintaining the original structure and semantic coherence. HaS supports 22 languages including Chinese/English/French/Japanese/Korean, and can be deployed server-side for high accuracy or client-side for lightweight operation. This repository provides the client-side lightweight form of the HaS (Hide and Seek) solution: deployable artifacts (weight shards, tokenizer, and inference library) based on Qwen3 0.6B compiled by MLC, which can participate locally/in the browser/on mobile in HaS pipeline subtasks such as NER/HIDE/PAIR/SEEK. The model uses the ChatML conversation template and FP16 precision (q0f16), with a 4096 context window.
19
 
20
- HaS Overview:
21
- - Objective: Identify and anonymize sensitive entities such as names, addresses, and phone numbers, while maximizing preservation of original structure and semantic coherence.
22
- - Languages: Supports 22 languages including Chinese, English, French, Japanese, and Korean.
23
- - Model forms: Client-side 0.6B/1B aiming for low latency (this repo is the 0.6B client-side form).
24
 
25
- For more on the pipeline, tag standards, and prompt templates, see “HaS Model Functions & Usage (Summary)” below.
26
 
27
- ---
28
-
29
- ## Directory Structure
30
 
31
- `mlc-chat-config.json`: MLC chat model config (template, default sampling values, context size, etc.)
32
- `tokenizer.json`, `tokenizer_config.json`, `vocab.json`, `merges.txt`, `added_tokens.json`: Tokenizer-related files
33
- `ndarray-cache.json`: Weight shard index and verification info
34
- `params_shard_*.bin`: Weight shards (FP16)
35
- `configuration.json`: General metadata (no manual changes needed)
36
 
37
- ---
38
 
39
- ## Model Specs
40
 
41
- - Model type: Qwen3 (`model_type: qwen3`)
42
- - Approx parameter size: 0.6B (total weights ~1.19 GB FP16)
43
- - Precision & quantization: FP16 (`quantization: q0f16`, BitsPerParam = 16)
44
- - Architecture:
45
- - Layers `num_hidden_layers`: 28
46
- - Hidden size `hidden_size`: 1024
47
- - Intermediate size `intermediate_size`: 3072 (SwiGLU/SiLU activation)
48
- - Attention heads `num_attention_heads`: 16, KV heads `num_key_value_heads`: 8
49
- - Vocab size `vocab_size`: 151,936
50
- - Context window: 4096 (`context_window_size`)
51
- - Conversation template: ChatML (`conv_template: chatml`)
52
 
53
- > Note: `tokenizer_config.json` may set `model_max_length` greater than the runtime window, but actual inference follows `context_window_size` in `mlc-chat-config.json`.
54
 
55
- ---
56
 
57
- ## Quick Start (Local Inference)
58
 
59
- Below uses the MLC LLM CLI, assuming you are at the root of this repository.
60
 
61
- ### 1) Install Dependencies
62
 
63
- - Python 3.10+ is recommended. Install MLC LLM and runtimes:
64
 
65
- ```bash
66
- pip install -U mlc-llm mlc-ai
67
- # For latest features, you can try: pip install --pre -U mlc-llm-nightly mlc-ai-nightly
68
- ```
69
 
70
- > Install the appropriate backend (Metal/CUDA/Vulkan/CPU) for your machine following MLC’s official documentation. The `model.so` included here targets macOS/Metal.
71
 
72
- ### 2) Chat (CLI)
73
 
74
- ```bash
75
- mlc_llm chat --model resolve/main
76
- # On Apple Silicon devices, to explicitly specify:
77
- # mlc_llm chat --model resolve/main --device metal
78
- ```
79
 
80
- Enter the interactive interface and type directly in Chinese or English to chat.
81
 
82
- ### 3) Start Local Server (Optional)
83
 
84
- ```bash
85
- mlc_llm serve --model resolve/main --host 127.0.0.1 --port 8000
86
  ```
87
-
88
- Refer to the documentation for your installed `mlc-llm` version for specific API details and optional parameters.
89
-
90
- ---
91
-
92
- ## ChatML Prompt Format
93
-
94
- This model uses ChatML; conversations consist of `system`/`user`/`assistant` roles and delimiters. If you construct raw prompts directly (bypassing chat frontends), refer to:
95
-
96
- ```
97
- <|im_start|>system
98
- You are a helpful and honest AI assistant.<|im_end|>
99
- <|im_start|>user
100
- Hello, please introduce yourself.<|im_end|>
101
- <|im_start|>assistant
102
  ```
103
 
104
- - Messages start with `<|im_start|>role`, end with `<|im_end|>`, followed by a newline.
105
- - Generation continues from the last `assistant` line and stops at `<|im_end|>` or stop tokens.
106
- - You can adjust the default `system_message`, stop tokens, and sampling parameters in `resolve/main/mlc-chat-config.json` (back up before editing).
107
-
108
- ---
109
-
110
- ## Resources & Performance Tips
111
-
112
- - Weights are ~1.2 GB (FP16); runtime also requires KV cache and temporary buffer memory.
113
- - On Apple Silicon, at least 8 GB unified memory is recommended. If OOM occurs:
114
- - Reduce `context_window_size` or input length
115
- - Lower `max_new_tokens` / sampling temperature
116
- - Disable parallel sessions or reduce batch size
117
-
118
- ---
119
-
120
- ## Processing Pipeline
121
-
122
- The end-to-end pipeline consists of multiple subtasks:
123
-
124
- - NER: Extract specified entity categories and contents from input text.
125
- - HIDE: Replace sensitive entities with semantic isomorphic tags to achieve anonymization.
126
- - SPLIT: Split composite tags to ensure one-to-one correspondence.
127
- - PAIR: Build the mapping between “tag ↔ original entity”.
128
- - SEEK: Restore tags back to the original text based on the mapping.
129
-
130
- ### Typical Applications
131
-
132
- - Cross-border data compliance circulation: Provide compliant anonymization for outbound training data.
133
- - Real-time conversation privacy protection: Dynamically remove names, addresses, keys, etc., in chatbots.
134
- - Data cleaning and proactive protection: Automated privacy cleaning for multilingual business data.
135
-
136
- ### Context Length (By Task)
137
-
138
- Server-side HaS family typically supports up to 128K context for NER/HIDE/PAIR/SEEK. Client-side lightweight models follow `context_window_size` in this repo’s `mlc-chat-config.json` (4096).
139
-
140
- ### Privacy Type Specification
141
-
142
- Three ways to specify types via the `Specified types` field in prompts:
143
-
144
- 1) All types:
145
  ```
146
- Specified types: all
 
 
 
147
  ```
148
- 2) Specific types:
 
 
 
149
  ```
150
- Specified types: Type1,Type2,...
151
  ```
152
- 3) Emphasized types (in addition to “all”, force replacement):
 
153
  ```
154
- Specified types: all including Type1,Type2,...
 
 
 
155
  ```
 
156
 
157
- ### Semantic Isomorphic Tags (HaS 4.0)
158
-
159
- - Tag format: `<EntityType[Index].Category.Attribute>`
160
- - The same index within a type refers to the same entity; indices across types are unrelated.
161
- - Category and attribute use the same language as the “entity type”; within the same input they must be coherent and consistent.
162
- - Entities can be hierarchical, composite, and feature-based; different features determine category and attribute values.
163
 
164
- Replacement principles (excerpt):
165
 
166
- - Preserve semantic completeness and use longest match; process only specified types with consistent standards; aside from entity replacement, keep other text (including punctuation/spaces/fullwidth/halfwidth) intact.
167
- - If the input contains angle-bracketed text that is not a standardized tag, keep it as normal text.
168
- - Pronouns/generic titles are not replaced by default; if part of a complete expression, replace along with the whole.
169
- - For hierarchy/inclusion conflicts, prioritize the specified types set and NER granularity; reuse the same index as needed to maintain coreference consistency.
 
 
 
170
 
171
- ### Prompt Templates (Dialog Form; strictly follow the format)
172
 
173
- Named Entity Recognition (NER)
 
174
  ```
175
- [
176
- {
177
- "conversations": [
178
- {
179
- "from": "human",
180
- "value": "Recognize the following entity types in the text.\nSpecified types:[\"Type1\",\"Type2\",...]\n<text>{content}</text>"
181
- },
182
- { "from": "gpt", "value": "{NER result}" }
183
- ]
184
- }
185
  ]
186
  ```
187
 
188
- Privacy Anonymization (HIDE)
189
  ```
190
- [
191
- {
192
- "conversations": [
193
- { "from": "human", "value": "Recognize the following entity types in the text.\nSpecified types:[\"Type1\",\"Type2\",...]\n<text>{content}</text>" },
194
- { "from": "gpt", "value": "{NER result}" },
195
- { "from": "human", "value": "Replace the above-mentioned entity types in the text." },
196
- { "from": "gpt", "value": "{Hide result}" }
197
- ]
198
- }
199
- ]
200
  ```
 
 
201
 
202
- Model Splitting (SPLIT)
 
203
  ```
204
- [
205
- {
206
- "conversations": [
207
- { "from": "human", "value": "Split each composite anonymized key into atomic keys.\nComposite mapping:\n{\"tag_id_1tag_id_2\":[\"entity_1entity_2\"]}" },
208
- { "from": "gpt", "value": "{\"tag_id_1\":[\"entity_1\"],\"tag_id_2\":[\"entity_2\"]}" }
209
- ]
210
- }
211
  ]
212
  ```
213
 
214
- Entity Pairing (PAIR)
215
  ```
216
- [
217
- {
218
- "conversations": [
219
- { "from": "human", "value": "<original>{content}</original>\n<anonymized>{Hide result}</anonymized>\nExtract the mapping from anonymized entities to original entities." },
220
- { "from": "gpt", "value": "{Pair result}" }
221
- ]
222
- }
223
- ]
224
  ```
 
225
 
226
- Entity Restoration (SEEK)
 
227
  ```
228
- [
229
- {
230
- "conversations": [
231
- { "from": "human", "value": "The mapping from anonymized entities to original entities:\n{Pair result}\nRestore the original text based on the above mapping:\n{Deepseek result}" },
232
- { "from": "gpt", "value": "{Seek result}" }
233
- ]
234
- }
235
  ]
236
  ```
237
 
238
- Anonymization with History (Leverage historical alignment consistency)
239
  ```
240
- {
241
- "conversations": [
242
- {
243
- "from": "human",
244
- "value": "Recognize the following entity types in the text.\nSpecified types:[\"Type1\",\"Type2\",…]\n<text>{content}</text>"
245
- },
246
- { "from": "gpt", "value": "{NER result}" },
247
  {
248
- "from": "human",
249
- "value": "Replace the above-mentioned entity types in the text according to the existing mapping pairs:{pair_history}"
250
- },
251
- { "from": "gpt", "value": "{hide}" }
252
- ]
253
- }
254
  ```
255
-
256
- > Newlines, punctuation, and field names in the templates must be strictly followed. Do not change them casually (only substitute `{content}` and the type list).
257
-
258
- ### Virtual Tag Inference Engine (System Prompt)
259
-
260
- To prevent the chat model from ignoring tag semantics, declare at the System level:
261
-
262
- - Tag format: `<EntityType[ID].Category.Attribute>`; the same ID within a type refers to the same entity, while IDs across types are unrelated.
263
- - Core principles: Tag placeholders contain no real names; once provided, mappings persist within the session; response priority follows “original → original + mapping → necessary refusal/insufficient information”; text transformation tasks should not refuse due to missing mappings; strictly avoid fabricating precise numbers or non-public information.
264
- - Refusal categories: Identity-based refusal (requires real identity but no mapping), and insufficient-information refusal (mapping exists but still not enough info).
265
- - Task guidance: Q&A, translation, polishing, summarization, sentiment/classification, etc., should rely only on “original text + existing mappings + necessary public common knowledge”.
266
-
267
- ### Diff Algorithm & Model Fallback
268
-
269
- For pairing between “entity ↔ tag”, a Diff-like text difference can be used for fast alignment, greatly shortening pairing and restoration latency. Introduce self-check and model fallback for edge cases: if self-check detects incorrect pairing, fall back to model-based pairing/restoration to ensure correctness.
270
-
271
- ### HaS FAQ
272
-
273
- - Must I strictly follow the template’s newlines/indentation/punctuation?
274
- - Yes. Only `content` and the type list may be substituted; the format must not be changed. A template self-check tool will be provided later.
275
- - In batch data, can the same entity word be replaced with the same target across entries?
276
- - Yes. Add the specified mapping “entity → target” to the template’s corresponding position and provide it to HaS 3.0/4.0 to achieve consistent replacement across multiple inputs.
277
-
278
- ---
279
-
280
- ## License & Sources
281
-
282
- - This repository contains only compiled artifacts and configs for inference; it does not include training code.
283
- - Model weights and their licenses follow the upstream model (Qwen3) and your authorization terms. For commercial use, read and comply with upstream licenses.
284
- - Use and redistribution of MLC/TVM must follow their respective open-source licenses.
285
-
286
- ---
 
1
  ---
2
  language:
 
3
  - zh
4
  license: other
5
  pipeline_tag: text-generation
 
11
  - fp16
12
  - context-4096
13
  ---
14
+ # HaS_4.0_0.6B
15
 
16
+ ## 1. Introduction to HaS
17
 
18
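+ HaS (Hide And/Annotate Seek) is a new-generation data de-identification model designed for interacting with large language models (LLMs). HaS can serve as the basis for orchestrating an Agentic Workflow de-identification system, as shown in Figure 1. Table 1 compares it with existing privacy-protection techniques: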
19
 
20
+ ![image](Table1.png)
 
 
 
21
 
22
+ <center>Table 1. Comparison of technical approaches</center>
23
 
24
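+ Here, “semantic preservation” means that de-identified entities still retain their original semantic information; “information restoration” means that de-identified entities can be restored at a later presentation stage; “open-set specification” means that users can freely specify the types to de-identify, in natural language, at the time of use; “coreference resolution” means that de-identified entities still retain the coreference relations they had with other entities before de-identification; “multi-turn dialogue” means continuous de-identification and restoration across a multi-turn conversation; and “intent recognition” means that the system can dynamically supply additional information while following the principle of minimal disclosure of private information.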
 
 
25
 
26
+ ![image](Figure1.png)
 
 
 
 
27
 
28
+ <center>Figure 1. HaS Agentic Workflow</center>
29
 
30
+ ## 2. Structured Semantic Tags
31
 
32
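+ Structured semantic tags are an anonymization technique pioneered by HaS, as shown in Figure 2. As an LLM-friendly way of representing entities, they are both one of HaS's core techniques and the key to its coreference-resolution capability.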
33
 
34
+ ![image](Table2.png)
35
 
36
+ <center>Figure 2. Structured semantic tags</center>
37
 
38
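+ As the figure shows, we believe the core reason LLMs can understand these structured semantic tags efficiently is that they are exposed to massive amounts of code during pretraining. That code commonly contains syntactic structures that are highly isomorphic to the tag format, such as same-class ID indexing in object-oriented programming and modular, hierarchical namespaces. This structural similarity lets an LLM transfer the pattern-recognition ability it learned from code, so it can pick up the tag scheme and reason with it from only a few examples (few-shot) or none at all (zero-shot). The following example shows a prompt before and after tag-based de-identification: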
39
 
40
+ Prompt before de-identification:
41
+ > CloudGenius的联系人是谁?
42
+ > **合同编号:** SAAS-2024-Q3-8801
43
+ > **甲方 (服务提供方):** 云创智能有限公司(以下简称云创智能或CloudGenius)
44
+ > 注册地址: 北京市朝阳区望京科技园A座18层
45
+ > 联系人: 李红(Li Hong)
46
+ > 电话: 010-8888-9999
47
+ > **乙方 (客户):** 繁星贸易集团(Star Trading Group,以下简称繁星或STG)
48
+ > 注册地址: 上海市黄浦区南京东路100号
49
+ > 联系人: 张英建(Zhang Yingjian)
50
+ > 邮箱: [email protected]
51
+ > **1. 服务内容**
52
+ > 云创智能向STG提供名为智销云(SmartSales Cloud)的客户关系管理(CRM)软件服务。服务等级协议(Service Level Agreement, SLA)保证99.95%的在线时间。云创智能将确保该CRM系统的稳定运行。
53
+ > **2. 协议期限**
54
+ > 本协议自2024年8月1日起生效,至2026年7月31日终止,为期两年。任何一方如需提前终止,需提前90天书面通知对方。CloudGenius和繁星贸易均应遵守此条款。
55
+ > **3. 费用与支付**
56
+ > STG同意支付订阅费用,总计人民币贰佰万元整 (¥2,000,000.00/RMB 2 million)。费用需分两次支付:协议生效后15个工作日内支付50%,即100万元;第二年服务开始前支付剩余50%。支付账户为云创智能在中国工商银行(ICBC)开设的账户,账号为6222020200012345678。
57
+ > **4. 保密条款**
58
+ > CloudGenius和STG同意,对在履行本协议过程中获悉的对方商业秘密和技术信息予以保密。此保密义务在本协议终止后持续有效,直至该信息进入公共领域。
59
 
60
+ Prompt after de-identification:
61
+ > <组织[001].企业.英文名>的联系人是谁?
62
+ > **合同编号:** <编号[001].合同编号.代码>
63
+ > **甲方 (服务提供方):** <组织[001].企业.完整名称>(以下简称<组织[001].企业.简称>或<组织[001].企业.英文名>)
64
+ > 注册地址: <地点[001].办公地址.完整地址>
65
+ > 联系人: <人物[001].个人.姓名>(<人物[001].个人.拼音名>)
66
+ > 电话: <电话[001].固定电话.号码>
67
+ > **乙方 (客户):** <组织[002].企业.完整名称>(<组织[002].企业.英文名>,以下简称<组织[002].企业.简称>或<组织[002].企业.英文缩写>)
68
+ > 注册地址: <地点[002].办公地址.完整地址>
69
+ > 联系人: <人物[002].个人.姓名>(<人物[002].个人.拼音名>)
70
+ > 邮箱: <邮箱[001].个人邮箱.地址>
71
+ > **1. 服务内容**
72
+ > <组织[001].企业.简称>向<组织[002].企业.英文缩写>提供名为<产品/服务[001].软件.中文名>(<产品/服务[001].软件.英文名>)的<产品/服务[002].软件服务.完整描述>。<产品/服务[003].协议.中文名>(<产品/服务[003].协议.英文名>, <产品/服务[003].协议.英文缩写>)保证<百分比[001].服务指标.数值>的在线时间。<组织[001].企业.简称>将确保该<产品/服务[002].软件服务.简称>的稳定运行。
73
+ > **2. 协议期限**
74
+ > 本协议自<日期/时间[001].具体日期.年月日>起生效,至<日期/时间[002].具体日期.年月日>终止,为期两年。任何一方如需提前终止,需提前<日期/时间[003].时间段.天数>书面通知对方。<组织[001].企业.英文名>和繁星贸易均应遵守此条款。
75
+ > **3. 费用与支付**
76
+ > <组织[002].企业.英文缩写>同意支付订阅费用,总计人民币<金额[001].合同金额.中文大写> (<金额[001].合同金额.数字符号>/<金额[001].合同金额.英文表述>)。费用需分两次支付:协议生效后<日期/时间[004].时间段.工作日数>内支付<百分比[002].支付比例.数值>,即<金额[002].支付金额.中文数字>;第二年服务开始前支付剩余<百分比[002].支付比例.数值>。支付账户为<组织[001].企业.简称>在<组织[003].金融机构.完整名称>(<组织[003].金融机构.英文缩写>)开设的账户,账号为<编号[002].银行账号.号码>。
77
+ > **4. 保密条款**
78
+ > <组织[001].企业.英文名>和<组织[002].企业.英文缩写>同意,对在履行本协议过程中获悉的对方商业秘密和技术信息予以保密。此保密义务在本协议终止后持续有效,直至该信息进入公共领域。
79
 
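The tag format in the example above is regular enough to parse mechanically. The following is a minimal sketch, assuming tags always take the form `<EntityType[Index].Category.Attribute>` seen in the example; the regular expression and the helper name `group_by_entity` are illustrative and not part of HaS. It groups tags by entity type and index, which is how one entity keeps a single identity across its different surface forms.

```python
import re
from collections import defaultdict

# Assumed tag shape, inferred from the example: <Type[Index].Category.Attribute>
TAG_RE = re.compile(r"<([^<>\[\]]+)\[(\d+)\]\.([^<>.]+)\.([^<>.]+)>")


def group_by_entity(text):
    """Map (entity type, index) -> sorted list of (category, attribute) pairs."""
    groups = defaultdict(set)
    for etype, index, category, attribute in TAG_RE.findall(text):
        groups[(etype, int(index))].add((category, attribute))
    return {key: sorted(values) for key, values in groups.items()}


if __name__ == "__main__":
    sample = "<组织[001].企业.简称>向<组织[002].企业.英文缩写>提供服务,<组织[001].企业.英文名>负责运维"
    for entity, forms in group_by_entity(sample).items():
        print(entity, forms)
```

In this sketch the short name and the English name of organization 001 land in the same group, while organization 002 stays separate. Plain angle-bracketed text such as `<ORGANIZATION>` does not match the pattern and is ignored.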
80
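+ Compared with other anonymization schemes such as Presidio and Freysa, structured semantic tags not only preserve hierarchical semantic information and coreference relations, but also meet the requirements that relevant regulations place on data authenticity and quality.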
81
 
82
+ ![image](Table3.png)
 
 
 
83
 
84
+ <center>Figure 3. Comparison of anonymization schemes</center>
85
 
86
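+ As Figure 3 shows, both of the other entity-handling approaches have fundamental flaws. Methods based on entity-type labels (such as <ORGANIZATION>) cannot distinguish between multiple entities of the same type in the source text, which directly causes coreference relations to be lost and the model to fail to answer. Pseudonym substitution has three problems: first, generating false information violates compliance principles; second, substitution degrades semantic accuracy and therefore answer quality; finally, unstructured pseudonyms make subsequent programmatic parsing and tool-based processing difficult.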
87
 
88
+ ## 3. Using HaS
 
 
 
 
89
 
90
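+ ### Method 1 (most convenient): search the Chrome or Edge extension store for the preview release of the “HaS脱敏” (HaS de-identification) extension, install it, and use it directly in the DeepSeek and ChatGPT web apps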
91
 
92
+ ### Method 2 (most customizable): download the model weights, deploy the HaS model yourself, pair it with tools, and orchestrate an Agentic Workflow
93
 
94
+ ### 1) Using the ner capability to recognize sensitive entities
95
+ Prompt template:
96
  ```
97
+ messages = [
98
+ {
99
+ "role": "user",
100
+ "content": "Recognize the following entity types in the text.\nSpecified types:[\"组织\",\"地址\",\"人名\"]\n<text>本报西安5月25日电(记者温庆生通讯员裴超)为深入学习贯彻习主席关于国防和军队建设重要论述,借鉴外军人力资源管理有益经验,破解我军军事人力资源政策制度调整改革难题,由解放军西安政治学院举办的“外军人力资源管理研究与借鉴”理论研讨会日前在西安召开。来自总部有关部门、大专院校、科研单位的70余位专家学者参加会议。会议以“借鉴外军人力资源管理有益经验,推进我军人力资源政策制度改革”为主题,围绕外军人力资源管理基本理论、制度设计、运行机理及有益做法进行了研讨交流,并对我军人力资源政策制度调整改革提出了一系列具有重要参考价值的对策建议。\n烦请替我撰写摘要。</text>"
101
+ }
102
+ ]
103
  ```
104
 
105
+ Expected output:
106
  ```
107
+ {
108
+ "role": "assistant",
109
+ "content": "{\"组织\":[\"解放军西安政治学院\"],\"地址\":[\"西安\"],\"人名\":[\"记者温庆生\",\"通讯员裴超\"]}"
110
+ }
111
  ```
112
+ Note: the entity types listed under Specified types can be chosen freely.
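
The NER prompt above can be assembled programmatically. The following is a minimal sketch, assuming the exact wording and compact JSON list shown in the template; the helper names `build_ner_messages` and `parse_ner_reply` are hypothetical and not part of any HaS or MLC API.

```python
import json

NER_INSTRUCTION = "Recognize the following entity types in the text."


def build_ner_messages(types, text):
    """Build the single-turn NER prompt in the format shown above."""
    # Compact JSON list, non-ASCII kept as-is, e.g. ["组织","地址","人名"]
    type_list = json.dumps(types, ensure_ascii=False, separators=(",", ":"))
    content = f"{NER_INSTRUCTION}\nSpecified types:{type_list}\n<text>{text}</text>"
    return [{"role": "user", "content": content}]


def parse_ner_reply(reply_content):
    """Parse the assistant reply: a JSON object mapping each type to a list of entities."""
    result = json.loads(reply_content)
    if not isinstance(result, dict):
        raise ValueError("expected a JSON object mapping types to entity lists")
    return result


if __name__ == "__main__":
    messages = build_ner_messages(["组织", "地址", "人名"], "记者温庆生在西安报道。")
    print(messages[0]["content"])
```

How `messages` is sent to the model depends on your deployment and is deliberately left out here.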
113
+
114
+ ### 2) Using the hide capability for tag-based de-identification (without de-identification history)
115
+ Prompt template:
116
  ```
117
+ messages = [
118
+ {
119
+ "role": "user",
120
+ "content": "Recognize the following entity types in the text.\nSpecified types:[\"人名\",\"联系方式\",\"职务\",\"密码\",\"组织\",\"地址\",\"文件\",\"账号\"]\n<text>(原标题:山东菏泽单县村民自建房屋坍塌已致4人遇难) 单县一在建民房发生坍塌事故 菏泽市政府副市长王忠想看望伤员齐鲁网10月25日菏泽讯24日下午16时许,单县谢集镇白寨行政村一村民在自建房屋时,突然发生坍塌事故,致12人不同程度受伤。事发后,当地有关部门和周边群众一起迅速展开救援,并将伤者及时送往附近医院救治。截至24日23时,4人经抢救无效死亡,1人伤势较重正在全力救治中,其余7人伤情较轻,正在医院观察治疗。当地警方已介入调查事故原因,善后工作正在进行。\n\n提取人名、组织、时间、地点,以 JSON 返回</text>"
121
+ },
122
+ {
123
+ "role": "assistant",
124
+ "content": "{\"人名\":[\"王忠想\"],\"联系方式\":[],\"职务\":[\"菏泽市政府副市长\"],\"密码\":[],\"组织\":[\"菏泽市政府\",\"齐鲁网\"],\"地址\":[\"单县谢集镇白寨行政村\"],\"文件\":[],\"账号\":[]}"
125
+ },
126
+ {
127
+ "role": "user",
128
+ "content": "Replace the above-mentioned entity types in the text."
129
+ }
130
+ ]
131
  ```
132
+
133
+ Expected output:
134
  ```
135
+ {
136
+ "role": "assistant",
137
+ "content": "(原标题:山东菏泽单县村民自建房屋坍塌已致4人遇难) 单县一在建民房发生坍塌事故 <职务[1].政府职务.完整职务><人名[1].中文姓名.完整姓名>看望伤员<组织[1].媒体.完整名称>10月25日菏泽讯24日下午16时许,<地址[1].行政区.完整地址>一村民在自建房屋时,突然发生坍塌事故,致12人不同程度受伤。事发后,当地有关部门和周边群众一起迅速展开救援,并将伤者及时送往附近医院救治。截至24日23时,4人经抢救无效死亡,1人伤势较重正在全力救治中,其余7人伤情较轻,正在医院观察治疗。当地警方已介入调查事故原因,善后工作正在进行。\n\n提取人名、组织、时间、地点,以 JSON 返回"
138
+ }
139
  ```
140
+ Note: the assistant content in the prompt template is taken from the output of the first-round ner call.
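
The three-turn hide prompt chains directly off the NER round. The following is a minimal sketch, assuming the NER user message and the model's NER reply are already available as strings; `build_hide_messages` is a hypothetical helper name.

```python
HIDE_INSTRUCTION = "Replace the above-mentioned entity types in the text."


def build_hide_messages(ner_user_content, ner_reply_content):
    """Replay the NER turn, then ask the model to replace the recognized entities."""
    return [
        {"role": "user", "content": ner_user_content},
        # Taken verbatim from the first-round NER output, as the note above says.
        {"role": "assistant", "content": ner_reply_content},
        {"role": "user", "content": HIDE_INSTRUCTION},
    ]


if __name__ == "__main__":
    ner_prompt = (
        "Recognize the following entity types in the text.\n"
        'Specified types:["人名"]\n<text>王忠想看望伤员</text>'
    )
    for message in build_hide_messages(ner_prompt, '{"人名":["王忠想"]}'):
        print(message["role"], "->", message["content"])
```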
141
 
 
 
 
 
 
 
142
 
143
+ ### 3) Using the hide capability for tag-based de-identification (with de-identification history)
144
+ Prompt template:
145
+ ```
146
+ messages = [
147
+ {
148
+ "role": "user",
149
+ "content": "Recognize the following entity types in the text.\nSpecified types:[\"作品\",\"人名\"]\n<text>《梦想咨客》讲述的是一群怀揣着热情服务人群的中国年轻创业青年,为驴友介绍中国各地不同的民族风情和地理风貌。动画中主角们不同的性格碰撞,形成的一串串乌龙和笑料化为了创业路上的点点欢笑。每一集讲述的是一个以主角“胡妈”作为核心的乐骋旅行社遇到的疑难杂症,但奇思与努力让这些困难迎刃而解,以其特有的乐骋精神传播正能量。\n麻烦把这段中文内容翻译成英文。</text>"
150
+ },
151
+ {
152
+ "role": "assistant",
153
+ "content": "{\"作品\":[\"《梦想咨客》\"],\"人名\":[\"胡妈\"]}"
154
+ },
155
+ {
156
+ "role": "user",
157
+ "content": "Replace the above-mentioned entity types in the text according to the existing mapping pairs:{\"<作品[1].动画作品.片名>\":[\"《泉城水大碗茶》\",\"【奇趣视界】\"],\"<作品[2].动画作品.片名>\":[\"《梦想咨客》\"],\"<作品[3].动画作品.片名>\":[\"穿越时空的对话\"],\"<作品[4].动画作品.片名>\":[\"《宝岛一村》上剧场专属版\"],\"<人名[1].中文姓名.本名>\":[\"胡妈\"]}"
158
+ }
159
+ ]
160
+ ```
161
 
162
+ Expected output:
163
+ ```
164
+ {
165
+ "role": "assistant",
166
+ "content": "<作品[2].动画作品.片名>讲述的是一群怀揣着热情服务人群的中国年轻创业青年,为驴友介绍中国各地不同的民族风情和地理风貌。动画中主角们不同的性格碰撞,形成的一串串乌龙和笑料化为了创业路上的点点欢笑。每一集讲述的是一个以主角“<人名[1].中文姓名.本名>”作为核心的乐骋旅行社遇到的疑难杂症,但奇思与努力让这些困难迎刃而解,以其特有的乐骋精神传播正能量。\n麻烦把这段中文内容翻译成英文。"
167
+ }
168
+ ```
169
 
170
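+ Note: the assistant content in the prompt template is taken from the output of the first-round ner call; the existing mapping pairs are the mapping pairs accumulated over the conversation history.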
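
Accumulating mapping pairs across turns can be sketched as a small dictionary merge. This is an illustration under stated assumptions: the history is kept as a `tag -> list of originals` dictionary in the same shape as the template, and `merge_mappings` and `build_hide_with_history_turn` are hypothetical names, not HaS functions.

```python
import json

HIDE_WITH_HISTORY = (
    "Replace the above-mentioned entity types in the text "
    "according to the existing mapping pairs:"
)


def merge_mappings(history, new_pairs):
    """Merge tag -> [originals] mappings, keeping order and dropping duplicates."""
    merged = {tag: list(values) for tag, values in history.items()}
    for tag, values in new_pairs.items():
        bucket = merged.setdefault(tag, [])
        for value in values:
            if value not in bucket:
                bucket.append(value)
    return merged


def build_hide_with_history_turn(history):
    """Build the final user turn that carries the accumulated mapping pairs."""
    pairs = json.dumps(history, ensure_ascii=False, separators=(",", ":"))
    return {"role": "user", "content": HIDE_WITH_HISTORY + pairs}


if __name__ == "__main__":
    history = {"<人名[1].中文姓名.本名>": ["胡妈"]}
    history = merge_mappings(history, {"<作品[2].动画作品.片名>": ["《梦想咨客》"]})
    print(build_hide_with_history_turn(history)["content"])
```

The returned turn replaces the plain "Replace the above-mentioned entity types in the text." message of the no-history variant; the first two turns stay the same.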
171
 
172
+ ### 4) Using the pair capability to extract the tag mapping
173
+ Prompt template:
174
  ```
175
+ messages = [
176
+ {
177
+ "role": "user",
178
+ "content": "<original>请帮我提升一下整体表述。\n\n\n1989年10月27日上午莫斯科时间九点,苏联在哈萨克共和国的萨雷奥泽克试验场销毁了它拥有的九百五十七枚中短程导弹中的最后一批导弹。苏军第一副总参谋长奥梅利切夫上将对塔斯社记者宣布上述消息时说,27日销毁的最后一枚中短程导弹是西方所称的ss·23导弹,射程五百公里,是八十年代初部署的。 关注更多精彩:香港财富俱乐部(微信公号:hkfortuneclub) 业务合作,请直接留言(请留下联络方式及微信号)</original>\n<anonymized>请帮我提升一下整体表述。\n\n\n<日期/时间[1].绝对时间.完整表达>,苏联在哈萨克共和国的萨雷奥泽克试验场销毁了它拥有的<数字[1].数量.完整表达>中短程导弹中的最后一批导弹。<人名[1].军方职务.完整称谓>对塔斯社记者宣布上述消息时说,<日期/时间[1].日期.日>销毁的最后一枚中短程导弹是西方所称的<导弹型号[1].型号.完整名称>,射程<数字[2].距离.完整表达>,是<日期/时间[2].年代.时期>部署的。 关注更多精彩:香港财富俱乐部(微信公号:<微信公号[1].账号.用户名>) 业务合作,请直接留言(请留下联络方式及微信号)</anonymized>\nExtract the mapping from anonymized entities to original entities."
179
+ }
 
 
 
 
 
180
  ]
181
  ```
182
 
183
+ Expected output:
184
  ```
185
+ {
186
+ "role": "assistant",
187
+ "content": "{\"<日期/时间[1].绝对时间.完整表达>\":[\"1989年10月27日上午莫斯科时间九点\"],\"<数字[1].数量.完整表达>\":[\"九百五十七枚\"],\"<人名[1].军方职务.完整称谓>\":[\"苏军第一副总参谋长奥梅利切夫上将\"],\"<日期/时间[1].日期.日>\":[\"27日\"],\"<导弹型号[1].型号.完整名称>\":[\"ss·23导弹\"],\"<数字[2].距离.完整表达>\":[\"五百公里\"],\"<日期/时间[2].年代.时期>\":[\"八十年代初\"],\"<微信公号[1].账号.用户名>\":[\"hkfortuneclub\"]}"
188
+ }
 
 
 
 
 
 
189
  ```
190
+ Note: the anonymized content in the prompt template is taken from the output of hide.
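
The pair prompt and a simple sanity check on its result can be sketched as follows. The check is an assumption of this example rather than a documented HaS step: it only verifies that every returned tag occurs in the anonymized text and every returned entity occurs in the original text. Both helper names are hypothetical.

```python
PAIR_INSTRUCTION = "Extract the mapping from anonymized entities to original entities."


def build_pair_messages(original, anonymized):
    """Build the single-turn pair prompt in the format shown above."""
    content = (
        f"<original>{original}</original>\n"
        f"<anonymized>{anonymized}</anonymized>\n"
        f"{PAIR_INSTRUCTION}"
    )
    return [{"role": "user", "content": content}]


def check_pairs(original, anonymized, pairs):
    """Return a list of problems found in a tag -> [originals] mapping."""
    problems = []
    for tag, values in pairs.items():
        if tag not in anonymized:
            problems.append(f"tag not found in anonymized text: {tag}")
        for value in values:
            if value not in original:
                problems.append(f"entity not found in original text: {value}")
    return problems


if __name__ == "__main__":
    original = "联系人: 李红"
    anonymized = "联系人: <人物[001].个人.姓名>"
    print(build_pair_messages(original, anonymized)[0]["content"])
    print(check_pairs(original, anonymized, {"<人物[001].个人.姓名>": ["李红"]}))
```

An empty problem list does not prove the mapping is right; it only rules out tags or entities that cannot have come from the two texts.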
191
+
192
 
193
+ ### 5) Using the split capability to decompose composite tags
194
+ Prompt template:
195
  ```
196
+ messages = [
197
+ {
198
+ "role": "user",
199
+ "content": "Split each composite anonymized key into atomic keys.\nComposite mapping:\n{\"<职务[3].职务.职务名称><人名[1].中文姓名.姓名>\": [\"五星村党总支部书记黄丽萍\"], \"<地址[2].行政村.名称><职务[5].职务.职务名称>\": [\"五星村保崩村民小组经济社社长\"]}"
200
+ }
 
 
201
  ]
202
  ```
203
 
204
+ Expected output:
205
  ```
206
+ {
207
+ "role": "assistant",
208
+ "content": "{\"<职务[3].职务.职务名称>\": [\"党总支部书记\"], \"<人名[1].中文姓名.姓名>\": [\"黄丽萍\"], \"<地址[2].行政村.名称>\": [\"五星村\"], \"<职务[5].职务.职务名称>\": [\"经济社社长\"]}"
209
+ }
 
 
 
 
210
  ```
211
+ Note: the Composite mapping in the prompt template consists of the consecutive tags extracted from the pair output.
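
Picking the composite keys out of a pair result can be done with a tag pattern. This is a minimal sketch, assuming a key is composite when it contains more than one tag of the form `<Type[Index].Category.Attribute>`; the regular expression and both helper names are illustrative only.

```python
import json
import re

# Assumed tag shape, inferred from the examples in this README.
TAG_RE = re.compile(r"<[^<>\[\]]+\[\d+\]\.[^<>.]+\.[^<>.]+>")

SPLIT_INSTRUCTION = "Split each composite anonymized key into atomic keys."


def composite_entries(pairs):
    """Keep only the entries whose key is made of two or more consecutive tags."""
    return {key: values for key, values in pairs.items() if len(TAG_RE.findall(key)) > 1}


def build_split_messages(composite):
    """Build the split prompt; default json.dumps spacing matches the template above."""
    mapping = json.dumps(composite, ensure_ascii=False)
    content = f"{SPLIT_INSTRUCTION}\nComposite mapping:\n{mapping}"
    return [{"role": "user", "content": content}]


if __name__ == "__main__":
    pairs = {
        "<职务[3].职务.职务名称><人名[1].中文姓名.姓名>": ["五星村党总支部书记黄丽萍"],
        "<地址[2].行政村.名称>": ["五星村"],
    }
    composite = composite_entries(pairs)
    if composite:
        print(build_split_messages(composite)[0]["content"])
```

Atomic keys are left out of the prompt, so the split call is only needed when `composite_entries` returns something.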
212
 
213
+ ### 6) Using the seek capability to restore tags
214
+ Prompt template:
215
  ```
216
+ messages = [
217
+ {
218
+ "role": "user",
219
+ "content": "The mapping from anonymized entities to original entities:\n{\"<组织[1].新闻机构.完整名称>\":[\"新华社\"],\"<职务[1].新闻传媒.称谓>\":[\"记者\"],\"<人名[1].个人.姓名>\":[\"张毅荣\"],\"<组织[2].科研机构.完整名称>\":[\"罗伯特·科赫研究所\"],\"<职务[2].政府职务.完整称谓>\":[\"德国卫生部长\"],\"<人名[2].个人.姓名>\":[\"劳特巴赫\"],\"<组织[3].政府机构.完整名称>\":[\"联邦议院\"],\"<文件[1].法律法规.正式名称>\":[\"《传染病防治法》\",\"德国最新版《传染病防治法》\"]}\nRestore the original text based on the above mapping:\nAccording to <组织[1].新闻机构.完整名称> in Berlin on March 24 (reported by <职务[1].新闻传媒.称谓> <人名[1].个人.姓名>), the latest pandemic data released on the 24th by Germany’s disease control agency <组织[2].科研机构.完整名称> showed that Germany reported 318,387 new confirmed COVID-19 cases compared to the previous day, marking the first time daily cases exceeded 300,000.\n\nThe data also indicated 300 new COVID-19 related deaths on the 24th, bringing the total death toll to 127,822. The 7-day infection rate set a new record as well, with 1,752 new confirmed cases per 100,000 people over seven days.\n\nOn the 24th, <职务[2].政府职务.完整称谓> <人名[2].个人.姓名> called on all federal states at <组织[3].政府机构.完整名称> to use the new <文件[1].法律法规.正式名称> to strengthen efforts to control the spread of the virus. He said, “We must unite to get through this severe wave of the pandemic.” The <文件[1].法律法规.正式名称> came into effect on the 20th, generally lifting most COVID-19 prevention measures."
220
+ }
 
 
221
  ]
222
  ```
223
 
224
+ Expected output:
225
  ```
 
 
 
 
 
 
 
226
  {
227
+ "role": "assistant",
228
+ "content": "According to Xinhua News Agency in Berlin on March 24 (reported by reporter Zhang Yirong), the latest pandemic data released on the 24th by Germany’s disease control agency Robert Koch Institute showed that Germany reported 318,387 new confirmed COVID-19 cases compared to the previous day, marking the first time daily cases exceeded 300,000.\n\nThe data also indicated 300 new COVID-19 related deaths on the 24th, bringing the total death toll to 127,822. The 7-day infection rate set a new record as well, with 1,752 new confirmed cases per 100,000 people over seven days.\n\nOn the 24th, German Health Minister Lauterbach called on all federal states at the Bundestag to use the new Infection Protection Act to strengthen efforts to control the spread of the virus. He said, “We must unite to get through this severe wave of the pandemic.” The Infection Protection Act came into effect on the 20th, generally lifting most COVID-19 prevention measures."
229
+ }
 
 
 
230
  ```
231
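+ Note: the mapping in the prompt template is the set of tag mappings from the conversation history; the text beginning with According to is taken from the remote LLM's output.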
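
The seek prompt can be assembled the same way as the others, and the restored text can be scanned for tags the model left behind. This is a sketch under assumptions: the leftover-tag check is an idea added for this example, not a documented HaS feature, and the names `build_seek_messages` and `unresolved_tags` are hypothetical.

```python
import json
import re

# Assumed tag shape, inferred from the examples in this README.
TAG_RE = re.compile(r"<[^<>\[\]]+\[\d+\]\.[^<>.]+\.[^<>.]+>")


def build_seek_messages(mapping, remote_output):
    """Build the seek prompt from the accumulated mapping and the remote LLM output."""
    pairs = json.dumps(mapping, ensure_ascii=False, separators=(",", ":"))
    content = (
        "The mapping from anonymized entities to original entities:\n"
        f"{pairs}\n"
        "Restore the original text based on the above mapping:\n"
        f"{remote_output}"
    )
    return [{"role": "user", "content": content}]


def unresolved_tags(restored_text):
    """List any structured tags still present after restoration."""
    return TAG_RE.findall(restored_text)


if __name__ == "__main__":
    mapping = {"<组织[1].新闻机构.完整名称>": ["新华社"]}
    remote = "According to <组织[1].新闻机构.完整名称> in Berlin on March 24"
    print(build_seek_messages(mapping, remote)[0]["content"])
    print(unresolved_tags("According to Xinhua News Agency in Berlin on March 24"))
```

A plain string substitution would insert the Chinese originals into the English text, whereas the expected output above shows the model rendering them in the language of the surrounding text; the sketch therefore only prepares the prompt and checks the result.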