alea-institute committed
Commit 00ceea1 · verified · 1 Parent(s): 575e0ae

Update README with KL3M tokenizer paper citation - README.md

Files changed (1)
  1. README.md +114 -56
README.md CHANGED
@@ -10,6 +10,7 @@ tags:
  - financial
  - enterprise
  - slm
  date: '2024-02-20T00:00:00.000Z'
  pipeline_tag: text-generation
  widget:
@@ -18,51 +19,45 @@ widget:
  - do_sample: True
  ---

- # kl3m-170m Model

- kl3m-170m is a (very) small language model (SLM) model trained on clean, legally-permissible data. Originally
  developed by [273 Ventures](https://273ventures.com) and donated to the [ALEA Institute](https://aleainstitute.ai),
- kl3m-170m was the first LLM to obtain the [Fairly Trained L-Certification](https://www.fairlytrained.org/certifications)
  for its ethical training data and practices. The model is designed for legal, regulatory, and financial workflows,
  with a focus on low toxicity and high efficiency.

- Given its small size and lack of instruction-aligned training data, kl3m-170m is best suited for use either in
  SLM fine-tuning or as part of training larger models without using unethical data or models.

- The model was originally trained between November 2023 and January 2024 on a 12xRTX4090 node in DDP. A similar model is
- being provided with complete source and data replication as part of the `kl3m-004` family to be released in Q4 2024.
-
- ## Source
-
- [https://github.com/alea-institute/kl3m-model-research](https://github.com/alea-institute/kl3m-model-research)
-
-
- ## Training Data
- While the original training data collection and training infrastructure relies on software that was not donated by
- 273 Ventures, ALEA Institute is open-sourcing an improved dataset, including both replication and an API.
-
- [https://github.com/alea-institute/kl3m-data](https://github.com/alea-institute/kl3m-data)
-
- Data is available upon request at this time via S3 under a Requester Pays model. We are actively working on a
- zero-cost distribution model as soon as we can obtain additional support.
-
- This model, the original `kl3m-002-170m` model, was trained on a US-only subset of the Kelvin Legal DataPack that
- we believe is 100% public domain material. However, so as to enforce maximum transparency to all
- downstream users in the event of any future determination otherwise, we are licensing this model under CC-BY 4.0.
-
  ## Model Details

- ### Summary
  - **Architecture**: GPT-NeoX (i.e., ~GPT-3 architecture)
- - **Parameters**: 170 million
- - **Context Window**: 4,096 tokens (true size, no sliding window)
  - **Language(s)**: Primarily English
- - **Tokenizer**: kl3m-001-32k BPE tokenizer (32,768 vocabulary size with unorthodox whitespace handling)
  - **Developed by**: Originally by [273 Ventures LLC](https://273ventures.com), donated to [ALEA Institute](https://aleainstitute.ai)
  - **License**: [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)
  - **Hardware Requirements**: Runs real-time in fp32 on MacBook Air M1

- ## Performance Metrics

  ### Perplexity Scores
  | Dataset | Score |
@@ -81,15 +76,9 @@ larger models as of its training data.
  - **Enterprise Focus**: Specifically designed for legal, regulatory, and financial workflows.
  - **Efficient Deployment**: Optimized for real-time inference on consumer hardware.

- ## Use Cases
-
- - Basic regulatory question answering
- - Contract provision drafting
- - Structured JSON information extraction
- - Foundation for downstream optimization
- - Base model for domain-specific fine-tuning

- ## Getting Started

  ```python
  import json
@@ -119,7 +108,8 @@ print(
  ]
  ```

- ## Contract Example
  ```python
  text = "Governing Law.\n"
  print(
@@ -141,41 +131,109 @@ print(
  ]
  ```

- ## Technical Implementation

  The model implements several techniques during training:

  - Hybrid NTP and SFT cotraining
  - Dynamic, document-aware segmentation
  - Randomized padding
- - Traditional fixed- attention mechanisms

- ## License

- This model was originally developed by 273 Ventures and has been donated to the ALEA Institute.

- The model weights are released under the CC-BY 4.0 License.

- ## Contact

- The KL3M model family is now maintained by the [ALEA Institute](https://aleainstitute.ai). For technical support, collaboration opportunities, or general inquiries:
-
- - GitHub: https://github.com/alea-institute/kl3m-model-research
- - Email: [email protected]
- - Website: https://aleainstitute.ai

- ## Acknowledgments

- Special thanks to 273 Ventures for developing and donating this model to the open-source community through the Alea Institute.


  ## Citation

- Tokenizer, dataset, and model publications are pending.

  ## Contact

- For any questions, please contact [ALEA Institute](https://aleainstitute.ai) at [[email protected]](mailto:[email protected]) or
- create an issue on this repository or [GitHub](https://github.com/alea-institute/kl3m-model-research).

- ![logo](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)
 
  - financial
  - enterprise
  - slm
+ - gpt-neox
  date: '2024-02-20T00:00:00.000Z'
  pipeline_tag: text-generation
  widget:

  - do_sample: True
  ---

+ # kl3m-002-170m

+ kl3m-002-170m is a (very) small language model (SLM) trained on clean, legally-permissible data. Originally
  developed by [273 Ventures](https://273ventures.com) and donated to the [ALEA Institute](https://aleainstitute.ai),
+ kl3m-002-170m was the first LLM to obtain the [Fairly Trained L-Certification](https://www.fairlytrained.org/certifications)
  for its ethical training data and practices. The model is designed for legal, regulatory, and financial workflows,
  with a focus on low toxicity and high efficiency.

+ Given its small size and lack of instruction-aligned training data, kl3m-002-170m is best suited for use either in
  SLM fine-tuning or as part of training larger models without using unethical data or models.

  ## Model Details

  - **Architecture**: GPT-NeoX (i.e., ~GPT-3 architecture)
+ - **Size**: 170 million parameters
+ - **Hidden Size**: 1024
+ - **Layers**: 16
+ - **Attention Heads**: 16
+ - **Key-Value Heads**: 8
+ - **Intermediate Size**: 1024
+ - **Max Sequence Length**: 4,096 tokens (true size, no sliding window)
+ - **Tokenizer**: [kl3m-001-32k](https://huggingface.co/alea-institute/kl3m-001-32k) BPE tokenizer (32,768 vocabulary size with unorthodox whitespace handling)
  - **Language(s)**: Primarily English
+ - **Training Objective**: Next token prediction
  - **Developed by**: Originally by [273 Ventures LLC](https://273ventures.com), donated to [ALEA Institute](https://aleainstitute.ai)
  - **License**: [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)
  - **Hardware Requirements**: Runs real-time in fp32 on MacBook Air M1

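The architecture values above can be checked against the published configuration. A minimal sketch, assuming the standard Hugging Face GPT-NeoX config and tokenizer attributes:

```python
# Minimal sketch: read the published config/tokenizer and compare with the
# specs listed above. Attribute names follow the standard GPT-NeoX config.
from transformers import AutoConfig, AutoTokenizer

repo = "alea-institute/kl3m-002-170m"
config = AutoConfig.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)

print(config.hidden_size)              # expected 1024
print(config.num_hidden_layers)        # expected 16
print(config.num_attention_heads)      # expected 16
print(config.max_position_embeddings)  # expected 4096
print(len(tokenizer))                  # expected 32768 (kl3m-001-32k)
```
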
+ ## Use Cases
+
+ kl3m-002-170m is particularly effective for:
+
+ - Basic regulatory question answering
+ - Contract provision drafting
+ - Structured JSON information extraction
+ - Foundation for downstream optimization
+ - Base model for domain-specific fine-tuning
+
+ ## Performance

  ### Perplexity Scores
  | Dataset | Score |

  - **Enterprise Focus**: Specifically designed for legal, regulatory, and financial workflows.
  - **Efficient Deployment**: Optimized for real-time inference on consumer hardware.

+ ## Usage

+ Basic usage for text generation:

  ```python
  import json

  ]
  ```

+ ### Contract Example
+
  ```python
  text = "Governing Law.\n"
  print(

  ]
  ```

+ ### Generation Parameters
+
+ The model supports various parameters to control the generation process:
+
+ - `temperature`: Controls randomness (lower = more deterministic)
+ - `top_p`: Nucleus sampling parameter (lower = more focused)
+ - `top_k`: Limits vocabulary selection to top k tokens
+ - `max_new_tokens`: Maximum number of tokens to generate
+ - `do_sample`: Whether to use sampling vs. greedy decoding
+ - `num_return_sequences`: Number of different sequences to generate
+
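A minimal sketch of how these parameters combine in a single `generate()` call with the standard `transformers` API; the specific values are illustrative, not recommended defaults:

```python
# Minimal sketch: sampling-based generation using the parameters listed above.
# The parameter values are arbitrary examples.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "alea-institute/kl3m-002-170m"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

inputs = tokenizer("Governing Law.\n", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,           # sample instead of greedy decoding
    temperature=0.5,          # lower = more deterministic
    top_p=0.9,                # nucleus sampling
    top_k=50,                 # restrict sampling to the 50 most likely tokens
    max_new_tokens=32,        # cap on generated length
    num_return_sequences=3,   # return three candidate continuations
)
for sequence in outputs:
    print(tokenizer.decode(sequence, skip_special_tokens=True))
```
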
+ ## Training
+
+ The model was originally trained between November 2023 and January 2024 on a 12xRTX4090 node in DDP. A similar model is
+ being provided with complete source and data replication as part of the `kl3m-004` family to be released in Q4 2024.

  The model implements several techniques during training:

  - Hybrid NTP and SFT cotraining
  - Dynamic, document-aware segmentation
  - Randomized padding
+ - Traditional fixed-attention mechanisms

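The sketch below illustrates only the general idea behind randomized padding, placing a tokenized document at a random offset inside a fixed-length window rather than always left-aligning it. It is a generic illustration, not the KL3M training pipeline; the pad token ID of 2 is taken from the Special Tokens section below.

```python
# Generic illustration of randomized padding (not the KL3M training code):
# pad a tokenized document to a fixed window length, splitting the padding
# randomly between the left and right sides.
import random

PAD_ID = 2          # <pad> token ID per the Special Tokens section
MAX_LENGTH = 4096   # model context window

def randomly_pad(token_ids: list[int], max_length: int = MAX_LENGTH) -> list[int]:
    token_ids = token_ids[:max_length]
    total_pad = max_length - len(token_ids)
    left_pad = random.randint(0, total_pad)
    return [PAD_ID] * left_pad + token_ids + [PAD_ID] * (total_pad - left_pad)
```
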
+ ### Training Data

+ While the original training data collection and training infrastructure relies on software that was not donated by
+ 273 Ventures, ALEA Institute is open-sourcing an improved dataset, including both replication and an API.

+ [https://github.com/alea-institute/kl3m-data](https://github.com/alea-institute/kl3m-data)

+ Data is available upon request at this time via S3 under a Requester Pays model. We are actively working on a
+ zero-cost distribution model as soon as we can obtain additional support.

+ This model, the original `kl3m-002-170m` model, was trained on a US-only subset of the Kelvin Legal DataPack that
+ we believe is 100% public domain material. However, so as to enforce maximum transparency to all
+ downstream users in the event of any future determination otherwise, we are licensing this model under CC-BY 4.0.
+
+ ## Intended Usage
+
+ This model is intended for use in:
+
+ - Legal and regulatory document processing systems
+ - Contract drafting assistance
+ - Financial and enterprise document workflows
+ - Educational contexts for learning about domain-specific language models
+ - Research on small, efficient language models
+
+ ## Special Tokens
+
+ kl3m-002-170m uses the following special tokens:

+ - `<s>` (ID: 0): Beginning of sequence token (BOS)
+ - `</s>` (ID: 1): End of sequence token (EOS)
+ - `<pad>` (ID: 2): Padding token

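These IDs can be confirmed from the tokenizer itself; a minimal sketch, assuming the tokens are registered on the tokenizer as its special tokens:

```python
# Minimal sketch: check the documented special-token IDs.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-002-170m")

for token, expected_id in [("<s>", 0), ("</s>", 1), ("<pad>", 2)]:
    print(token, tokenizer.convert_tokens_to_ids(token), "expected:", expected_id)
```
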
+ ## Limitations

+ - Limited to a 4,096 token context window
+ - As a small language model (170M parameters), it has limited general knowledge
+ - Not instruction-tuned or aligned with human preferences
+ - May generate plausible-sounding but incorrect legal or regulatory text
+ - Not a substitute for professional legal advice or domain expertise
+ - Performance is optimized for legal and financial domains; general performance may be lower
+
+ ## Ethical Considerations
+
+ - This model should not be used to generate legal advice without human expert review
+ - The model may reflect biases present in the training data despite efforts to use clean data
+ - While trained on ethically sourced data, users should verify outputs for accuracy and appropriateness
+
+ ## Source
+
+ [https://github.com/alea-institute/kl3m-model-research](https://github.com/alea-institute/kl3m-model-research)
+
+ ## References
+
+ - [KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications](https://arxiv.org/abs/2503.17247)
+ - Additional tokenizer, dataset, and model publications are pending.

  ## Citation

+ ```bibtex
+ @misc{kl3m-002-170m,
+ author = {ALEA Institute},
+ title = {kl3m-002-170m: A Small Language Model for Legal and Regulatory Text},
+ year = {2024},
+ publisher = {Hugging Face},
+ howpublished = {\url{https://huggingface.co/alea-institute/kl3m-002-170m}}
+ }
+ ```
+
+ ## License
+
+ This model was originally developed by 273 Ventures and has been donated to the ALEA Institute.
+
+ The model weights are released under the CC-BY 4.0 License.

  ## Contact

+ The KL3M model family is now maintained by the [ALEA Institute](https://aleainstitute.ai). For technical support, collaboration opportunities, or general inquiries:
+
+ - GitHub: https://github.com/alea-institute/kl3m-model-research
+ - Email: [email protected]
+ - Website: https://aleainstitute.ai

+ ![logo](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)