Update README with KL3M tokenizer paper citation - README.md

tags:
- financial
- enterprise
- slm
- gpt-neox
date: '2024-02-20T00:00:00.000Z'
pipeline_tag: text-generation
widget:
- do_sample: True
---

# kl3m-002-170m

kl3m-002-170m is a (very) small language model (SLM) trained on clean, legally permissible data. Originally
developed by [273 Ventures](https://273ventures.com) and donated to the [ALEA Institute](https://aleainstitute.ai),
kl3m-002-170m was the first LLM to obtain the [Fairly Trained L-Certification](https://www.fairlytrained.org/certifications)
for its ethical training data and practices. The model is designed for legal, regulatory, and financial workflows,
with a focus on low toxicity and high efficiency.

Given its small size and lack of instruction-aligned training data, kl3m-002-170m is best suited for use either in
SLM fine-tuning or as part of training larger models without using unethical data or models.

## Model Details

- **Architecture**: GPT-NeoX (i.e., ~GPT-3 architecture)
- **Size**: 170 million parameters
- **Hidden Size**: 1024
- **Layers**: 16
- **Attention Heads**: 16
- **Key-Value Heads**: 8
- **Intermediate Size**: 1024
- **Max Sequence Length**: 4,096 tokens (true size, no sliding window)
- **Tokenizer**: [kl3m-001-32k](https://huggingface.co/alea-institute/kl3m-001-32k) BPE tokenizer (32,768-token vocabulary with unorthodox whitespace handling)
- **Language(s)**: Primarily English
- **Training Objective**: Next token prediction
- **Developed by**: Originally by [273 Ventures LLC](https://273ventures.com), donated to [ALEA Institute](https://aleainstitute.ai)
- **License**: [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)
- **Hardware Requirements**: Runs in real time in fp32 on a MacBook Air M1
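
As a quick check on the figures above, the following minimal sketch (assuming the standard `transformers` GPT-NeoX configuration schema and the `alea-institute/kl3m-002-170m` repository id from the citation below) loads the configuration and tokenizer and prints the corresponding fields:

```python
from transformers import AutoConfig, AutoTokenizer

# Repository id taken from the citation below; adjust if the weights are hosted elsewhere.
MODEL_ID = "alea-institute/kl3m-002-170m"

config = AutoConfig.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Standard GPT-NeoX config fields in transformers; expected values per the list above.
print("hidden_size:", config.hidden_size)                          # 1024
print("num_hidden_layers:", config.num_hidden_layers)              # 16
print("num_attention_heads:", config.num_attention_heads)          # 16
print("max_position_embeddings:", config.max_position_embeddings)  # 4096
print("vocab_size:", tokenizer.vocab_size)                          # 32,768
```
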
## Use Cases

kl3m-002-170m is particularly effective for:

- Basic regulatory question answering
- Contract provision drafting
- Structured JSON information extraction
- Foundation for downstream optimization
- Base model for domain-specific fine-tuning

## Performance

### Perplexity Scores

| Dataset | Score |

- **Enterprise Focus**: Specifically designed for legal, regulatory, and financial workflows.
- **Efficient Deployment**: Optimized for real-time inference on consumer hardware.

## Usage

Basic usage for text generation:

```python
import json
# ...
]
```
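
A self-contained variant of the usage example, written against the Hugging Face `pipeline` API; the prompt and generation settings here are illustrative assumptions rather than the exact values from the original snippet:

```python
import json

from transformers import pipeline

# Illustrative setup; the repository id follows the citation at the end of this card.
generator = pipeline("text-generation", model="alea-institute/kl3m-002-170m")

results = generator(
    "Under this Agreement, the Borrower shall",  # hypothetical legal-style prompt
    do_sample=True,
    temperature=0.5,
    max_new_tokens=32,
    num_return_sequences=3,
)

# Print the sampled continuations as a JSON list.
print(json.dumps([r["generated_text"] for r in results], indent=2))
```
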
### Contract Example

```python
text = "Governing Law.\n"
print(
    # ...
]
```

### Generation Parameters

The model supports various parameters to control the generation process:

- `temperature`: Controls randomness (lower = more deterministic)
- `top_p`: Nucleus sampling parameter (lower = more focused)
- `top_k`: Limits vocabulary selection to top k tokens
- `max_new_tokens`: Maximum number of tokens to generate
- `do_sample`: Whether to use sampling vs. greedy decoding
- `num_return_sequences`: Number of different sequences to generate
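
As a sketch of how these parameters map onto the lower-level `generate` API (the numeric values below are illustrative assumptions, not recommended settings from the model authors):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "alea-institute/kl3m-002-170m"  # repository id from the citation below

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Encode a short legal-style prompt (same prompt as the contract example above).
inputs = tokenizer("Governing Law.\n", return_tensors="pt")

# Each of the parameters described above is a keyword argument to generate().
outputs = model.generate(
    **inputs,
    do_sample=True,             # sample instead of greedy decoding
    temperature=0.5,            # lower = more deterministic
    top_p=0.9,                  # nucleus sampling
    top_k=50,                   # restrict sampling to the 50 most likely tokens
    max_new_tokens=32,          # cap on newly generated tokens
    num_return_sequences=3,     # return three alternative continuations
    pad_token_id=tokenizer.pad_token_id,
)

for sequence in outputs:
    print(tokenizer.decode(sequence, skip_special_tokens=True))
```
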
## Training

The model was originally trained between November 2023 and January 2024 on a 12xRTX4090 node using DDP. A similar model is
being provided with complete source and data replication as part of the `kl3m-004` family to be released in Q4 2024.

The model implements several techniques during training:

- Hybrid NTP and SFT cotraining
- Dynamic, document-aware segmentation
- Randomized padding
- Traditional fixed-attention mechanisms

### Training Data

While the original training data collection and training infrastructure rely on software that was not donated by
273 Ventures, the ALEA Institute is open-sourcing an improved dataset, including both replication and an API.

[https://github.com/alea-institute/kl3m-data](https://github.com/alea-institute/kl3m-data)

Data is currently available upon request via S3 under a Requester Pays model. We are actively working on a
zero-cost distribution model as soon as we can obtain additional support.

This model, the original `kl3m-002-170m` model, was trained on a US-only subset of the Kelvin Legal DataPack that
we believe is 100% public domain material. However, to ensure maximum transparency for all
downstream users in the event of any future determination otherwise, we are licensing this model under CC-BY 4.0.

## Intended Usage

This model is intended for use in:

- Legal and regulatory document processing systems
- Contract drafting assistance
- Financial and enterprise document workflows
- Educational contexts for learning about domain-specific language models
- Research on small, efficient language models

## Special Tokens

kl3m-002-170m uses the following special tokens:

- `<s>` (ID: 0): Beginning of sequence token (BOS)
- `</s>` (ID: 1): End of sequence token (EOS)
- `<pad>` (ID: 2): Padding token
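
A minimal sketch for confirming these assignments directly from the tokenizer (assuming the `alea-institute/kl3m-002-170m` repository id from the citation below):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-002-170m")

# These should match the list above: <s> -> 0, </s> -> 1, <pad> -> 2.
print(tokenizer.bos_token, tokenizer.bos_token_id)
print(tokenizer.eos_token, tokenizer.eos_token_id)
print(tokenizer.pad_token, tokenizer.pad_token_id)
```
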
## Limitations

- Limited to a 4,096 token context window
- As a small language model (170M parameters), it has limited general knowledge
- Not instruction-tuned or aligned with human preferences
- May generate plausible-sounding but incorrect legal or regulatory text
- Not a substitute for professional legal advice or domain expertise
- Performance is optimized for legal and financial domains; general performance may be lower

## Ethical Considerations

- This model should not be used to generate legal advice without human expert review
- The model may reflect biases present in the training data despite efforts to use clean data
- While trained on ethically sourced data, users should verify outputs for accuracy and appropriateness

## Source

[https://github.com/alea-institute/kl3m-model-research](https://github.com/alea-institute/kl3m-model-research)

## References

- [KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications](https://arxiv.org/abs/2503.17247)
- Additional tokenizer, dataset, and model publications are pending.

## Citation

```bibtex
@misc{kl3m-002-170m,
  author = {ALEA Institute},
  title = {kl3m-002-170m: A Small Language Model for Legal and Regulatory Text},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/alea-institute/kl3m-002-170m}}
}
```

## License

This model was originally developed by 273 Ventures and has been donated to the ALEA Institute.

The model weights are released under the CC-BY 4.0 License.

## Contact

The KL3M model family is now maintained by the [ALEA Institute](https://aleainstitute.ai). For technical support, collaboration opportunities, or general inquiries:

- GitHub: https://github.com/alea-institute/kl3m-model-research
- Email: [email protected]
- Website: https://aleainstitute.ai