Update README with KL3M tokenizer paper citation - README.md

tags:
- financial
- enterprise
- slm
- gpt-neox
date: '2024-02-20T00:00:00.000Z'
pipeline_tag: text-generation
widget:
- do_sample: True
---

# kl3m-002-170m

kl3m-002-170m is a (very) small language model (SLM) trained on clean, legally permissible data. Originally
developed by [273 Ventures](https://273ventures.com) and donated to the [ALEA Institute](https://aleainstitute.ai),
kl3m-002-170m was the first LLM to obtain the [Fairly Trained L-Certification](https://www.fairlytrained.org/certifications)
for its ethical training data and practices. The model is designed for legal, regulatory, and financial workflows,
with a focus on low toxicity and high efficiency.

Given its small size and lack of instruction-aligned training data, kl3m-002-170m is best suited for use either in
SLM fine-tuning or as part of training larger models without using unethical data or models.

## Model Details

- **Architecture**: GPT-NeoX (i.e., ~GPT-3 architecture)
- **Size**: 170 million parameters
- **Hidden Size**: 1024
- **Layers**: 16
- **Attention Heads**: 16
- **Key-Value Heads**: 8
- **Intermediate Size**: 1024
- **Max Sequence Length**: 4,096 tokens (true size, no sliding window)
- **Tokenizer**: [kl3m-001-32k](https://huggingface.co/alea-institute/kl3m-001-32k) BPE tokenizer (32,768-token vocabulary with unorthodox whitespace handling)
- **Language(s)**: Primarily English
- **Training Objective**: Next token prediction
- **Developed by**: Originally by [273 Ventures LLC](https://273ventures.com), donated to [ALEA Institute](https://aleainstitute.ai)
- **License**: [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)
- **Hardware Requirements**: Runs in real time in fp32 on a MacBook Air M1
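
As a quick check on the figures above, the following minimal sketch (assuming the standard `transformers` GPT-NeoX configuration schema and the `alea-institute/kl3m-002-170m` repository id from the citation below) loads the configuration and tokenizer and prints the corresponding fields:

```python
from transformers import AutoConfig, AutoTokenizer

# Repository id taken from the citation below; adjust if the weights are hosted elsewhere.
MODEL_ID = "alea-institute/kl3m-002-170m"

config = AutoConfig.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Standard GPT-NeoX config fields in transformers; expected values per the list above.
print("hidden_size:", config.hidden_size)                          # 1024
print("num_hidden_layers:", config.num_hidden_layers)              # 16
print("num_attention_heads:", config.num_attention_heads)          # 16
print("max_position_embeddings:", config.max_position_embeddings)  # 4096
print("vocab_size:", tokenizer.vocab_size)                          # 32,768
```
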
## Use Cases

kl3m-002-170m is particularly effective for:

- Basic regulatory question answering
- Contract provision drafting
- Structured JSON information extraction
- Foundation for downstream optimization
- Base model for domain-specific fine-tuning

## Performance

### Perplexity Scores

| Dataset | Score |

- **Enterprise Focus**: Specifically designed for legal, regulatory, and financial workflows.
- **Efficient Deployment**: Optimized for real-time inference on consumer hardware.

## Usage

Basic usage for text generation:

```python
import json
# ...
]
```
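
A self-contained variant of the usage example, written against the Hugging Face `pipeline` API; the prompt and generation settings here are illustrative assumptions rather than the exact values from the original snippet:

```python
import json

from transformers import pipeline

# Illustrative setup; the repository id follows the citation at the end of this card.
generator = pipeline("text-generation", model="alea-institute/kl3m-002-170m")

results = generator(
    "Under this Agreement, the Borrower shall",  # hypothetical legal-style prompt
    do_sample=True,
    temperature=0.5,
    max_new_tokens=32,
    num_return_sequences=3,
)

# Print the sampled continuations as a JSON list.
print(json.dumps([r["generated_text"] for r in results], indent=2))
```
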
### Contract Example

```python
text = "Governing Law.\n"
print(
    # ...
]
```

### Generation Parameters

The model supports various parameters to control the generation process:

- `temperature`: Controls randomness (lower = more deterministic)
- `top_p`: Nucleus sampling parameter (lower = more focused)
- `top_k`: Limits vocabulary selection to top k tokens
- `max_new_tokens`: Maximum number of tokens to generate
- `do_sample`: Whether to use sampling vs. greedy decoding
- `num_return_sequences`: Number of different sequences to generate
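
As a sketch of how these parameters map onto the lower-level `generate` API (the numeric values below are illustrative assumptions, not recommended settings from the model authors):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "alea-institute/kl3m-002-170m"  # repository id from the citation below

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Encode a short legal-style prompt (same prompt as the contract example above).
inputs = tokenizer("Governing Law.\n", return_tensors="pt")

# Each of the parameters described above is a keyword argument to generate().
outputs = model.generate(
    **inputs,
    do_sample=True,             # sample instead of greedy decoding
    temperature=0.5,            # lower = more deterministic
    top_p=0.9,                  # nucleus sampling
    top_k=50,                   # restrict sampling to the 50 most likely tokens
    max_new_tokens=32,          # cap on newly generated tokens
    num_return_sequences=3,     # return three alternative continuations
    pad_token_id=tokenizer.pad_token_id,
)

for sequence in outputs:
    print(tokenizer.decode(sequence, skip_special_tokens=True))
```
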
## Training

The model was originally trained between November 2023 and January 2024 on a 12xRTX4090 node using DDP. A similar model is
being provided with complete source and data replication as part of the `kl3m-004` family to be released in Q4 2024.

The model implements several techniques during training:

- Hybrid NTP and SFT cotraining
- Dynamic, document-aware segmentation
- Randomized padding
- Traditional fixed-attention mechanisms

### Training Data

While the original training data collection and training infrastructure rely on software that was not donated by
273 Ventures, the ALEA Institute is open-sourcing an improved dataset, including both replication and an API.

[https://github.com/alea-institute/kl3m-data](https://github.com/alea-institute/kl3m-data)

Data is currently available upon request via S3 under a Requester Pays model. We are actively working on a
zero-cost distribution model as soon as we can obtain additional support.

This model, the original `kl3m-002-170m` model, was trained on a US-only subset of the Kelvin Legal DataPack that
we believe is 100% public domain material. However, to ensure maximum transparency for all
downstream users in the event of any future determination otherwise, we are licensing this model under CC-BY 4.0.

## Intended Usage

This model is intended for use in:

- Legal and regulatory document processing systems
- Contract drafting assistance
- Financial and enterprise document workflows
- Educational contexts for learning about domain-specific language models
- Research on small, efficient language models

## Special Tokens

kl3m-002-170m uses the following special tokens:

- `<s>` (ID: 0): Beginning of sequence token (BOS)
- `</s>` (ID: 1): End of sequence token (EOS)
- `<pad>` (ID: 2): Padding token
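
A minimal sketch for confirming these assignments directly from the tokenizer (assuming the `alea-institute/kl3m-002-170m` repository id from the citation below):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-002-170m")

# These should match the list above: <s> -> 0, </s> -> 1, <pad> -> 2.
print(tokenizer.bos_token, tokenizer.bos_token_id)
print(tokenizer.eos_token, tokenizer.eos_token_id)
print(tokenizer.pad_token, tokenizer.pad_token_id)
```
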
## Limitations

- Limited to a 4,096 token context window
- As a small language model (170M parameters), it has limited general knowledge
- Not instruction-tuned or aligned with human preferences
- May generate plausible-sounding but incorrect legal or regulatory text
- Not a substitute for professional legal advice or domain expertise
- Performance is optimized for legal and financial domains; general performance may be lower

## Ethical Considerations

- This model should not be used to generate legal advice without human expert review
- The model may reflect biases present in the training data despite efforts to use clean data
- While trained on ethically sourced data, users should verify outputs for accuracy and appropriateness

## Source

[https://github.com/alea-institute/kl3m-model-research](https://github.com/alea-institute/kl3m-model-research)

## References

- [KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications](https://arxiv.org/abs/2503.17247)
- Additional tokenizer, dataset, and model publications are pending.

## Citation

```bibtex
@misc{kl3m-002-170m,
  author = {ALEA Institute},
  title = {kl3m-002-170m: A Small Language Model for Legal and Regulatory Text},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/alea-institute/kl3m-002-170m}}
}
```

## License

This model was originally developed by 273 Ventures and has been donated to the ALEA Institute.

The model weights are released under the CC-BY 4.0 License.

## Contact

The KL3M model family is now maintained by the [ALEA Institute](https://aleainstitute.ai). For technical support, collaboration opportunities, or general inquiries:

- GitHub: https://github.com/alea-institute/kl3m-model-research
- Email: [email protected]
- Website: https://aleainstitute.ai