Update README.md
add info about kv cache saving
README.md
CHANGED
@@ -1,4 +1,5 @@
 ---
+library_name: transformers
 datasets:
 - cerebras/SlimPajama-627B
 language:
@@ -65,6 +66,8 @@ print(response[0]["generated_text"])
 
 ## The LCKV Collection
 
+The model has 2 warmup layers, i.e. it keeps 3/22 of the KV cache of a standard TinyLlama.
+
 This model was first initialized from the [TinyLlama 2.5T checkpoint](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T), then continued pre-training on 100B tokens from [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B).
 
 Since the model structure has been changed, the initialization cannot inherit the performance of the TinyLlama checkpoint, but it still speeds up training considerably compared to pre-training from scratch.
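A note on the new "3/22 KV cache" line (an interpretation, not taken from the model card itself): if this checkpoint follows the usual LCKV setup, only the 2 warmup layers at the bottom plus the single top layer keep their own key/value cache, while the remaining layers attend to the top layer's KVs, so 2 + 1 = 3 of TinyLlama's 22 layers store a cache. A minimal back-of-the-envelope sketch of what that means for memory, assuming TinyLlama-1.1B attention dimensions (4 KV heads of size 64) and fp16 storage; the helper function name is hypothetical:

```python
# Back-of-the-envelope sketch (not from the model card): how "3/22 KV cache"
# translates into memory. Dimensions are assumed TinyLlama-1.1B values
# (4 KV heads, head dim 64) in fp16; adjust if the actual config differs.
def kv_cache_bytes(cached_layers, seq_len, batch=1,
                   kv_heads=4, head_dim=64, bytes_per_elem=2):
    # one K tensor and one V tensor per layer that keeps a cache
    return 2 * cached_layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

standard = kv_cache_bytes(cached_layers=22, seq_len=2048)  # vanilla TinyLlama: every layer caches KV
lckv = kv_cache_bytes(cached_layers=3, seq_len=2048)       # 2 warmup layers + 1 top layer

print(f"standard: {standard / 2**20:.0f} MiB, LCKV: {lckv / 2**20:.0f} MiB, "
      f"ratio: {lckv / standard:.3f}")  # ratio = 3/22 ≈ 0.136
```

Because each cached layer stores KVs of the same shape, the 3/22 ratio holds regardless of sequence length or batch size.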