Update README.md
README.md (changed)
```yaml
license: apache-2.0
datasets:
- openbmb/Ultra-FineWeb
# … (front-matter lines elided in the diff view)
language:
- en
- zh
pipeline_tag: text-generation
```
# Model Card for MiniModel-200M-Base

This model card provides an overview of **MiniModel-200M-Base**, a highly efficient 200M-parameter decoder-only transformer trained with state-of-the-art techniques for maximum data and compute efficiency.

## Model Details

### Model Description

- **Developed by:** xTimeCrystal
- **Model type:** Softmax self-attention decoder-only transformer
- **Languages:** English, Chinese, Python
- **License:** Apache 2.0
This model leverages cutting-edge training techniques to achieve strong performance with only 10B tokens of training data, trained in just one day on a single RTX 5090 GPU. As demonstrated below, it handles diverse tasks, from factual recall to coherent article generation, despite its small size.

Key innovations include:

- **Adaptive Muon optimizer**: Based on the Muon optimizer, it delivers 2.1× the data efficiency of AdamW. Momentum buffers are stored in bfloat16, further reducing VRAM usage.
- **Aggressive data filtering**: A curated selection of high-quality educational content enhances performance in resource-constrained settings.
- **Efficient data bin-packing**: To minimize padding waste (originally >70%), sequences were concatenated via a bin-packing algorithm to reach near-full 2048-token lengths, reducing padding to <5% (a sketch follows the training curve below).
- **Float8 pretraining**: Training used bfloat16 master weights, fp8 (e4m3) casting with bfloat16 accumulation, and full bfloat16 backward passes. The attention mechanism was kept in bfloat16 to avoid loss degradation. This setup matches full bfloat16 performance while cutting VRAM usage by ~30% and boosting throughput by ~20%.
- **ReLU² activation**: This ultra-sparse activation outperforms SwiGLU ([1](https://arxiv.org/abs/2109.08668v2), [2](https://arxiv.org/abs/2402.03804)) while requiring only two matrix multiplications, marginally improving VRAM usage (a sketch follows this list).
- **Full attention**: All layers use standard softmax attention (no sliding window or grouped-query attention), preserving capacity in a small model.
- **QK Norm without scalars**: Removing the learnable scalars improved training stability by preventing loss spikes and excessive attention activations.
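To make the "two matrix multiplications" point concrete, here is a minimal sketch of a feed-forward block with the ReLU² activation. The class name is illustrative and the dimensions follow the config shown later; this is not the repository's actual `model.py`:

```python
import torch
import torch.nn as nn

class ReLU2FFN(nn.Module):
    """Feed-forward block with ReLU² activation: two matmuls, no gating branch."""

    def __init__(self, input_dims: int = 768, hidden_dims: int = 3072):
        super().__init__()
        self.up = nn.Linear(input_dims, hidden_dims, bias=False)
        self.down = nn.Linear(hidden_dims, input_dims, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # relu(x)**2 keeps ReLU's sparsity while smoothing the kink at zero.
        return self.down(torch.relu(self.up(x)) ** 2)
```

By contrast, a SwiGLU block needs a third projection for its gate, so ReLU² saves one weight matrix per layer.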
These optimizations enabled **lossless training for 110k steps** with a massive batch size of 64 × 2048 tokens **without gradient accumulation**, while staying under 30 GB of VRAM and remaining completely spike-free:

![Training loss curve](/xTimeCrystal/MiniModel-200M-Base/resolve/main/image.png)
## Intended Uses

This model is designed for efficient inference and experimentation in low-resource environments. It is suitable for educational use, prototyping, and applications where model size and speed are critical. Its users include researchers, developers, and hobbyists working with constrained hardware.

## Getting Started

Download all files from the repository into a single folder, then run the notebook cells.
### Loading the Model

```python
import torch
from safetensors import safe_open

from model import Transformer as Model
from transformers import PreTrainedTokenizerFast

config = {
    # (additional entries, e.g. the number of layers, are elided in the diff view)
    'num_heads': 12,
    'vocab_size': 32768,
    'input_dims': 768,
    'hidden_dims': 3072,
}

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.set_default_device(device)

model = Model(**config)
model.bfloat16()  # weights are stored and run in bfloat16

# Load the pretrained weights from the safetensors checkpoint.
saved_states = {}
with safe_open("./model.safetensors", framework="pt", device=device) as f:
    for key in f.keys():
        saved_states[key] = f.get_tensor(key)
model.load_state_dict(saved_states)
model.eval()

tokenizer = PreTrainedTokenizerFast.from_pretrained("./")
```
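As an optional sanity check (not part of the original notebook), you can confirm the loaded model's size:

```python
# Hypothetical check: the total should be in the ~200M-parameter range.
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params / 1e6:.1f}M")
```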
### Example: Fibonacci Generation

```python
tokens = tokenizer('''def fibonacci(n: int):''')['input_ids']
current = tokenizer.decode(tokens)
print(current, end="")

temperature = 1e-4  # near-greedy sampling
for _ in range(128):
    tok = torch.tensor(tokens).reshape(1, -1)
    logits = model(tok)
    # Sample the next token from the temperature-scaled distribution.
    nxt = torch.multinomial(
        torch.softmax(logits[:, -1].float() / temperature, dim=-1).squeeze(),
        num_samples=1,
    ).item()
    tokens += [nxt]
    # Print only the newly generated text.
    print(tokenizer.decode(tokens).replace(current, "", 1), end="")
    current = tokenizer.decode(tokens)
```
**Output:**

```
<s> def fibonacci(n: int):
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

def fibonacci_recursive(n: int):
    if n < 2:
        return n
    return fibonacci_recursive(n - 1) + fibonacci_recursive(n - 2)

def fibonacci_iterative(n: int):
    if n < 2:
        return n
    return fibonacci_iterative
```
### Additional Examples

- **Digits of π** (`temperature=0.0001`): correctly recites the first 20 digits: `3.14159265358979323846...`

- **“The purpose of life”** (`temperature=0.8`): produces a coherent, skill-focused philosophical reflection:

  ```
  <s> The purpose of life is to build up the body’s strength, endurance, and energy reserves through the accumulation of acquired skills, and to get rid of worn or damaged parts of the body. All of this depends on day’s activities and deeds. The process of building up the body and taking on new challenges, such as accumulating health, will require the use of skills and abilities.

  The main purpose of building up skills and abilities in life is to make new people capable of doing the things that they need to do. This process requires you to develop skills that are applicable to everyday life. Skills can either be formal, or in the
  ```

Additional examples can be found in the Jupyter notebook file.
> **Tip:** Increase `temperature` to reduce repetition and encourage creativity. A temperature of 0.8 is recommended for general use; 0.0001 works best for factual recall.
### Bias, Risks, and Limitations

Despite strong performance in many areas, this 200M-parameter model is not infallible. For instance, when prompted with *“The radius of the Earth”*, it outputs:

```
<s> The radius of the Earth is a measure of almost exactly 375,000 miles.
Scientists have long wondered what the planet was like long ago. Because of how old the Earth is—that is, the oldest part of it—we know that the Earth’s radius is about 670,000 miles. ...
```

This is off by roughly two orders of magnitude (actual mean radius: ~3,959 miles). Users should **verify all factual claims** and avoid relying on the model for high-stakes decisions.

## Citation