Text Generation
Safetensors
English
Chinese
xTimeCrystal committed on
Commit 770709d · verified · 1 Parent(s): 10a51de

Update README.md

Files changed (1)
1. README.md +49 -90
README.md CHANGED
---
license: apache-2.0
datasets:
- openbmb/Ultra-FineWeb
language:
- en
- zh
pipeline_tag: text-generation
---
 
 
# Model Card for MiniModel-200M-Base

This model card provides an overview of **MiniModel-200M-Base**, a highly efficient 200M-parameter decoder-only transformer trained with state-of-the-art techniques for maximum data and compute efficiency.

## Model Details

### Model Description

- **Developed by:** xTimeCrystal
- **Model type:** Softmax self-attention decoder-only transformer
- **Languages:** English, Chinese, Python
- **License:** Apache 2.0

This model leverages cutting-edge training techniques to achieve strong performance with only 10B tokens of training data, trained in just one day on a single RTX 5090 GPU. As demonstrated below, it handles diverse tasks, from factual recall to coherent article generation, despite its small size.
 
Key innovations include:

- **Adaptive Muon optimizer**: Based on the Muon optimizer, it delivers roughly 2.1x the data efficiency of AdamW. Momentum buffers are stored in bfloat16, further reducing VRAM usage.
- **Aggressive data filtering**: A curated selection of high-quality educational content enhances performance in resource-constrained settings.
- **Efficient data bin-packing**: To minimize padding waste (originally >70% of processed tokens), sequences were concatenated via a simple bin-packing algorithm to reach near-full 2048-token lengths, reducing padding to <5% (see the packing sketch after this list).
- **Float8 pretraining**: Training used bfloat16 master weights, fp8 (e4m3) casting with bfloat16 accumulation, and full bfloat16 backward passes. The attention mechanism was kept in bfloat16 to avoid loss degradation. This setup matches full bfloat16 performance while cutting VRAM usage by ~30% and boosting throughput by ~20% (see the casting sketch after this list).
- **ReLU² activation**: This ultra-sparse activation outperforms SwiGLU ([1](https://arxiv.org/abs/2109.08668v2), [2](https://arxiv.org/abs/2402.03804)) while requiring only two matrix multiplications, marginally improving VRAM usage (sketched after this list).
- **Full attention**: All layers use standard softmax attention (no sliding window or grouped-query attention), preserving capacity in a small model.
- **QK Norm without scalars**: Removing the learnable scalars improved training stability by preventing loss spikes and excessive attention activations.
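
To make the list above concrete, here are minimal sketches of three of these components. They are illustrative reimplementations written for this card under stated assumptions, not the code in the repository's `model.py` or training scripts.

The ReLU² feed-forward block and scalar-free QK normalization can be written as follows (dimension names follow the `config` shown later in this card):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReLU2FFN(nn.Module):
    """Feed-forward block with the ReLU^2 activation: only two matmuls."""

    def __init__(self, input_dims: int, hidden_dims: int):
        super().__init__()
        self.up = nn.Linear(input_dims, hidden_dims, bias=False)
        self.down = nn.Linear(hidden_dims, input_dims, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.relu(self.up(x)) ** 2)


def qk_norm(q: torch.Tensor, k: torch.Tensor, eps: float = 1e-6):
    """Scalar-free QK norm: unit-normalize queries and keys along the head
    dimension, with no learnable gain, before computing attention scores."""
    q = q / (q.norm(dim=-1, keepdim=True) + eps)
    k = k / (k.norm(dim=-1, keepdim=True) + eps)
    return q, k
```

The bin-packing step can be approximated with a greedy first-fit pass over the tokenized documents (each assumed to already begin with the `<s>` token); the exact packing code used for training is not shown in this card:

```python
def pack_sequences(docs: list[list[int]], max_len: int = 2048) -> list[list[int]]:
    """Concatenate tokenized documents into bins of at most `max_len` tokens
    so that almost no padding is needed afterwards."""
    bins: list[list[int]] = []
    for doc in sorted(docs, key=len, reverse=True):
        doc = doc[:max_len]                   # hard truncation at the context length
        for b in bins:
            if len(b) + len(doc) <= max_len:  # first bin with enough room
                b.extend(doc)
                break
        else:
            bins.append(list(doc))            # otherwise start a new bin
    return bins
```

Finally, the fp8 (e4m3) casting can be illustrated with a per-tensor scaled round trip using `torch.float8_e4m3fn` (available in recent PyTorch releases); the actual training pipeline additionally relies on fp8-capable matmul kernels, which are omitted here:

```python
def to_fp8_e4m3(x: torch.Tensor):
    """Scaled cast to fp8 e4m3; returns the fp8 tensor and its dequant scale."""
    amax = x.abs().max().clamp(min=1e-12)
    scale = 448.0 / amax                      # 448 is the largest e4m3 normal value
    return (x.float() * scale).to(torch.float8_e4m3fn), scale


def from_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Dequantize back to bfloat16."""
    return (x_fp8.to(torch.float32) / scale).to(torch.bfloat16)
```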
 
These optimizations enabled **lossless training for 110k steps** with a massive batch size of 64 × 2048 tokens **without gradient accumulation**, while staying under 30 GB of VRAM and remaining completely spike-free:

![Training loss curve](https://cdn-uploads.huggingface.co/production/uploads/66a767dcbe4c3c2683495a8b/L7AuCdoEGrEVprBKIbks2.png)
 
## Intended Uses

This model is designed for efficient inference and experimentation in low-resource environments. It is suitable for educational use, prototyping, and applications where model size and speed are critical. Users include researchers, developers, and hobbyists working with constrained hardware.

## Getting Started

Download all files from the repository into a single folder and run the notebook cells.
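
If you prefer to fetch the files programmatically, a `huggingface_hub` snapshot download along these lines should work (the repo id is inferred from the model name and may need adjusting):

```python
from huggingface_hub import snapshot_download

# Download model.py, model.safetensors, the tokenizer files and the notebook
# into the current directory.
snapshot_download(repo_id="xTimeCrystal/MiniModel-200M-Base", local_dir="./")
```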
 
### Loading the Model

```python
import torch
from safetensors import safe_open

from model import Transformer as Model
from transformers import PreTrainedTokenizerFast

# Model hyperparameters (must match the pretrained checkpoint).
config = {
    'num_heads': 12,
    'vocab_size': 32768,
    'input_dims': 768,
    'hidden_dims': 3072,
}

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.set_default_device(device)

# Instantiate the model in bfloat16 and load the safetensors checkpoint.
model = Model(**config)
model.bfloat16()

saved_states = {}
with safe_open("./model.safetensors", framework="pt", device=device) as f:
    for key in f.keys():
        saved_states[key] = f.get_tensor(key)
model.load_state_dict(saved_states)
model.eval()

tokenizer = PreTrainedTokenizerFast.from_pretrained("./")
```
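
As a quick sanity check that the checkpoint loaded correctly, you can count the parameters (this assumes `Transformer` is a standard `torch.nn.Module`; the exact total depends on the full architecture):

```python
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # expected to be roughly 200M
```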
 
### Example: Fibonacci Generation

Prompt the model to complete a Fibonacci function:

```python
tokens = tokenizer('''def fibonacci(n: int):''')['input_ids']
current = tokenizer.decode(tokens)
print(current, end="")

# Near-zero temperature makes sampling effectively greedy.
temperature = 1e-4

for _ in range(128):
    tok = torch.tensor(tokens).reshape(1, -1)
    logits = model(tok)
    # Sample the next token from the temperature-scaled distribution.
    nxt = torch.multinomial(
        torch.softmax(logits[:, -1].float() / temperature, dim=-1).squeeze(),
        num_samples=1
    ).item()
    tokens += [nxt]
    # Print only the newly generated text.
    print(tokenizer.decode(tokens).replace(current, "", 1), end="")
    current = tokenizer.decode(tokens)
```

**Output:**
```
<s> def fibonacci(n: int):
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)


def fibonacci_recursive(n: int):
    if n < 2:
        return n
    return fibonacci_recursive(n - 1) + fibonacci_recursive(n - 2)


def fibonacci_iterative(n: int):
    if n < 2:
        return n
    return fibonacci_iterative
```
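
For convenience, the sampling loop above can be wrapped in a small helper to try the prompts in the next section. `generate` below is a utility written for this card, not a function shipped with the repository:

```python
def generate(prompt: str, temperature: float = 0.8, max_new_tokens: int = 128) -> str:
    """Sample a continuation of `prompt` using the same loop as above."""
    tokens = tokenizer(prompt)['input_ids']
    for _ in range(max_new_tokens):
        logits = model(torch.tensor(tokens).reshape(1, -1))
        probs = torch.softmax(logits[:, -1].float() / temperature, dim=-1).squeeze()
        tokens.append(torch.multinomial(probs, num_samples=1).item())
    return tokenizer.decode(tokens)


print(generate("Digits of pi:", temperature=1e-4))
```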
 
### Additional Examples

- **Digits of π** (`temperature=0.0001`):
  Correctly recites the first 20 digits: `3.14159265358979323846...`

- **“The purpose of life”** (`temperature=0.8`):
  Produces a coherent, skill-focused philosophical reflection:

  ```
  <s> The purpose of life is to build up the body’s strength, endurance, and energy reserves through the accumulation of acquired skills, and to get rid of worn or damaged parts of the body. All of this depends on day’s activities and deeds. The process of building up the body and taking on new challenges, such as accumulating health, will require the use of skills and abilities.
  The main purpose of building up skills and abilities in life is to make new people capable of doing the things that they need to do. This process requires you to develop skills that are applicable to everyday life. Skills can either be formal, or in the
  ```

Additional examples can be found in the Jupyter notebook included in the repository.

> Tip: Increase `temperature` to reduce repetition and encourage creativity. A temperature of 0.8 is recommended for general use, while 0.0001 works better for factual recall.
 
## Bias, Risks, and Limitations

Despite strong performance in many areas, this 200M-parameter model is not infallible. For instance, when prompted with *“The radius of the Earth”*, it outputs:

```
<s> The radius of the Earth is a measure of almost exactly 375,000 miles.
Scientists have long wondered what the planet was like long ago. Because of how old the Earth is—that is, the oldest part of it—we know that the Earth’s radius is about 670,000 miles. ...
```

This is off by roughly two orders of magnitude (actual mean radius: ~3,959 miles). Users should **verify all factual claims** and avoid relying on the model for high-stakes decisions.

## Citation