Text Generation
Safetensors
English
Chinese
xTimeCrystal committed on
Commit 770709d · verified · 1 Parent(s): 10a51de

Update README.md

Files changed (1)
1. README.md +49 -90
README.md CHANGED
---
license: apache-2.0
datasets:
- openbmb/Ultra-FineWeb
language:
- en
- zh
pipeline_tag: text-generation
---
 
 
# Model Card for MiniModel-200M-Base

This model card provides an overview of **MiniModel-200M-Base**, a highly efficient 200M-parameter decoder-only transformer trained with state-of-the-art techniques for maximum data and compute efficiency.

## Model Details

### Model Description

- **Developed by:** xTimeCrystal
- **Model type:** Softmax self-attention decoder-only transformer
- **Languages:** English, Chinese, Python
- **License:** Apache 2.0

This model leverages cutting-edge training techniques to achieve strong performance with only 10B tokens of training data, trained in just one day on a single RTX 5090 GPU. As demonstrated below, it handles diverse tasks, from factual recall to coherent article generation, despite its small size.
 
Key innovations include:

- **Adaptive Muon optimizer**: Based on the Muon optimizer, it delivers roughly 2.1x the data efficiency of AdamW. Momentum buffers are stored in bfloat16, further reducing VRAM usage.
- **Aggressive data filtering**: A curated selection of high-quality educational content enhances performance in resource-constrained settings.
- **Efficient data bin-packing**: To minimize padding waste (originally >70% of processed tokens), sequences were concatenated via a simple bin-packing algorithm to reach near-full 2048-token lengths, reducing padding to <5% (see the packing sketch after this list).
- **Float8 pretraining**: Training used bfloat16 master weights, fp8 (e4m3) casting with bfloat16 accumulation, and full bfloat16 backward passes. The attention mechanism was kept in bfloat16 to avoid loss degradation. This setup matches full bfloat16 performance while cutting VRAM usage by ~30% and boosting throughput by ~20% (see the casting sketch after this list).
- **ReLU² activation**: This ultra-sparse activation outperforms SwiGLU ([1](https://arxiv.org/abs/2109.08668v2), [2](https://arxiv.org/abs/2402.03804)) while requiring only two matrix multiplications, marginally improving VRAM usage (sketched after this list).
- **Full attention**: All layers use standard softmax attention (no sliding window or grouped-query attention), preserving capacity in a small model.
- **QK Norm without scalars**: Removing the learnable scalars improved training stability by preventing loss spikes and excessive attention activations.
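
To make the list above concrete, here are minimal sketches of three of these components. They are illustrative reimplementations written for this card under stated assumptions, not the code in the repository's `model.py` or training scripts.

The ReLU² feed-forward block and scalar-free QK normalization can be written as follows (dimension names follow the `config` shown later in this card):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReLU2FFN(nn.Module):
    """Feed-forward block with the ReLU^2 activation: only two matmuls."""

    def __init__(self, input_dims: int, hidden_dims: int):
        super().__init__()
        self.up = nn.Linear(input_dims, hidden_dims, bias=False)
        self.down = nn.Linear(hidden_dims, input_dims, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.relu(self.up(x)) ** 2)


def qk_norm(q: torch.Tensor, k: torch.Tensor, eps: float = 1e-6):
    """Scalar-free QK norm: unit-normalize queries and keys along the head
    dimension, with no learnable gain, before computing attention scores."""
    q = q / (q.norm(dim=-1, keepdim=True) + eps)
    k = k / (k.norm(dim=-1, keepdim=True) + eps)
    return q, k
```

The bin-packing step can be approximated with a greedy first-fit pass over the tokenized documents (each assumed to already begin with the `<s>` token); the exact packing code used for training is not shown in this card:

```python
def pack_sequences(docs: list[list[int]], max_len: int = 2048) -> list[list[int]]:
    """Concatenate tokenized documents into bins of at most `max_len` tokens
    so that almost no padding is needed afterwards."""
    bins: list[list[int]] = []
    for doc in sorted(docs, key=len, reverse=True):
        doc = doc[:max_len]                   # hard truncation at the context length
        for b in bins:
            if len(b) + len(doc) <= max_len:  # first bin with enough room
                b.extend(doc)
                break
        else:
            bins.append(list(doc))            # otherwise start a new bin
    return bins
```

Finally, the fp8 (e4m3) casting can be illustrated with a per-tensor scaled round trip using `torch.float8_e4m3fn` (available in recent PyTorch releases); the actual training pipeline additionally relies on fp8-capable matmul kernels, which are omitted here:

```python
def to_fp8_e4m3(x: torch.Tensor):
    """Scaled cast to fp8 e4m3; returns the fp8 tensor and its dequant scale."""
    amax = x.abs().max().clamp(min=1e-12)
    scale = 448.0 / amax                      # 448 is the largest e4m3 normal value
    return (x.float() * scale).to(torch.float8_e4m3fn), scale


def from_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Dequantize back to bfloat16."""
    return (x_fp8.to(torch.float32) / scale).to(torch.bfloat16)
```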
 
These optimizations enabled **lossless training for 110k steps** with a massive batch size of 64 × 2048 tokens **without gradient accumulation**, while staying under 30 GB of VRAM and remaining completely spike-free:

![Training loss curve](https://cdn-uploads.huggingface.co/production/uploads/66a767dcbe4c3c2683495a8b/L7AuCdoEGrEVprBKIbks2.png)
 
## Intended Uses

This model is designed for efficient inference and experimentation in low-resource environments. It is suitable for educational use, prototyping, and applications where model size and speed are critical. Users include researchers, developers, and hobbyists working with constrained hardware.

## Getting Started

Download all files from the repository into a single folder and run the notebook cells.
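
If you prefer to fetch the files programmatically, a `huggingface_hub` snapshot download along these lines should work (the repo id is inferred from the model name and may need adjusting):

```python
from huggingface_hub import snapshot_download

# Download model.py, model.safetensors, the tokenizer files and the notebook
# into the current directory.
snapshot_download(repo_id="xTimeCrystal/MiniModel-200M-Base", local_dir="./")
```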
 
### Loading the Model

```python
import torch
from safetensors import safe_open

from model import Transformer as Model
from transformers import PreTrainedTokenizerFast

# Model hyperparameters (must match the pretrained checkpoint).
config = {
    'num_heads': 12,
    'vocab_size': 32768,
    'input_dims': 768,
    'hidden_dims': 3072,
}

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.set_default_device(device)

# Instantiate the model in bfloat16 and load the safetensors checkpoint.
model = Model(**config)
model.bfloat16()

saved_states = {}
with safe_open("./model.safetensors", framework="pt", device=device) as f:
    for key in f.keys():
        saved_states[key] = f.get_tensor(key)
model.load_state_dict(saved_states)
model.eval()

tokenizer = PreTrainedTokenizerFast.from_pretrained("./")
```
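
As a quick sanity check that the checkpoint loaded correctly, you can count the parameters (this assumes `Transformer` is a standard `torch.nn.Module`; the exact total depends on the full architecture):

```python
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # expected to be roughly 200M
```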
 
### Example: Fibonacci Generation

Prompt the model to complete a Fibonacci function:

```python
tokens = tokenizer('''def fibonacci(n: int):''')['input_ids']
current = tokenizer.decode(tokens)
print(current, end="")

# Near-zero temperature makes sampling effectively greedy.
temperature = 1e-4

for _ in range(128):
    tok = torch.tensor(tokens).reshape(1, -1)
    logits = model(tok)
    # Sample the next token from the temperature-scaled distribution.
    nxt = torch.multinomial(
        torch.softmax(logits[:, -1].float() / temperature, dim=-1).squeeze(),
        num_samples=1
    ).item()
    tokens += [nxt]
    # Print only the newly generated text.
    print(tokenizer.decode(tokens).replace(current, "", 1), end="")
    current = tokenizer.decode(tokens)
```

**Output:**
```
<s> def fibonacci(n: int):
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)


def fibonacci_recursive(n: int):
    if n < 2:
        return n
    return fibonacci_recursive(n - 1) + fibonacci_recursive(n - 2)


def fibonacci_iterative(n: int):
    if n < 2:
        return n
    return fibonacci_iterative
```
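
For convenience, the sampling loop above can be wrapped in a small helper to try the prompts in the next section. `generate` below is a utility written for this card, not a function shipped with the repository:

```python
def generate(prompt: str, temperature: float = 0.8, max_new_tokens: int = 128) -> str:
    """Sample a continuation of `prompt` using the same loop as above."""
    tokens = tokenizer(prompt)['input_ids']
    for _ in range(max_new_tokens):
        logits = model(torch.tensor(tokens).reshape(1, -1))
        probs = torch.softmax(logits[:, -1].float() / temperature, dim=-1).squeeze()
        tokens.append(torch.multinomial(probs, num_samples=1).item())
    return tokenizer.decode(tokens)


print(generate("Digits of pi:", temperature=1e-4))
```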
 
### Additional Examples

- **Digits of π** (`temperature=0.0001`):
  Correctly recites the first 20 digits: `3.14159265358979323846...`

- **“The purpose of life”** (`temperature=0.8`):
  Produces a coherent, skill-focused philosophical reflection:

  ```
  <s> The purpose of life is to build up the body’s strength, endurance, and energy reserves through the accumulation of acquired skills, and to get rid of worn or damaged parts of the body. All of this depends on day’s activities and deeds. The process of building up the body and taking on new challenges, such as accumulating health, will require the use of skills and abilities.
  The main purpose of building up skills and abilities in life is to make new people capable of doing the things that they need to do. This process requires you to develop skills that are applicable to everyday life. Skills can either be formal, or in the
  ```

Additional examples can be found in the Jupyter notebook included in the repository.

> Tip: Increase `temperature` to reduce repetition and encourage creativity. A temperature of 0.8 is recommended for general use, while 0.0001 works better for factual recall.
 
## Bias, Risks, and Limitations

Despite strong performance in many areas, this 200M-parameter model is not infallible. For instance, when prompted with *“The radius of the Earth”*, it outputs:

```
<s> The radius of the Earth is a measure of almost exactly 375,000 miles.
Scientists have long wondered what the planet was like long ago. Because of how old the Earth is—that is, the oldest part of it—we know that the Earth’s radius is about 670,000 miles. ...
```

This is off by roughly two orders of magnitude (actual mean radius: ~3,959 miles). Users should **verify all factual claims** and avoid relying on the model for high-stakes decisions.

## Citation