Viharikvs committed
Commit f32f1dc · verified · 1 Parent(s): 361ee09

Model card updated after epoch 0

Files changed (1)
  1. README.md +97 -3
README.md CHANGED
@@ -1,3 +1,97 @@
- ---
- license: apache-2.0
- ---

---
base_model: t5-small
license: apache-2.0
datasets:
- open-web-math/open-web-math
tags:
- text-generation
- causal-lm
- mamba
- hrm
- pytorch
language:
- en
pipeline_tag: text-generation
---

# CMBA-768M-OpenWebMath

A 768M-parameter Hierarchical Recurrent Memory (HRM) language model trained on high-quality math web text from OpenWebMath. The model uses **Mamba2 state-space models** instead of traditional attention, enabling efficient long-range sequence modeling.

## Model Architecture

**CMBA** (Causal Mamba-based Architecture) implements a hierarchical processing structure, sketched in code after the list below:

- **Hierarchical Design**: Dual-level processing with H-layers (high-level abstraction) and L-layers (low-level specialists)
- **Mamba2 Mixers**: State-space mixers replace attention, reducing sequence-mixing cost from O(n²) to O(n)
- **Adaptive Computation**: A halting mechanism allows variable compute per token (ACT-style pondering)
- **Parameters**: ~768M total
- **Context Length**: 1024 tokens

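The actual layer definitions ship with the model (see the `hrm_text1_modeling` import under Usage); the snippet below is only a minimal, self-contained sketch of the dual-stack-plus-halting idea. Everything in it is illustrative: `ToyBlock` and `ToyHRM` are hypothetical names, and a plain linear layer stands in for the real Mamba2 mixer.

```python
import torch
import torch.nn as nn


class ToyBlock(nn.Module):
    """Stand-in for one residual block; the real model uses a Mamba2 mixer here."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mix = nn.Linear(d_model, d_model)  # placeholder for the state-space mixer

    def forward(self, x):
        return x + self.mix(self.norm(x))  # pre-norm residual update


class ToyHRM(nn.Module):
    """Low-level stack, then repeated high-level refinement with ACT-style halting."""
    def __init__(self, d_model=768, n_l=12, n_h=12, max_halt_steps=8):
        super().__init__()
        self.l_layers = nn.ModuleList([ToyBlock(d_model) for _ in range(n_l)])
        self.h_layers = nn.ModuleList([ToyBlock(d_model) for _ in range(n_h)])
        self.halt_head = nn.Linear(d_model, 1)  # per-token halting probability
        self.max_halt_steps = max_halt_steps

    def forward(self, x):  # x: (batch, seq, d_model)
        for blk in self.l_layers:  # low-level specialists run once over the sequence
            x = blk(x)
        halted = torch.zeros(x.shape[:2], dtype=torch.bool, device=x.device)
        for _, blk in zip(range(self.max_halt_steps), self.h_layers):
            x = torch.where(halted.unsqueeze(-1), x, blk(x))       # refine unhalted tokens
            p_halt = torch.sigmoid(self.halt_head(x)).squeeze(-1)  # (batch, seq)
            halted |= p_halt > 0.5  # a token stops pondering once it is confident
        return x


hidden = torch.randn(2, 16, 768)
print(ToyHRM()(hidden).shape)  # torch.Size([2, 16, 768])
```

The real model differs in the mixer (Mamba2 state-space updates rather than a linear map) and in how the halting probabilities are trained, but the control flow the sketch conveys is the same: one low-level pass, then repeated high-level refinement until tokens halt or the step budget (8 here) runs out.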

### Configuration

```python
# Hyperparameters for this run (key names here are descriptive, not the repo's exact config fields)
config = {
    # Model dimensions
    "d_model": 768,
    "n_heads": 12,          # kept for compatibility; not used by the Mamba mixers
    "d_ff": 3072,
    "H_layers": 12,         # high-level hierarchy
    "L_layers": 12,         # low-level processing

    # Mamba2 settings
    "d_state": 128,
    "expand": 2,
    "headdim": 64,
    "d_conv": 4,
    "ngroups": 1,

    # Training
    "max_halt_steps": 8,
    "block_size": 1024,
    "batch_size": 32,       # effective batch size
    "learning_rate": 2e-4,  # decayed to 1e-6 over training
    "weight_decay": 0.1,
}
```

## Training Data

- **Dataset**: [open-web-math/open-web-math](https://huggingface.co/datasets/open-web-math/open-web-math)
- **Tokenizer**: `t5-small` (T5 SentencePiece; loading and chunking sketched below)
- **Vocab Size**: 32100

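The preprocessing script is not included in this card, so the snippet below is only an illustration of the stated setup: the `t5-small` tokenizer applied to OpenWebMath documents and split into 1024-token blocks. The streaming flag and the `text` field name are assumptions about how the dataset would be read, not a description of the actual training pipeline.

```python
from itertools import islice

from datasets import load_dataset
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
print(tokenizer.vocab_size)  # 32100, matching the vocab size listed above

# Stream a few documents and split each one into 1024-token training blocks.
# The "text" field name is an assumption about the dataset schema.
ds = load_dataset("open-web-math/open-web-math", split="train", streaming=True)
block_size = 1024
for doc in islice(ds, 4):
    ids = tokenizer(doc["text"], truncation=False)["input_ids"]
    blocks = [ids[i:i + block_size] for i in range(0, len(ids), block_size)]
    print(f"{len(ids)} tokens -> {len(blocks)} blocks")
```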
## Latest Performance (Epoch 0)

- **Validation Loss**: `8.5339`
- **Validation Perplexity**: `5084.00` (the exponential of the loss; see the check below)

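A one-line sanity check of that relationship, using only the numbers above:

```python
import math

print(math.exp(8.5339))  # ≈ 5084, the reported validation perplexity
```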
## Usage

```python
from transformers import T5Tokenizer

# HRMText1 is defined in the custom modeling file `hrm_text1_modeling.py`
# (it is not part of the transformers library itself).
from hrm_text1_modeling import HRMText1

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = HRMText1.from_pretrained("Viharikvs/CMBA-768M-OpenWebMath")

# Generate text
input_ids = tokenizer("Once upon a time", return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_length=100)
print(tokenizer.decode(outputs[0]))
```

## Citation

If you use this model, please cite:

```bibtex
@misc{cmba-768m-openwebmath,
  author    = {Vihari},
  title     = {CMBA-768M-OpenWebMath: Hierarchical Mamba-based Language Model},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Viharikvs/CMBA-768M-OpenWebMath}
}
```

## License

Apache 2.0