---
base_model: t5-small
license: apache-2.0
datasets:
- open-web-math/open-web-math
tags:
- text-generation
- causal-lm
- mamba
- hrm
- pytorch
language:
- en
pipeline_tag: text-generation
---

# CMBA-768M-OpenWebMath

A 768M parameter Hierarchical Recurrent Memory (HRM) language model trained on high-quality math web text from OpenWebMath. This model uses **Mamba2 state-space models** instead of traditional attention mechanisms, enabling efficient long-range sequence modeling.

## Model Architecture

**CMBA** (Causal Mamba-based Architecture) implements a hierarchical processing structure:

- **Hierarchical Design**: Dual-level processing with H-layers (high-level abstraction) and L-layers (low-level specialists)
- **Mamba2 Mixers**: State-space models replace attention, reducing sequence-mixing complexity from O(n²) to O(n)
- **Adaptive Computation**: A halting mechanism allows variable compute per token (ACT-style pondering; see the sketch after the configuration block)
- **Parameters**: ~768M total
- **Context Length**: 1024 tokens

### Configuration

```yaml
Model Dimensions:
- d_model: 768
- n_heads: 12 (for compatibility, not used in Mamba)
- d_ff: 3072
- H_layers: 12 (high-level hierarchy)
- L_layers: 12 (low-level processing)

Mamba2 Settings:
- d_state: 128
- expand: 2
- headdim: 64
- d_conv: 4
- ngroups: 1

Training:
- Max halt steps: 8
- Block size: 1024
- Batch size: 32 (effective)
- Learning rate: 0.0002 → 1e-06
- Weight decay: 0.1
```
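
The modeling code itself ships separately (see the Usage section), so the sketch below is only an illustration of how dual-level processing with ACT-style halting can be wired up. Every name in it (`Mixer`, `HRMBlock`, `halt_head`) is a placeholder assumption rather than this repository's actual API; only `d_model = 768` and the limit of 8 halt steps come from the configuration above, and a toy mixer stands in for the real Mamba2 blocks.

```python
import torch
import torch.nn as nn

class Mixer(nn.Module):
    """Stand-in for a Mamba2 block (the real model mixes the sequence in O(n))."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (batch, seq, d_model)
        return x + torch.tanh(self.proj(x))    # residual update

class HRMBlock(nn.Module):
    """High-level (H) pass followed by repeated low-level (L) refinement,
    with a learned per-token halting probability deciding when to stop."""
    def __init__(self, d_model: int = 768, max_halt_steps: int = 8):
        super().__init__()
        self.h_layer = Mixer(d_model)            # high-level abstraction
        self.l_layer = Mixer(d_model)            # low-level specialist
        self.halt_head = nn.Linear(d_model, 1)   # per-token halting logit
        self.max_halt_steps = max_halt_steps

    def forward(self, x):
        state = self.h_layer(x)
        cum_halt = torch.zeros(*x.shape[:2], 1, device=x.device)  # accumulated halt mass
        output = torch.zeros_like(x)
        for _ in range(self.max_halt_steps):
            state = self.l_layer(state)
            p = torch.sigmoid(self.halt_head(state))     # probability of halting now
            weight = torch.minimum(p, 1.0 - cum_halt)    # never exceed total mass of 1
            output = output + weight * state             # ACT-style weighted mixture
            cum_halt = cum_halt + weight
            if bool((cum_halt > 0.999).all()):           # every token has halted
                break
        return output

block = HRMBlock()
print(block(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])
```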

## Training Data

- **Dataset**: [open-web-math/open-web-math](https://huggingface.co/datasets/open-web-math/open-web-math)
- **Tokenizer**: `t5-small` (T5 SentencePiece)
- **Vocab Size**: 32100
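
The preprocessing pipeline is not part of this card, so the following is only a sketch of one standard way to stream OpenWebMath and pack it into 1024-token blocks with the same tokenizer. It assumes the dataset's `text` field and default `datasets`/`transformers` behavior rather than reproducing the actual training script.

```python
from datasets import load_dataset
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")   # T5 SentencePiece, vocab size 32100
stream = load_dataset("open-web-math/open-web-math", split="train", streaming=True)

BLOCK_SIZE = 1024        # matches the model's context length
token_buffer = []

for example in stream:
    # Tokenize each math web document and append it to one long token stream.
    token_buffer.extend(tokenizer(example["text"]).input_ids)
    if len(token_buffer) >= BLOCK_SIZE:
        block = token_buffer[:BLOCK_SIZE]           # one fixed-length training sequence
        token_buffer = token_buffer[BLOCK_SIZE:]    # remainder carries over to the next block
        print(tokenizer.decode(block[:20]))         # peek at the start of the first block
        break                                       # demo only: stop after one block
```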

## Latest Performance (Epoch 0)

- **Validation Loss**: `8.5339`
- **Validation Perplexity**: `5084.00`
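
Perplexity is the exponential of the mean validation cross-entropy loss, so the two numbers above are consistent:

```python
import math
print(math.exp(8.5339))  # ≈ 5084, the reported validation perplexity
```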

## Usage

```python
from transformers import T5Tokenizer
from hrm_text1_modeling import HRMText1

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = HRMText1.from_pretrained("Viharikvs/CMBA-768M-OpenWebMath")

# Generate text from a prompt
input_ids = tokenizer("Once upon a time", return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
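
Note that `HRMText1` is a custom architecture rather than a class from the `transformers` library, so the `hrm_text1_modeling` module imported above must be importable in your environment (for example, placed next to your script) before the snippet will run.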

## Citation

If you use this model, please cite:

```bibtex
@misc{cmba-768m-openwebmath,
  author = {Vihari},
  title = {CMBA-768M-OpenWebMath: Hierarchical Mamba-based Language Model},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Viharikvs/CMBA-768M-OpenWebMath}
}
```

## License

Apache 2.0