---
language: en
license: apache-2.0
library_name: pytorch
tags:
- transformer
- gpt
- language-model
- from-scratch
- educational
---
# Model Card for LumenBase
A 128M parameter GPT-style transformer built from scratch for educational purposes, featuring Grouped Multi-Query Attention (GQA), SwiGLU, RMSNorm, and RoPE.
## Model Details
### Model Description
LumenBase is a decoder-only transformer language model implementing modern architectural optimizations:
- **Architecture:** 12-layer transformer with GQA (12 query heads, 4 KV heads), SwiGLU activation, RMSNorm, and RoPE (the key components are sketched in PyTorch below)
- **Parameters:** 128M (768 hidden size, 3072 FFN intermediate size, 2048 context length)
- **Training:** Mixed precision (FP16/BF16) with a custom BPE tokenizer (32K vocab)
- **Developed by:** Hariom Jangra
- **Model type:** Decoder-only Transformer
- **Language:** English
- **License:** MIT
- **Repository:** https://github.com/HariomJangra/project-lumen
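The core components above can be illustrated in a few lines of PyTorch. This is a minimal sketch for readers following along, not the repository's implementation; RoPE and the attention block are omitted for brevity, and the default dimensions simply mirror the configuration listed above.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square norm: rescale by the RMS of the features, no mean-centering, no bias."""
    def __init__(self, dim: int = 768, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: SiLU-gated up-projection followed by a down-projection."""
    def __init__(self, hidden_size: int = 768, intermediate_size: int = 3072):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```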
## Uses
**Direct Use:**
- Text generation and completion
- Educational resource for understanding transformer architecture
- Research baseline for language models
- Foundation for fine-tuning on specific tasks
**Downstream Use:**
- Instruction tuning
- Chat applications
- Domain-specific fine-tuning
**Out-of-Scope:**
- Production deployments
- Safety-critical applications
- Applications requiring factual accuracy without verification
- This is an educational model; use established frameworks for production
## Limitations
**Technical:**
- Limited size (128M parameters), well below state-of-the-art performance
- 2048 token context window
- May generate incoherent text for complex prompts
**Bias & Safety:**
- May perpetuate training data biases
- Not evaluated for fairness across demographics
- Can generate inappropriate content
- Should not be relied upon for factual information
**Recommendations:** This is an educational model. Verify all outputs, implement content filtering for applications, and use production-ready models for commercial use.
## Training
**Data:** Custom datasets tokenized with BPE (32K vocab)
**Hyperparameters** (see the sketch after this list):
- Optimizer: AdamW (lr=3e-4, weight_decay=0.1)
- Batch: 12 × 4 gradient accumulation = 48 effective
- Sequence length: 2048 tokens
- Scheduler: Linear warmup + Cosine annealing
- Precision: Mixed (FP16/BF16/FP32)
- Dropout: 0.1 (training), 0.0 (inference)
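Below is a minimal sketch of how these hyperparameters fit together: AdamW, linear warmup followed by cosine annealing, gradient accumulation, and an FP16 autocast path. It is not the repository's training script; `model`, `train_loader`, `warmup_steps`, and `total_steps` are placeholders assumed for the example.
```python
import math
import torch
import torch.nn.functional as F
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

accum_steps = 4                              # 12 x 4 gradient accumulation = 48 effective batch
warmup_steps, total_steps = 1_000, 100_000   # illustrative values, not taken from the repo

optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

def lr_lambda(step: int) -> float:
    # Linear warmup, then cosine annealing towards zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)
scaler = torch.cuda.amp.GradScaler()         # FP16 path; pure BF16 training can drop the scaler

for step, (input_ids, labels) in enumerate(train_loader):
    with torch.autocast("cuda", dtype=torch.float16):
        logits = model(input_ids)            # assumed forward signature: (B, T) -> (B, T, vocab)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1)) / accum_steps
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)               # unscale + optimizer step every 4 micro-batches
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
        scheduler.step()
```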

## Evaluation
Evaluated on standard NLP benchmarks:
| Benchmark | Accuracy | Correct/Total |
|-----------|----------|---------------|
| **ARC-Easy** | 39.48% | 938/2,376 |
| **ARC-Challenge** | 23.55% | 276/1,172 |
| **HellaSwag** | 32.62% | 334/1,024 |
**Summary:** Baseline performance consistent with a 128M educational model. Results show capability on easier tasks with room for improvement on complex reasoning.
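The evaluation harness itself is not documented here. One common way to score a base model on multiple-choice benchmarks such as ARC and HellaSwag is to pick the answer choice with the highest length-normalized log-likelihood under the model; the sketch below shows that approach, assuming the model returns next-token logits and the tokenizer matches the usage example further down. The repository may use a different method.
```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def choice_logprob(model, tokenizer, context: str, choice: str) -> float:
    """Average per-token log-probability of `choice` as a continuation of `context`."""
    ctx_len = len(tokenizer.encode(context).ids)
    ids = tokenizer.encode(context + choice).ids
    logits = model(torch.tensor([ids]))          # assumed output shape: (1, seq_len, vocab)
    log_probs = F.log_softmax(logits, dim=-1)
    # The token at position t is predicted by the logits at t-1; score only the choice tokens.
    score = sum(log_probs[0, t - 1, ids[t]].item() for t in range(ctx_len, len(ids)))
    return score / max(1, len(ids) - ctx_len)

def predict(model, tokenizer, question: str, choices: list[str]) -> int:
    """Return the index of the highest-scoring answer choice."""
    scores = [choice_logprob(model, tokenizer, question, " " + c) for c in choices]
    return max(range(len(choices)), key=scores.__getitem__)
```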
## Technical Specifications
**Architecture:** Decoder-only Transformer
- 12 layers, 768 hidden size, 12 query heads sharing 4 KV heads (GQA; sketched below)
- SwiGLU FFN (3072 intermediate), RMSNorm, RoPE
- 32K vocab, 2048 max sequence length
- Weight tying between embedding and output layers
**Implementation:** Custom PyTorch implementation from scratch
**Software:** Python 3.13, PyTorch, NumPy, Tokenizers, tqdm, matplotlib
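To make the GQA head layout concrete: with 12 query heads and 4 KV heads, each key/value head is shared by a group of 3 query heads. The stand-alone sketch below shows that expansion with dummy tensors; it is not the repository's attention code.
```python
import torch
import torch.nn.functional as F

batch, seq_len = 2, 16
n_heads, n_kv_heads, head_dim = 12, 4, 64
n_kv_groups = n_heads // n_kv_heads      # 3 query heads share each KV head

q = torch.randn(batch, n_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Repeat each KV head for its query group: (B, 4, T, D) -> (B, 12, T, D)
k = k.repeat_interleave(n_kv_groups, dim=1)
v = v.repeat_interleave(n_kv_groups, dim=1)

# Standard causal attention now lines up head-for-head with the queries.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)   # torch.Size([2, 12, 16, 64])
```
The practical benefit is that the KV cache stores only 4 heads instead of 12, cutting its memory roughly to a third while keeping 12-way query attention.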
## How to Use
```python
import torch
from safetensors.torch import load_file
from ModelArchitecture import Transformer, ModelConfig, generate
from tokenizers import Tokenizer

# Load configuration and model
config = ModelConfig(vocab_size=32000, hidden_size=768, n_heads=12,
                     n_kv_heads=4, n_kv_groups=3, head_dim=64, n_layers=12,
                     intermediate_size=3072, max_position_embeddings=2048,
                     dropout=0.0, pre_norm=True, tie_weights=True)
model = Transformer(config)
model.load_state_dict(load_file('model.safetensors'))  # .safetensors weights require safetensors, not torch.load
model.eval()

# Generate text
tokenizer = Tokenizer.from_file('tokenizer.json')
prompt = "Once upon a time"
input_ids = torch.tensor([tokenizer.encode(prompt).ids])
output = generate(model, input_ids, max_new_tokens=100,
                  temperature=0.8, top_k=50, top_p=0.9)
print(tokenizer.decode(output[0].tolist()))
```
## Citation
```bibtex
@misc{lumenbase2025,
  author       = {Jangra, Hariom},
  title        = {LumenBase: A 128M Parameter Language Model Built from Scratch},
  year         = {2025},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/HariomJangra/project-lumen}}
}
```
## Contact
**Author:** Hariom Jangra ([@HariomJangra](https://github.com/HariomJangra))
For questions or feedback, please open an issue on the [GitHub repository](https://github.com/HariomJangra/project-lumen).