---
language: en
license: apache-2.0
library_name: pytorch
tags:
- transformer
- gpt
- language-model
- from-scratch
- educational
---

# Model Card for LumenBase

A 128M parameter GPT-style transformer built from scratch for educational purposes, featuring Grouped-Query Attention (GQA), SwiGLU, RMSNorm, and RoPE.

## Model Details

### Model Description

LumenBase is a decoder-only transformer language model implementing modern architectural optimizations:
- **Architecture**: 12-layer transformer with GQA (12 query heads, 4 KV heads), SwiGLU activation, RMSNorm, and RoPE
- **Parameters**: 128M (768 hidden size, 3072 FFN, 2048 context length)
- **Training**: Mixed-precision (FP16/BF16) training with a custom BPE tokenizer (32K vocab)

- **Developed by:** Hariom Jangra
- **Model type:** Decoder-only Transformer
- **Language:** English
- **License:** MIT
- **Repository:** https://github.com/HariomJangra/project-lumen
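
To make the GQA layout concrete: each of the 4 KV heads is shared by a group of 3 query heads (12 ÷ 4). The snippet below is a minimal, self-contained illustration of that sharing pattern using the card's head counts; it is a sketch, not the repository's actual attention module.

```python
import torch

# Minimal grouped-query attention sketch (not the repo's implementation).
# Dimensions follow the card: 12 query heads, 4 KV heads, head_dim 64.
n_heads, n_kv_heads, head_dim = 12, 4, 64
batch, seq = 2, 16

q = torch.randn(batch, n_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Each KV head serves n_heads // n_kv_heads = 3 query heads.
group_size = n_heads // n_kv_heads
k = k.repeat_interleave(group_size, dim=1)   # -> (batch, 12, seq, head_dim)
v = v.repeat_interleave(group_size, dim=1)

# Standard scaled dot-product attention with a causal mask.
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 12, 16, 64])
```

Sharing KV heads this way shrinks the KV cache (4 heads instead of 12) while keeping full per-head query resolution.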

## Uses

**Direct Use:**
- Text generation and completion
- Educational resource for understanding transformer architecture
- Research baseline for language models
- Foundation for fine-tuning on specific tasks

**Downstream Use:**
- Instruction tuning
- Chat applications
- Domain-specific fine-tuning

**Out-of-Scope:**
- Production deployments
- Safety-critical applications
- Applications requiring factual accuracy without verification
- This is an educational model - use established frameworks for production

## Limitations

**Technical:**
- Small size (128M parameters), so performance is well below the state of the art
- 2048 token context window
- May generate incoherent text for complex prompts

**Bias & Safety:**
- May perpetuate training data biases
- Not evaluated for fairness across demographics
- Can generate inappropriate content
- Should not be relied upon for factual information

**Recommendations:** This is an educational model. Verify all outputs, implement content filtering for applications, and use production-ready models for commercial use.

## Training

**Data:** Custom datasets tokenized with BPE (32K vocab)

**Hyperparameters:**
- Optimizer: AdamW (lr=3e-4, weight_decay=0.1)
- Batch size: 12 × 4 gradient-accumulation steps = 48 effective
- Sequence length: 2048 tokens
- Scheduler: Linear warmup + Cosine annealing
- Precision: Mixed (FP16/BF16/FP32)
- Dropout: 0.1 (training), 0.0 (inference)
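
A minimal sketch of the optimizer, schedule, and gradient accumulation described above. The warmup and total step counts are placeholders (they are not stated in this card), and the loop uses a stand-in module rather than the project's training script; mixed-precision autocast is omitted to keep it device-agnostic.

```python
import math
import torch

# Placeholder values mirroring the card where stated; warmup_steps and
# total_steps are assumptions, not taken from the card.
lr, weight_decay = 3e-4, 0.1
micro_batch, accum_steps = 12, 4          # 12 x 4 = 48 effective batch size
warmup_steps, total_steps = 1_000, 50_000

model = torch.nn.Linear(768, 768)         # stand-in for the Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)

def lr_lambda(step: int) -> float:
    """Linear warmup to the peak lr, then cosine annealing toward zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# The optimizer steps only once every accum_steps micro-batches.
for step in range(8):                      # a few dummy iterations
    loss = model(torch.randn(micro_batch, 768)).mean() / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        scheduler.step()
```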

![Training Loss](training_loss_curve.png)

## Evaluation

Evaluated on standard NLP benchmarks:

| Benchmark | Accuracy | Correct/Total |
|-----------|----------|---------------|
| **ARC-Easy** | 39.48% | 938/2,376 |
| **ARC-Challenge** | 23.55% | 276/1,172 |
| **HellaSwag** | 32.62% | 334/1,024 |

**Summary:** Baseline performance consistent with a 128M educational model: clearly above random chance (25%) on ARC-Easy and HellaSwag, but near chance on ARC-Challenge, leaving substantial room for improvement on complex reasoning.
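
The card does not state which evaluation harness produced these numbers. A common approach for decoder-only models on multiple-choice benchmarks like ARC and HellaSwag is log-likelihood scoring: compute the model's log-probability of each candidate completion given the question and pick the highest. The sketch below illustrates that idea with a toy stand-in model; it is not the project's evaluation code, and real harnesses typically add length normalization.

```python
import torch
import torch.nn.functional as F

def option_logprob(model, context_ids: torch.Tensor, option_ids: torch.Tensor) -> float:
    """Sum of token log-probabilities of `option_ids` continuing `context_ids`."""
    ids = torch.cat([context_ids, option_ids]).unsqueeze(0)   # (1, ctx + opt)
    with torch.no_grad():
        logits = model(ids)                                   # (1, T, vocab)
    opt_len = option_ids.numel()
    pred = logits[0, -opt_len - 1:-1]                         # logits predicting the option tokens
    logp = F.log_softmax(pred, dim=-1)
    return logp.gather(1, option_ids.unsqueeze(1)).sum().item()

# Toy stand-in language model so the sketch runs end-to-end.
class ToyLM(torch.nn.Module):
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)
    def forward(self, ids):
        return self.head(self.emb(ids))

toy = ToyLM()
context = torch.randint(0, 100, (10,))
options = [torch.randint(0, 100, (4,)) for _ in range(4)]
scores = [option_logprob(toy, context, opt) for opt in options]
print("predicted option:", int(torch.tensor(scores).argmax()))
```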

## Technical Specifications

**Architecture:** Decoder-only Transformer
- 12 layers, 768 hidden size, 12 attention heads (4 KV heads)
- SwiGLU FFN (3072 intermediate), RMSNorm, RoPE
- 32K vocab, 2048 max sequence length
- Weight tying between embedding and output layers
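
As a sanity check on the 128M figure, the listed dimensions can be tallied directly, assuming no bias terms and counting the tied embedding once (the repository's exact layout may differ slightly):

```python
# Rough parameter tally from the listed dimensions (biases and any extra
# buffers ignored; the repository's exact count may differ slightly).
vocab, d, layers = 32_000, 768, 12
n_heads, n_kv_heads, head_dim = 12, 4, 64
ffn = 3072

embedding = vocab * d                          # tied with the output head
attn = d * (n_heads * head_dim)                # Q projection
attn += 2 * d * (n_kv_heads * head_dim)        # K and V projections (GQA)
attn += (n_heads * head_dim) * d               # output projection
swiglu = 3 * d * ffn                           # gate, up, and down projections
norms = 2 * d                                  # two RMSNorm scales per layer

total = embedding + layers * (attn + swiglu + norms) + d  # + final RMSNorm
print(f"{total / 1e6:.1f}M parameters")        # ~= 128.4M
```

The tally lands at roughly 128.4M, consistent with the stated model size.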

**Implementation:** Custom PyTorch implementation from scratch

**Software:** Python 3.13, PyTorch, NumPy, Tokenizers, tqdm, matplotlib

## How to Use

```python
import torch
from ModelArchitecture import Transformer, ModelConfig, generate
from safetensors.torch import load_file
from tokenizers import Tokenizer

# Load configuration and model
config = ModelConfig(vocab_size=32000, hidden_size=768, n_heads=12, 
                     n_kv_heads=4, n_kv_groups=3, head_dim=64, n_layers=12,
                     intermediate_size=3072, max_position_embeddings=2048,
                     dropout=0.0, pre_norm=True, tie_weights=True)

model = Transformer(config)
# .safetensors checkpoints are not readable by torch.load; use safetensors instead
state_dict = load_file('model.safetensors')
model.load_state_dict(state_dict)
model.eval()

# Generate text
tokenizer = Tokenizer.from_file('tokenizer.json')
prompt = "Once upon a time"
input_ids = torch.tensor([tokenizer.encode(prompt).ids])

output = generate(model, input_ids, max_new_tokens=100, 
                 temperature=0.8, top_k=50, top_p=0.9)
print(tokenizer.decode(output[0].tolist()))
```

## Citation

```bibtex
@misc{lumenbase2025,
  author = {Jangra, Hariom},
  title = {LumenBase: A 128M Parameter Language Model Built from Scratch},
  year = {2025},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/HariomJangra/project-lumen}}
}
```

## Contact

**Author:** Hariom Jangra ([@HariomJangra](https://github.com/HariomJangra))

For questions or feedback, please open an issue on the [GitHub repository](https://github.com/HariomJangra/project-lumen).