nanochat Model
This is a ChatGPT-like model trained using the nanochat pipeline.
Model Description
nanochat is a minimalist but full-featured GPT-style language model trained from scratch, following the complete pipeline from tokenizer training through pretraining, midtraining, and chat fine-tuning. It demonstrates that a capable conversational model can be trained end to end on a modest budget with modern techniques.
Training Details
Architecture
- Model Type: GPT decoder-only transformer
- Parameters: ~561M (d20 configuration)
- Context Length: 2048 tokens
- Depth: 20 layers
- Vocabulary Size: 65,536 tokens
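For orientation, the d20 configuration can be summarized as a small config object. The class and field names below are illustrative (not nanochat's actual code), and the hidden width is inferred from the parameter count rather than stated in this card:

```python
from dataclasses import dataclass

@dataclass
class GPTConfigSketch:
    """Illustrative summary of the d20 model; not nanochat's actual config class."""
    n_layer: int = 20         # depth (the "d20" in the configuration name)
    block_size: int = 2048    # context length in tokens
    vocab_size: int = 65536   # custom BPE vocabulary
    # Inferred, not stated above: a hidden width of 1280 (20 layers x 64 aspect
    # ratio) together with untied input/output embeddings lands close to the
    # reported ~561M parameters: 2 * 65536 * 1280 + 20 * 12 * 1280**2 ≈ 561M.
    n_embd: int = 1280
```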
Training Infrastructure
- Hardware: 8× NVIDIA A100-SXM4-80GB (634 GB total GPU memory)
- Training Cost: ~$120 (estimated at $14.32/hour)
- Total Training Time: ~8 hours 21 minutes
- Training Tokens: 11.2B tokens
- Tokens:Params Ratio: 20:1
- MFU: 20.82%
- Platform: Linux with CUDA 12.8
- Framework: PyTorch 2.8.0
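The headline numbers above are mutually consistent; a quick back-of-the-envelope check:

```python
# Sanity-check the figures listed above.
params = 561e6            # ~561M parameters
tokens = 11.2e9           # training tokens
hours = 8 + 21 / 60       # 8h 21m wall clock
rate = 14.32              # estimated $/hour for the 8xA100 node

print(round(tokens / params))   # 20     -> the 20:1 tokens:params ratio
print(round(hours * rate, 2))   # 119.57 -> the ~$120 training cost
```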
Training Pipeline
1. Tokenizer Training
- Vocabulary Size: 65,536 tokens
- Special Tokens: 9
- Training Data: 2B characters from FineWeb-Edu
- Compression Ratio: 4.91 bytes/token (higher is better; GPT-2: 4.67, GPT-4: 4.81)
- Training Time: ~1.6 minutes
Tokenizer compression relative to GPT-2 and GPT-4 (positive = the nanochat tokenizer needs fewer tokens):
| Domain | vs GPT-2 | vs GPT-4 |
|---|---|---|
| News | +7.2% | +3.1% |
| Korean | +3.0% | -98.6% |
| Code | +14.4% | -59.5% |
| Math | -3.2% | -16.1% |
| Science | +12.3% | +8.4% |
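Compression here is measured in raw UTF-8 bytes per emitted token, and the relative numbers in the table compare token counts against the baseline tokenizer. A sketch of such a comparison, assuming placeholder encode functions rather than nanochat's actual tokenizer API:

```python
def bytes_per_token(encode, text: str) -> float:
    """Average UTF-8 bytes covered by one token; higher means better compression."""
    return len(text.encode("utf-8")) / len(encode(text))

def relative_gain(encode_ours, encode_base, text: str) -> float:
    """Percent fewer tokens than the baseline (positive = ours compresses better)."""
    ours, base = len(encode_ours(text)), len(encode_base(text))
    return (base - ours) / base * 100

# Hypothetical usage with placeholder encoders (e.g. tiktoken's gpt2/cl100k_base
# encodings versus the custom BPE shipped as tokenizer.pkl):
#   bytes_per_token(nanochat_encode, fineweb_sample)           # ~4.91
#   relative_gain(nanochat_encode, gpt2_encode, news_sample)   # ~+7.2
```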
2. Base Model Training (21,400 iterations)
- Training Loss: 0.8185 bpb (bits per byte)
- Validation Loss: 0.8156 bpb
- CORE Metric: 0.2087
- Batch Size: 524,288 tokens
- Learning Rates: matrix 0.02, embedding 0.2, unembedding 0.004
- Training Time: ~6.6 hours
- Peak Memory: 75.4 GiB
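Bits per byte is the usual token-level cross-entropy normalized by the tokenizer's compression, which makes losses comparable across different vocabularies. Under that reading, the reported figures relate roughly as follows (a sketch using the 4.91 bytes/token ratio from the tokenizer section):

```python
import math

# Total tokens seen during base training.
iters, batch_tokens = 21_400, 524_288
print(iters * batch_tokens / 1e9)      # ~11.22 -> the 11.2B training tokens

# Convert the reported bits/byte back to nats/token cross-entropy.
bpb, bytes_per_token = 0.8185, 4.91
nats_per_token = bpb * bytes_per_token * math.log(2)
print(round(nats_per_token, 3))        # ~2.786 nats/token
```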
Base Model Benchmark Results:
| Task | Score |
|---|---|
| HellaSwag | 0.2559 |
| Winograd | 0.3040 |
| ARC-Easy | 0.5174 |
| ARC-Challenge | 0.1251 |
| LAMBADA | 0.3775 |
| PIQA | 0.3645 |
| CommonsenseQA | 0.1964 |
| SQuAD | 0.2260 |
3. Midtraining (765 iterations)
- Purpose: Domain adaptation
- Validation Loss: 0.3976 bpb (minimum)
- Training Time: ~30 minutes
4. Chat SFT (651 iterations)
- Training Examples: 20,843 conversations
- Epochs: 1
- Training Loss: 1.1034
- Validation Loss: 1.0189
- Training Time: ~24 minutes
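Each SFT example is a conversation rendered into the token stream using the tokenizer's special tokens. The exact special-token names are not listed in this card, so the rendering below is purely illustrative:

```python
# Purely illustrative: the special-token names are placeholders, not necessarily
# any of the 9 special tokens the nanochat tokenizer actually defines.
def render_conversation(turns):
    parts = []
    for role, text in turns:
        parts.append(f"<|{role}_start|>{text}<|{role}_end|>")
    return "".join(parts)

example = render_conversation([
    ("user", "What is the capital of France?"),
    ("assistant", "The capital of France is Paris."),
])
print(example)
# The rendered string is then tokenized; as is common in SFT setups, the loss
# is typically computed only on the assistant's tokens.
```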
Benchmark Results
Chat Model Performance (After SFT)
| Benchmark | Score |
|---|---|
| ARC-Easy | 0.4571 |
| ARC-Challenge | 0.3430 |
| MMLU | 0.3396 |
| GSM8K | 0.0500 |
| HumanEval | 0.0793 |
| ChatCORE | 0.1298 |
Model Progression
| Stage | CORE/ChatCORE |
|---|---|
| Base Model | 0.2087 |
| Chat SFT | 0.1298 |
Usage
This model requires the nanochat codebase to run. See the nanochat repository for inference instructions.
Quick Start
```bash
# Clone the repository
git clone https://github.com/karpathy/nanochat.git
cd nanochat

# Download the model files from HuggingFace

# Run inference
python generate.py --checkpoint path/to/model_000650.pt
```
Model Files
- `model_000650.pt`: Chat SFT model weights (final checkpoint)
- `meta_000650.json`: Model metadata and configuration
- `tokenizer.pkl`: Custom BPE tokenizer
- `token_bytes.pt`: Token-to-byte mapping
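For inspection outside the nanochat scripts, the released files can be opened with plain PyTorch/pickle. This is only a sketch for poking at the artifacts, not a supported inference path, and unpickling `tokenizer.pkl` may require the nanochat package to be importable:

```python
import json
import pickle
import torch

# Load the released artifacts for inspection (not a supported inference path).
state = torch.load("model_000650.pt", map_location="cpu")   # model weights
with open("meta_000650.json") as f:
    meta = json.load(f)                                      # config/metadata
with open("tokenizer.pkl", "rb") as f:
    tokenizer = pickle.load(f)                               # custom BPE tokenizer

print(meta)
# Rough parameter count, assuming the .pt file is a plain tensor state dict.
print(sum(v.numel() for v in state.values() if torch.is_tensor(v)))  # ~561M
```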
Sample Outputs
From the base model (before chat tuning):
The capital of France is Paris. It is the largest city in France and the capital of the country.
The chemical symbol of gold is Au. It is a soft, silvery-white metal that is malleable and ductile.
The planets of the solar system are: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune,
Training Efficiency
- Cost: ~$120 for full training pipeline
- Time: 8h 21m total wall clock time
- Data Efficiency: 20:1 token-to-parameter ratio
- Code Bloat: 43 files, 8,550 lines, ~88k tokens, 2,004 dependency lines
Limitations
- Small model size (561M parameters) compared to commercial models
- Limited multilingual capabilities, especially for non-Latin scripts such as Korean, where the tokenizer compresses far worse than GPT-4's
- Mathematical reasoning needs improvement (GSM8K: 5%)
- Code generation capabilities are basic (HumanEval: 7.93%)
- Trained primarily on English data from FineWeb-Edu
Citation
If you use this model, please cite the nanochat project:
```bibtex
@misc{nanochat2025,
  author    = {Andrej Karpathy},
  title     = {nanochat: A minimalist ChatGPT from scratch},
  year      = {2025},
  publisher = {GitHub},
  url       = {https://github.com/karpathy/nanochat}
}
```
License
MIT License - See the nanochat repository for full license details.
Acknowledgments
- Trained on the FineWeb-Edu dataset
- Inspired by the GPT architecture and training methodologies
- Built using PyTorch and modern deep learning best practices
- Training infrastructure provided by cloud GPU providers
Contact & Issues
For issues related to this model or the nanochat framework, please visit the nanochat GitHub repository.