nanochat Model

This is a ChatGPT-like model trained using the nanochat pipeline.

Model Description

nanochat is a minimalist but full-featured GPT-style language model trained from scratch, following the complete pipeline from tokenizer training through pretraining, midtraining, and chat fine-tuning. This d20 model demonstrates that a small conversational model can be trained end to end on a single 8-GPU node for roughly $120 of compute.

Training Details

Architecture

  • Model Type: GPT decoder-only transformer
  • Parameters: ~561M (d20 configuration)
  • Context Length: 2048 tokens
  • Depth: 20 layers
  • Vocabulary Size: 65,536 tokens
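
For reference, the ~561M figure can be reproduced from the configuration above. The sketch below assumes nanochat's convention of model_dim = 64 × depth (1,280 for d20), untied input/output embeddings, and roughly 12·model_dim² weights per transformer block; none of those details are stated in this card.

# Back-of-the-envelope parameter count for the d20 configuration.
# Assumptions (not stated in this card): model_dim = 64 * depth, untied
# input/output embeddings, ~12 * model_dim^2 weights per block (attention + MLP),
# and negligible norm/bias parameters.
depth = 20
vocab_size = 65_536
model_dim = 64 * depth                      # 1,280

embedding = vocab_size * model_dim          # token embedding table
unembedding = vocab_size * model_dim        # output projection (untied)
per_block = 12 * model_dim ** 2             # attention QKV/out + MLP weights
total = embedding + unembedding + depth * per_block

print(f"~{total / 1e6:.0f}M parameters")    # -> ~561M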

Training Infrastructure

  • Hardware: 8× NVIDIA A100-SXM4-80GB (634GB total GPU memory)
  • Training Cost: ~$120 (estimated at $14.32/hour)
  • Total Training Time: ~8 hours 21 minutes
  • Training Tokens: 11.2B tokens
  • Tokens:Params Ratio: 20:1
  • MFU: 20.82%
  • Platform: Linux with CUDA 12.8
  • Framework: PyTorch 2.8.0
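
A quick sanity check of the headline cost and data-efficiency numbers, using only the figures listed above:

# Cost and tokens:params ratio, recomputed from the figures above.
hours = 8 + 21 / 60              # 8h 21m wall clock
rate = 14.32                     # USD per hour for the 8x A100 node
tokens = 11.2e9
params = 561e6

print(f"cost  ~${hours * rate:.0f}")                     # -> ~$120
print(f"ratio ~{tokens / params:.0f}:1 tokens/param")    # -> ~20:1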

Training Pipeline

1. Tokenizer Training

  • Vocabulary Size: 65,536 tokens
  • Special Tokens: 9
  • Training Data: 2B characters from FineWeb-Edu
  • Compression Ratio: 4.91 bytes/token (vs GPT-2: 4.67, GPT-4: 4.81)
  • Training Time: ~1.6 minutes
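
Compression ratio here is UTF-8 bytes of text divided by the number of tokens produced, so higher is better. The sketch below shows how the GPT-2/GPT-4 reference numbers can be measured with the tiktoken library; loading the nanochat tokenizer itself (from tokenizer.pkl) goes through the nanochat codebase and is not shown, and sample.txt is a placeholder for whatever evaluation text is used.

# Bytes-per-token (compression ratio) for the reference tokenizers.
# tiktoken ships the GPT-2 ("gpt2") and GPT-4 ("cl100k_base") encodings;
# the nanochat tokenizer would be loaded from tokenizer.pkl via the nanochat
# codebase (interface not shown here).
import tiktoken

def bytes_per_token(encoding, text: str) -> float:
    return len(text.encode("utf-8")) / len(encoding.encode(text))

sample = open("sample.txt", encoding="utf-8").read()   # placeholder eval text
print("GPT-2:", bytes_per_token(tiktoken.get_encoding("gpt2"), sample))
print("GPT-4:", bytes_per_token(tiktoken.get_encoding("cl100k_base"), sample))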

Tokenizer compression vs GPT-2/GPT-4 (positive = the nanochat tokenizer compresses that domain better than the reference tokenizer; negative = worse):

Domain      vs GPT-2   vs GPT-4
News          +7.2%      +3.1%
Korean        +3.0%     -98.6%
Code         +14.4%     -59.5%
Math          -3.2%     -16.1%
Science      +12.3%      +8.4%

2. Base Model Training (21,400 iterations)

  • Training Loss: 0.8185 bpb (bits per byte)
  • Validation Loss: 0.8156 bpb
  • CORE Metric: 0.2087
  • Batch Size: 524,288 tokens
  • Learning Rates: 0.02 (matrix parameters), 0.2 (embeddings), 0.004 (unembedding)
  • Training Time: ~6.6 hours
  • Peak Memory: 75.4 GiB
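
Bits per byte normalizes the per-token cross-entropy by how many bytes each token covers, which makes losses comparable across tokenizers. Assuming that normalization (the exact formula lives in the nanochat code), the reported 0.8185 bpb and the 4.91 bytes/token compression ratio imply roughly the following per-token loss and perplexity:

import math

# Convert bits-per-byte to per-token cross-entropy, assuming
# bpb = (loss_in_nats / ln 2) / bytes_per_token. This normalization is an
# assumption about how nanochat reports bpb, not a quote from its code.
bpb = 0.8185                 # reported training loss
bytes_per_token = 4.91       # tokenizer compression ratio

loss_nats = bpb * bytes_per_token * math.log(2)
print(f"~{loss_nats:.2f} nats/token, perplexity ~{math.exp(loss_nats):.1f}")
# -> ~2.79 nats/token, perplexity ~16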

Base Model Benchmark Results:

Task             Score
HellaSwag        0.2559
Winograd         0.3040
ARC-Easy         0.5174
ARC-Challenge    0.1251
LAMBADA          0.3775
PIQA             0.3645
CommonsenseQA    0.1964
SQuAD            0.2260

3. Midtraining (765 iterations)

  • Purpose: Domain adaptation
  • Validation Loss: 0.3976 bpb (minimum)
  • Training Time: ~30 minutes

4. Chat SFT (651 iterations)

  • Training Examples: 20,843 conversations
  • Epochs: 1
  • Training Loss: 1.1034
  • Validation Loss: 1.0189
  • Training Time: ~24 minutes
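
Since SFT runs for a single epoch, the iteration count implies the per-step batch size in conversations; the figure below is derived from the numbers above, not a quoted hyperparameter.

# Implied SFT batch size (derived, not a quoted setting):
# one epoch over 20,843 conversations in 651 optimizer steps.
examples = 20_843
steps = 651
print(f"~{examples / steps:.0f} conversations per optimizer step")   # -> ~32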

Benchmark Results

Chat Model Performance (After SFT)

Benchmark        Score
ARC-Easy         0.4571
ARC-Challenge    0.3430
MMLU             0.3396
GSM8K            0.0500
HumanEval        0.0793
ChatCORE         0.1298

Model Progression

Stage         Metric      Score
Base Model    CORE        0.2087
Chat SFT      ChatCORE    0.1298

Usage

This model requires the nanochat codebase to run. See the nanochat repository for inference instructions.

Quick Start

# Clone the repository
git clone https://github.com/karpathy/nanochat.git
cd nanochat

# Download model weights from HuggingFace (one way, via the huggingface-cli)
huggingface-cli download BrianGuo/nanochat-d20-chat --local-dir checkpoints

# Run inference
python generate.py --checkpoint path/to/model_000650.pt

Model Files

  • model_000650.pt: Chat SFT model weights (final checkpoint)
  • meta_000650.json: Model metadata and configuration
  • tokenizer.pkl: Custom BPE tokenizer
  • token_bytes.pt: Token to byte mapping
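
The released files can be inspected with standard tooling before wiring them into the nanochat inference code. The sketch below is exploratory only: the contents of the checkpoint and metadata, and the tokenizer's class, are defined by the nanochat codebase and are treated as unknowns here.

# Peek at the released artifacts (exploratory; their structure is defined by nanochat).
import json
import pickle

import torch

meta = json.load(open("meta_000650.json"))           # model config / metadata
print(meta)

# May need weights_only=False depending on how the checkpoint was saved.
state = torch.load("model_000650.pt", map_location="cpu")
print(type(state))                                    # e.g. a state_dict of tensors

# Unpickling may require the nanochat package to be importable.
with open("tokenizer.pkl", "rb") as f:
    tokenizer = pickle.load(f)                        # custom BPE tokenizer object
print(type(tokenizer))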

Sample Outputs

From the base model (before chat tuning):

The capital of France is Paris. It is the largest city in France and the capital of the country.

The chemical symbol of gold is Au. It is a soft, silvery-white metal that is malleable and ductile.

The planets of the solar system are: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune,

Training Efficiency

  • Cost: ~$120 for full training pipeline
  • Time: 8h 21m total wall clock time
  • Data Efficiency: 20:1 token-to-parameter ratio
  • Code Bloat: 43 files, 8,550 lines, ~88k tokens, 2,004 dependency lines

Limitations

  • Small model size (561M parameters) compared to commercial models
  • Limited multilingual capabilities (especially for Korean and similar languages)
  • Mathematical reasoning needs improvement (GSM8K: 5%)
  • Code generation capabilities are basic (HumanEval: 7.93%)
  • Trained primarily on English data from FineWeb-Edu

Citation

If you use this model, please cite the nanochat project:

@misc{nanochat2025,
  author = {Andrej Karpathy},
  title = {nanochat: A minimalist ChatGPT from scratch},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/karpathy/nanochat}
}

License

MIT License - See the nanochat repository for full license details.

Acknowledgments

  • Trained on the FineWeb-Edu dataset
  • Inspired by the GPT architecture and training methodologies
  • Built using PyTorch and modern deep learning best practices
  • Training infrastructure provided by cloud GPU providers

Contact & Issues

For issues related to this model or the nanochat framework, please visit the nanochat GitHub repository.
