nanochat Model

This is a ChatGPT-like model trained using the nanochat pipeline.

Model Description

nanochat is a minimalist but full-featured GPT-style language model trained from scratch, following the complete pipeline from tokenizer training through pretraining, midtraining, and chat fine-tuning. This d20 model demonstrates that a small conversational model can be trained end to end on a single 8-GPU node for roughly $120 of compute.

Training Details

Architecture

  • Model Type: GPT decoder-only transformer
  • Parameters: ~561M (d20 configuration)
  • Context Length: 2048 tokens
  • Depth: 20 layers
  • Vocabulary Size: 65,536 tokens
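
For reference, the ~561M figure can be reproduced from the configuration above. The sketch below assumes nanochat's convention of model_dim = 64 × depth (1,280 for d20), untied input/output embeddings, and roughly 12·model_dim² weights per transformer block; none of those details are stated in this card.

# Back-of-the-envelope parameter count for the d20 configuration.
# Assumptions (not stated in this card): model_dim = 64 * depth, untied
# input/output embeddings, ~12 * model_dim^2 weights per block (attention + MLP),
# and negligible norm/bias parameters.
depth = 20
vocab_size = 65_536
model_dim = 64 * depth                      # 1,280

embedding = vocab_size * model_dim          # token embedding table
unembedding = vocab_size * model_dim        # output projection (untied)
per_block = 12 * model_dim ** 2             # attention QKV/out + MLP weights
total = embedding + unembedding + depth * per_block

print(f"~{total / 1e6:.0f}M parameters")    # -> ~561M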

Training Infrastructure

  • Hardware: 8× NVIDIA A100-SXM4-80GB (634GB total GPU memory)
  • Training Cost: ~$120 (estimated at $14.32/hour)
  • Total Training Time: ~8 hours 21 minutes
  • Training Tokens: 11.2B tokens
  • Tokens:Params Ratio: 20:1
  • MFU: 20.82%
  • Platform: Linux with CUDA 12.8
  • Framework: PyTorch 2.8.0
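
A quick sanity check of the headline cost and data-efficiency numbers, using only the figures listed above:

# Cost and tokens:params ratio, recomputed from the figures above.
hours = 8 + 21 / 60              # 8h 21m wall clock
rate = 14.32                     # USD per hour for the 8x A100 node
tokens = 11.2e9
params = 561e6

print(f"cost  ~${hours * rate:.0f}")                     # -> ~$120
print(f"ratio ~{tokens / params:.0f}:1 tokens/param")    # -> ~20:1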

Training Pipeline

1. Tokenizer Training

  • Vocabulary Size: 65,536 tokens
  • Special Tokens: 9
  • Training Data: 2B characters from FineWeb-Edu
  • Compression Ratio: 4.91 bytes/token (vs GPT-2: 4.67, GPT-4: 4.81)
  • Training Time: ~1.6 minutes
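
Compression ratio here is UTF-8 bytes of text divided by the number of tokens produced, so higher is better. The sketch below shows how the GPT-2/GPT-4 reference numbers can be measured with the tiktoken library; loading the nanochat tokenizer itself (from tokenizer.pkl) goes through the nanochat codebase and is not shown, and sample.txt is a placeholder for whatever evaluation text is used.

# Bytes-per-token (compression ratio) for the reference tokenizers.
# tiktoken ships the GPT-2 ("gpt2") and GPT-4 ("cl100k_base") encodings;
# the nanochat tokenizer would be loaded from tokenizer.pkl via the nanochat
# codebase (interface not shown here).
import tiktoken

def bytes_per_token(encoding, text: str) -> float:
    return len(text.encode("utf-8")) / len(encoding.encode(text))

sample = open("sample.txt", encoding="utf-8").read()   # placeholder eval text
print("GPT-2:", bytes_per_token(tiktoken.get_encoding("gpt2"), sample))
print("GPT-4:", bytes_per_token(tiktoken.get_encoding("cl100k_base"), sample))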

Tokenizer compression vs GPT-2/GPT-4 (positive = the nanochat tokenizer compresses that domain better than the reference tokenizer; negative = worse):

Domain      vs GPT-2   vs GPT-4
News          +7.2%      +3.1%
Korean        +3.0%     -98.6%
Code         +14.4%     -59.5%
Math          -3.2%     -16.1%
Science      +12.3%      +8.4%

2. Base Model Training (21,400 iterations)

  • Training Loss: 0.8185 bpb (bits per byte)
  • Validation Loss: 0.8156 bpb
  • CORE Metric: 0.2087
  • Batch Size: 524,288 tokens
  • Learning Rates: 0.02 (matrix parameters), 0.2 (embeddings), 0.004 (unembedding)
  • Training Time: ~6.6 hours
  • Peak Memory: 75.4 GiB
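
Bits per byte normalizes the per-token cross-entropy by how many bytes each token covers, which makes losses comparable across tokenizers. Assuming that normalization (the exact formula lives in the nanochat code), the reported 0.8185 bpb and the 4.91 bytes/token compression ratio imply roughly the following per-token loss and perplexity:

import math

# Convert bits-per-byte to per-token cross-entropy, assuming
# bpb = (loss_in_nats / ln 2) / bytes_per_token. This normalization is an
# assumption about how nanochat reports bpb, not a quote from its code.
bpb = 0.8185                 # reported training loss
bytes_per_token = 4.91       # tokenizer compression ratio

loss_nats = bpb * bytes_per_token * math.log(2)
print(f"~{loss_nats:.2f} nats/token, perplexity ~{math.exp(loss_nats):.1f}")
# -> ~2.79 nats/token, perplexity ~16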

Base Model Benchmark Results:

Task             Score
HellaSwag        0.2559
Winograd         0.3040
ARC-Easy         0.5174
ARC-Challenge    0.1251
LAMBADA          0.3775
PIQA             0.3645
CommonsenseQA    0.1964
SQuAD            0.2260

3. Midtraining (765 iterations)

  • Purpose: Domain adaptation
  • Validation Loss: 0.3976 bpb (minimum)
  • Training Time: ~30 minutes

4. Chat SFT (651 iterations)

  • Training Examples: 20,843 conversations
  • Epochs: 1
  • Training Loss: 1.1034
  • Validation Loss: 1.0189
  • Training Time: ~24 minutes
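
Since SFT runs for a single epoch, the iteration count implies the per-step batch size in conversations; the figure below is derived from the numbers above, not a quoted hyperparameter.

# Implied SFT batch size (derived, not a quoted setting):
# one epoch over 20,843 conversations in 651 optimizer steps.
examples = 20_843
steps = 651
print(f"~{examples / steps:.0f} conversations per optimizer step")   # -> ~32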

Benchmark Results

Chat Model Performance (After SFT)

Benchmark        Score
ARC-Easy         0.4571
ARC-Challenge    0.3430
MMLU             0.3396
GSM8K            0.0500
HumanEval        0.0793
ChatCORE         0.1298

Model Progression

Stage         Metric      Score
Base Model    CORE        0.2087
Chat SFT      ChatCORE    0.1298

Usage

This model requires the nanochat codebase to run. See the nanochat repository for inference instructions.

Quick Start

# Clone the repository
git clone https://github.com/karpathy/nanochat.git
cd nanochat

# Download model weights from HuggingFace (one way, via the huggingface-cli)
huggingface-cli download BrianGuo/nanochat-d20-chat --local-dir checkpoints

# Run inference
python generate.py --checkpoint path/to/model_000650.pt

Model Files

  • model_000650.pt: Chat SFT model weights (final checkpoint)
  • meta_000650.json: Model metadata and configuration
  • tokenizer.pkl: Custom BPE tokenizer
  • token_bytes.pt: Token to byte mapping
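
The released files can be inspected with standard tooling before wiring them into the nanochat inference code. The sketch below is exploratory only: the contents of the checkpoint and metadata, and the tokenizer's class, are defined by the nanochat codebase and are treated as unknowns here.

# Peek at the released artifacts (exploratory; their structure is defined by nanochat).
import json
import pickle

import torch

meta = json.load(open("meta_000650.json"))           # model config / metadata
print(meta)

# May need weights_only=False depending on how the checkpoint was saved.
state = torch.load("model_000650.pt", map_location="cpu")
print(type(state))                                    # e.g. a state_dict of tensors

# Unpickling may require the nanochat package to be importable.
with open("tokenizer.pkl", "rb") as f:
    tokenizer = pickle.load(f)                        # custom BPE tokenizer object
print(type(tokenizer))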

Sample Outputs

From the base model (before chat tuning):

The capital of France is Paris. It is the largest city in France and the capital of the country.

The chemical symbol of gold is Au. It is a soft, silvery-white metal that is malleable and ductile.

The planets of the solar system are: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune,

Training Efficiency

  • Cost: ~$120 for full training pipeline
  • Time: 8h 21m total wall clock time
  • Data Efficiency: 20:1 token-to-parameter ratio
  • Code Bloat: 43 files, 8,550 lines, ~88k tokens, 2,004 dependency lines

Limitations

  • Small model size (561M parameters) compared to commercial models
  • Limited multilingual capabilities (especially for Korean and similar languages)
  • Mathematical reasoning needs improvement (GSM8K: 5%)
  • Code generation capabilities are basic (HumanEval: 7.93%)
  • Trained primarily on English data from FineWeb-Edu

Citation

If you use this model, please cite the nanochat project:

@misc{nanochat2025,
  author = {Andrej Karpathy},
  title = {nanochat: A minimalist ChatGPT from scratch},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/karpathy/nanochat}
}

License

MIT License - See the nanochat repository for full license details.

Acknowledgments

  • Trained on the FineWeb-Edu dataset
  • Inspired by the GPT architecture and training methodologies
  • Built using PyTorch and modern deep learning best practices
  • Training infrastructure provided by cloud GPU providers

Contact & Issues

For issues related to this model or the nanochat framework, please visit the nanochat GitHub repository.
