# nanochat 760M Backward BASE

A backward language model trained right-to-left.
## Model Description
This is a backward language model based on the nanochat architecture, extended with backward/bidirectional training capabilities for research purposes. See the onanchat repository for the full implementation.
**Direction:** Backward (right-to-left prediction via token reversal).
This model was trained on reversed token sequences. Instead of predicting the next token given previous context, it predicts the previous token given future context. This enables:
- Generating text that leads up to a known ending
- Causal inference in reverse (what came before?)
- Research into how LLMs learn directional dependencies
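Concretely, backward training can be sketched as ordinary next-token prediction applied to reversed sequences. The helper below is illustrative only (it is not part of the nanochat codebase):

```python
# Illustrative sketch: backward training is standard next-token prediction
# on reversed token sequences, so "next" in training order means
# "previous" in reading order.

def to_backward_example(tokens: list[int]) -> tuple[list[int], list[int]]:
    """Build (inputs, targets) for a backward LM from one forward sequence."""
    rev = tokens[::-1]
    return rev[:-1], rev[1:]  # standard shifted next-token pairs

tokens = [10, 11, 12, 13]        # forward reading order
inputs, targets = to_backward_example(tokens)
print(inputs)   # [13, 12, 11]
print(targets)  # [12, 11, 10] -> the model learns to predict what came before
```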
**Training Phase:** BASE

Base model pretrained on the FineWeb-Edu dataset.
## Model Architecture
| Parameter | Value |
|---|---|
| Parameters | ~760M |
| Layers | 20 |
| Hidden Size | 1280 |
| Attention Heads | 10 |
| Vocab Size | 65,536 |
| Context Length | 2,048 |
### Architecture Details
- Rotary Position Embeddings (RoPE) - No learned positional embeddings
- QK Normalization - Stabilizes attention
- ReLU² Activation - In MLP layers
- RMSNorm - No learnable parameters
- Grouped-Query Attention (GQA) support
- No bias in linear layers
- Untied embeddings - Separate input/output embeddings
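Two of the less common choices above are easy to sketch in a few lines. The NumPy versions below are minimal illustrations (epsilon and shapes are my own choices, not nanochat's exact values):

```python
import numpy as np

def rmsnorm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm with no learnable gain: divide by the root-mean-square."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms

def relu_squared(x: np.ndarray) -> np.ndarray:
    """ReLU^2 activation used in the MLP: max(x, 0) squared."""
    return np.square(np.maximum(x, 0.0))

x = np.array([[3.0, -4.0]])
print(rmsnorm(x))        # output has unit RMS, signs preserved
print(relu_squared(x))   # [[9. 0.]]
```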
## Quick Start
Try the model instantly in Google Colab:
## Usage

```python
import torch
from nanochat.checkpoint_manager import load_model

# Load the model; the direction is detected automatically from checkpoint metadata
model, tokenizer, meta = load_model("base", device="cuda", model_tag="d20_backward")
direction = meta["direction"]  # "backward"

# For backward models, reverse your input tokens before generation
# and reverse the output tokens after generation.
```
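The reverse-in, reverse-out bookkeeping can be kept in one small wrapper. The sketch below is hypothetical (`generate_fn` stands in for your actual generation call and is not a nanochat API):

```python
# Hedged sketch: wrap any next-token generator so it can drive a backward
# model. `generate_fn` maps a token list to an extended token list.

def backward_generate(generate_fn, prompt_tokens: list[int]) -> list[int]:
    """Reverse the prompt, generate in reversed space, un-reverse the result."""
    reversed_prompt = prompt_tokens[::-1]
    reversed_output = generate_fn(reversed_prompt)
    return reversed_output[::-1]

# Toy generator that appends decreasing ids, standing in for the model:
toy_generate = lambda toks: toks + [toks[-1] - 1, toks[-1] - 2]

# The prompt is the *ending* we want the generated text to lead up to:
print(backward_generate(toy_generate, [5, 6, 7]))  # [3, 4, 5, 6, 7]
```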
With the nanochat chat interface:

```bash
# CLI chat
python -m scripts.chat_cli --source=base --model-tag=d20_backward

# Web interface
python -m scripts.chat_web --source=base --model-tag=d20_backward
```
## Training
Trained using the nanochat framework with:
- Optimizer: Muon (for transformer layers) + AdamW (for embeddings)
- Batch Size: 262,144 tokens
- Learning Rate: 0.02 (matrix), 0.2 (embedding), 0.004 (unembedding)
- Hardware: 8x H100 GPUs
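The optimizer split above amounts to grouping parameters by role and assigning each group its listed learning rate. The sketch below is illustrative (the parameter names are hypothetical, and the actual Muon/AdamW construction is elided):

```python
# Sketch of the parameter grouping implied by the hyperparameters above:
# transformer weight matrices go to Muon, (un)embedding tables to AdamW.

def group_params(named_shapes: dict[str, tuple[int, ...]]) -> dict[str, dict]:
    groups = {
        "matrix":      {"lr": 0.02,  "names": []},  # transformer matrices -> Muon
        "embedding":   {"lr": 0.2,   "names": []},  # input embedding -> AdamW
        "unembedding": {"lr": 0.004, "names": []},  # output projection -> AdamW
    }
    for name, shape in named_shapes.items():
        if "lm_head" in name:
            groups["unembedding"]["names"].append(name)
        elif "embed" in name:
            groups["embedding"]["names"].append(name)
        else:
            groups["matrix"]["names"].append(name)
    return groups

# Hypothetical parameter names and shapes for illustration:
shapes = {
    "embed.weight": (65536, 1280),
    "blocks.0.attn.q.weight": (1280, 1280),
    "lm_head.weight": (65536, 1280),
}
print(group_params(shapes)["matrix"]["names"])  # ['blocks.0.attn.q.weight']
```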
## Research Context
This model is part of a research project studying how LLMs learn when trained in different directions:
- Do backward models learn different representations?
- Can models transfer knowledge across directions?
- Does bidirectional training help both directions?
For more details, see the onanchat repository.
## Limitations
- This is a research model, not intended for production use
- Small model size (~760M params) limits capabilities
- Backward models may produce unexpected outputs if direction handling is not properly implemented
## License
MIT License - Same as nanochat.
## Citation

```bibtex
@misc{nanochat-backward,
  author    = {Raghav},
  title     = {nanochat 760M Backward},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/raghavt/nanochat-760M-backward}
}
```
## Acknowledgements
Based on nanochat by Andrej Karpathy.