nanochat 760M Backward BASE


Backward language model trained right-to-left

Model Description

This is a backward language model based on the nanochat architecture, extended with backward/bidirectional training capabilities for research purposes. See the nanochat repository for the full implementation.

Direction: Backward

Right-to-left prediction via token reversal.

This model was trained on reversed token sequences. Instead of predicting the next token given previous context, it predicts the previous token given future context. This enables:

  • Generating text that leads up to a known ending
  • Causal inference in reverse (what came before?)
  • Research into how LLMs learn directional dependencies
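Mechanically, backward training is just standard next-token training on reversed sequences. A minimal sketch of how one training example could be constructed (the token IDs are illustrative, and this is not the exact nanochat data pipeline):

```python
def make_backward_example(tokens):
    """Reverse a token sequence so that ordinary next-token training
    teaches the model to predict the *previous* token of the
    original text. Sketch only, not nanochat's actual dataloader."""
    rev = list(reversed(tokens))
    inputs = rev[:-1]   # context: tokens that come later in the original text
    targets = rev[1:]   # target: the token that preceded that context
    return inputs, targets

# Original tokens [1, 2, 3, 4] yield pairs that teach the model
# to infer earlier tokens from later ones.
inputs, targets = make_backward_example([1, 2, 3, 4])
print(inputs, targets)  # [4, 3, 2] [3, 2, 1]
```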

Training Phase: BASE

Base model pretrained on the FineWeb-Edu dataset.

Model Architecture

| Parameter | Value |
|---|---|
| Parameters | ~760M |
| Layers | 20 |
| Hidden Size | 1280 |
| Attention Heads | 10 |
| Vocab Size | 65,536 |
| Context Length | 2,048 |

Architecture Details

  • Rotary Position Embeddings (RoPE) - No learned positional embeddings
  • QK Normalization - Stabilizes attention
  • ReLU² Activation - In MLP layers
  • RMSNorm - No learnable parameters
  • Group-Query Attention (GQA) support
  • No bias in linear layers
  • Untied embeddings - Separate input/output embeddings
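As a rough illustration of two of these components, here is parameter-free RMSNorm and the ReLU² activation in plain Python. This is a conceptual sketch operating on lists; the real implementation in the nanochat repo works on tensors:

```python
import math

def rms_norm(x, eps=1e-6):
    """RMSNorm with no learnable scale: divide each element by the
    root-mean-square of the vector (plus eps for stability)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def relu_squared(x):
    """ReLU^2 activation used in the MLP layers: square of max(0, v)."""
    return [max(0.0, v) ** 2 for v in x]

print(rms_norm([3.0, 4.0]))
print(relu_squared([-1.0, 2.0]))  # [0.0, 4.0]
```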

Quick Start

Try the model instantly in Google Colab: Open In Colab

Usage

import torch
from nanochat.checkpoint_manager import load_model

# Load model (automatically detects direction from checkpoint)
model, tokenizer, meta = load_model("base", device="cuda", model_tag="d20_backward")
direction = meta["direction"]  # "backward"

# For backward models, reverse your input before generation
# and reverse the output after generation
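The reverse-in/reverse-out pattern mentioned in the comments above can be wrapped in a small helper. The `generate_fn` below is a stand-in for whatever nanochat generation call you use; the dummy generator exists only to make the example self-contained:

```python
def generate_backward(generate_fn, suffix_tokens, **kwargs):
    """Reverse a known suffix, let the backward model extend it
    (the continuation corresponds to *earlier* text), then reverse
    the result back into reading order."""
    reversed_input = list(reversed(suffix_tokens))
    reversed_output = generate_fn(reversed_input, **kwargs)
    return list(reversed(reversed_output))

# Dummy generator: echoes its input and appends one "generated" token (0).
def dummy_generate(tokens):
    return tokens + [0]

# The generated token lands *before* the suffix in reading order.
print(generate_backward(dummy_generate, [7, 8, 9]))  # [0, 7, 8, 9]
```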

With the nanochat chat interface:

# CLI chat
python -m scripts.chat_cli --source=base --model-tag=d20_backward

# Web interface
python -m scripts.chat_web --source=base --model-tag=d20_backward

Training

Trained using the nanochat framework with:

  • Optimizer: Muon (for transformer layers) + AdamW (for embeddings)
  • Batch Size: 262,144 tokens
  • Learning Rate: 0.02 (matrix), 0.2 (embedding), 0.004 (unembedding)
  • Hardware: 8x H100 GPUs
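The optimizer split above can be sketched as a parameter-grouping rule. The parameter names and matching logic here are hypothetical stand-ins, not nanochat's actual code; only the group/learning-rate assignments come from the recipe above:

```python
def group_params(named_shapes):
    """Assign each parameter to an optimizer group per the training
    recipe: AdamW for embedding/unembedding tables, Muon for the
    remaining transformer weight matrices."""
    groups = {
        "muon (lr=0.02)": [],
        "adamw embedding (lr=0.2)": [],
        "adamw unembedding (lr=0.004)": [],
    }
    for name, shape in named_shapes.items():
        if "lm_head" in name:        # hypothetical unembedding name
            groups["adamw unembedding (lr=0.004)"].append(name)
        elif "embed" in name:        # hypothetical input-embedding name
            groups["adamw embedding (lr=0.2)"].append(name)
        else:                        # transformer matrices go to Muon
            groups["muon (lr=0.02)"].append(name)
    return groups

demo = {
    "wte.embed": (65536, 1280),
    "lm_head.weight": (65536, 1280),
    "blocks.0.attn.wq": (1280, 1280),
}
print(group_params(demo))
```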

Research Context

This model is part of a research project studying how LLMs learn when trained in different directions:

  1. Do backward models learn different representations?
  2. Can models transfer knowledge across directions?
  3. Does bidirectional training help both directions?

For more details, see the nanochat repository.

Limitations

  • This is a research model, not intended for production use
  • Small model size (~760M params) limits capabilities
  • Backward models produce reversed text: outputs will appear garbled unless inputs are reversed before generation and outputs reversed back afterward

License

MIT License - Same as nanochat.

Citation

@misc{nanochat-backward,
  author = {Raghav},
  title = {nanochat 760M Backward},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/raghavt/nanochat-760M-backward}
}

Acknowledgements

Based on nanochat by Andrej Karpathy.
