# nanochat 760M Backward BASE

A backward language model trained right-to-left.
## Model Description
This is a backward language model based on the nanochat architecture, extended with backward/bidirectional training capabilities for research purposes. See the onanchat repository for the full implementation.
**Direction:** Backward (right-to-left prediction via token reversal).
This model was trained on reversed token sequences. Instead of predicting the next token given previous context, it predicts the previous token given future context. This enables:
- Generating text that leads up to a known ending
- Causal inference in reverse (what came before?)
- Research into how LLMs learn directional dependencies
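Concretely, backward training can be sketched as ordinary next-token prediction applied to reversed sequences. The helper below is illustrative only (it is not part of the nanochat codebase):

```python
# Illustrative sketch: backward training is standard next-token prediction
# on reversed token sequences, so "next" in training order means
# "previous" in reading order.

def to_backward_example(tokens: list[int]) -> tuple[list[int], list[int]]:
    """Build (inputs, targets) for a backward LM from one forward sequence."""
    rev = tokens[::-1]
    return rev[:-1], rev[1:]  # standard shifted next-token pairs

tokens = [10, 11, 12, 13]        # forward reading order
inputs, targets = to_backward_example(tokens)
print(inputs)   # [13, 12, 11]
print(targets)  # [12, 11, 10] -> the model learns to predict what came before
```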
**Training Phase:** BASE

Base model pretrained on the FineWeb-Edu dataset.
## Model Architecture
| Parameter | Value |
|---|---|
| Parameters | ~760M |
| Layers | 20 |
| Hidden Size | 1280 |
| Attention Heads | 10 |
| Vocab Size | 65,536 |
| Context Length | 2,048 |
### Architecture Details
- Rotary Position Embeddings (RoPE) - No learned positional embeddings
- QK Normalization - Stabilizes attention
- ReLU² Activation - In MLP layers
- RMSNorm - No learnable parameters
- Grouped-Query Attention (GQA) support
- No bias in linear layers
- Untied embeddings - Separate input/output embeddings
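Two of the less common choices above are easy to sketch in a few lines. The NumPy versions below are minimal illustrations (epsilon and shapes are my own choices, not nanochat's exact values):

```python
import numpy as np

def rmsnorm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm with no learnable gain: divide by the root-mean-square."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms

def relu_squared(x: np.ndarray) -> np.ndarray:
    """ReLU^2 activation used in the MLP: max(x, 0) squared."""
    return np.square(np.maximum(x, 0.0))

x = np.array([[3.0, -4.0]])
print(rmsnorm(x))        # output has unit RMS, signs preserved
print(relu_squared(x))   # [[9. 0.]]
```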
## Quick Start
Try the model instantly in Google Colab:
## Usage

```python
import torch
from nanochat.checkpoint_manager import load_model

# Load the model; the direction is detected automatically from checkpoint metadata
model, tokenizer, meta = load_model("base", device="cuda", model_tag="d20_backward")
direction = meta["direction"]  # "backward"

# For backward models, reverse your input tokens before generation
# and reverse the output tokens after generation.
```
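The reverse-in, reverse-out bookkeeping can be kept in one small wrapper. The sketch below is hypothetical (`generate_fn` stands in for your actual generation call and is not a nanochat API):

```python
# Hedged sketch: wrap any next-token generator so it can drive a backward
# model. `generate_fn` maps a token list to an extended token list.

def backward_generate(generate_fn, prompt_tokens: list[int]) -> list[int]:
    """Reverse the prompt, generate in reversed space, un-reverse the result."""
    reversed_prompt = prompt_tokens[::-1]
    reversed_output = generate_fn(reversed_prompt)
    return reversed_output[::-1]

# Toy generator that appends decreasing ids, standing in for the model:
toy_generate = lambda toks: toks + [toks[-1] - 1, toks[-1] - 2]

# The prompt is the *ending* we want the generated text to lead up to:
print(backward_generate(toy_generate, [5, 6, 7]))  # [3, 4, 5, 6, 7]
```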
With the nanochat chat interface:

```bash
# CLI chat
python -m scripts.chat_cli --source=base --model-tag=d20_backward

# Web interface
python -m scripts.chat_web --source=base --model-tag=d20_backward
```
## Training
Trained using the nanochat framework with:
- Optimizer: Muon (for transformer layers) + AdamW (for embeddings)
- Batch Size: 262,144 tokens
- Learning Rate: 0.02 (matrix), 0.2 (embedding), 0.004 (unembedding)
- Hardware: 8x H100 GPUs
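The optimizer split above amounts to grouping parameters by role and assigning each group its listed learning rate. The sketch below is illustrative (the parameter names are hypothetical, and the actual Muon/AdamW construction is elided):

```python
# Sketch of the parameter grouping implied by the hyperparameters above:
# transformer weight matrices go to Muon, (un)embedding tables to AdamW.

def group_params(named_shapes: dict[str, tuple[int, ...]]) -> dict[str, dict]:
    groups = {
        "matrix":      {"lr": 0.02,  "names": []},  # transformer matrices -> Muon
        "embedding":   {"lr": 0.2,   "names": []},  # input embedding -> AdamW
        "unembedding": {"lr": 0.004, "names": []},  # output projection -> AdamW
    }
    for name, shape in named_shapes.items():
        if "lm_head" in name:
            groups["unembedding"]["names"].append(name)
        elif "embed" in name:
            groups["embedding"]["names"].append(name)
        else:
            groups["matrix"]["names"].append(name)
    return groups

# Hypothetical parameter names and shapes for illustration:
shapes = {
    "embed.weight": (65536, 1280),
    "blocks.0.attn.q.weight": (1280, 1280),
    "lm_head.weight": (65536, 1280),
}
print(group_params(shapes)["matrix"]["names"])  # ['blocks.0.attn.q.weight']
```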
## Research Context
This model is part of a research project studying how LLMs learn when trained in different directions:
- Do backward models learn different representations?
- Can models transfer knowledge across directions?
- Does bidirectional training help both directions?
For more details, see the onanchat repository.
## Limitations
- This is a research model, not intended for production use
- Small model size (~760M params) limits capabilities
- Backward models may produce unexpected outputs if direction handling is not properly implemented
## License
MIT License - Same as nanochat.
## Citation

```bibtex
@misc{nanochat-backward,
  author    = {Raghav},
  title     = {nanochat 760M Backward},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/raghavt/nanochat-760M-backward}
}
```
## Acknowledgements
Based on nanochat by Andrej Karpathy.