PolyPythias
This model is part of the PolyPythias suite, an extension of the Pythia project providing 45 additional training runs across 5 model sizes with 9 different random seeds each. These models enable systematic study of training stability and reproducibility in language models.
Paper
PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs
Oskar van der Wal, Pietro Lesci, Max Muller-Eberstein, Naomi Saphra, Hailey Schoelkopf, Willem Zuidema, and Stella Biderman. ICLR 2025.
Model Details
| Size | Parameters | Layers | Model Dim | Heads | Original Model |
|---|---|---|---|---|---|
| 14M | 14M | 6 | 128 | 4 | pythia-14m |
| 31M | 31M | 6 | 256 | 8 | pythia-31m |
| 70M | 70M | 6 | 512 | 8 | pythia-70m |
| 160M | 160M | 12 | 768 | 12 | pythia-160m |
| 410M | 410M | 24 | 1024 | 16 | pythia-410m |
All models were trained on 300B tokens from The Pile.
Naming Convention
- pythia-{size}m: Original Pythia model (seed 1234)
- pythia-{size}m-seed{1-9}: PolyPythias variants with different random seeds
- pythia-160m-data-seed{1-3}: 160M models with only data ordering varied (weight initialization fixed)
- pythia-160m-weight-seed{1-3}: 160M models with only weight initialization varied (data order fixed)
The decoupled seed variants (data-seed and weight-seed) allow researchers to separately study the effects of data ordering vs. weight initialization.
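For example, a minimal sketch (assuming the repo IDs follow the naming convention above, hosted under the EleutherAI organization) that enumerates every 160M run:

```python
# Sketch: build the Hugging Face repo IDs for all 160M runs, assuming the
# naming convention above maps directly onto repo names under EleutherAI.
org = "EleutherAI"
size = "160m"

repo_ids = [f"{org}/pythia-{size}"]                                        # original run (seed 1234)
repo_ids += [f"{org}/pythia-{size}-seed{s}" for s in range(1, 10)]         # full seed variants
repo_ids += [f"{org}/pythia-{size}-data-seed{s}" for s in range(1, 4)]     # data order varied only
repo_ids += [f"{org}/pythia-{size}-weight-seed{s}" for s in range(1, 4)]   # weight init varied only

print(len(repo_ids), "runs")  # 16 runs for the 160M size
```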
Quick Start
```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

# Load the final checkpoint
model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-70m-seed3")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m-seed3")

# Generate text
inputs = tokenizer("The quick brown fox", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```
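Because the suite is designed for cross-seed comparisons, a small sketch (the seed choices here are arbitrary) contrasting greedy continuations from two 70M runs that differ only in seed:

```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

# Compare greedy continuations from two runs of the same size but different seeds.
prompt = "The quick brown fox"
for repo in ["EleutherAI/pythia-70m-seed3", "EleutherAI/pythia-70m-seed7"]:
    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = GPTNeoXForCausalLM.from_pretrained(repo)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
    print(repo, "->", tokenizer.decode(outputs[0]))
```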
Available Checkpoints
Each model provides 154 intermediate checkpoints saved as Git branches:
| Checkpoint | Training Tokens | Description |
|---|---|---|
| step0 | 0 | Initialization (before training) |
| step1, step2, step4, ..., step512 | 2M - 1B | 10 log-spaced early checkpoints |
| step1000, step2000, ..., step143000 | 2B - 300B | 143 evenly spaced checkpoints |
To load a specific checkpoint:
```python
model = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m-seed3",
    revision="step50000",  # Any checkpoint step
)
```
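The checkpoint schedule described above can also be generated programmatically, which is convenient for sweeping over training time. A sketch (the analysis loop is a placeholder):

```python
from transformers import GPTNeoXForCausalLM

# Rebuild the 154 checkpoint revisions: step0, ten log-spaced early steps,
# then every 1000 steps up to step143000.
revisions = ["step0"]
revisions += [f"step{2**i}" for i in range(10)]                 # step1 ... step512
revisions += [f"step{i}" for i in range(1000, 144000, 1000)]    # step1000 ... step143000
assert len(revisions) == 154

# Example: load every tenth checkpoint of one run for a training-dynamics analysis.
for rev in revisions[::10]:
    model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-70m-seed3", revision=rev)
    # ... analyze `model` here ...
```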
Training Data
All models were trained on The Pile using pre-shuffled data orderings. The shuffled index files for each seed are available at:
EleutherAI/pile-preshuffled-seeds
This dataset contains the .idx files for seeds 0-9, which MMapIndexedDataset uses to read the memory-mapped Pile data in the correct order for each seed.
Reproducing Training Data Order
To reproduce the exact data ordering used for a specific seed:
- Download the Pile dataset and tokenize it using the Pythia tokenizer
- Download the corresponding seed folder from pile-preshuffled-seeds:

```python
# Using huggingface_hub
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="EleutherAI/pile-preshuffled-seeds",
    repo_type="dataset",
    allow_patterns="seed3/*",  # Download only seed3
    local_dir="./pile-seeds",
)
```

- Use the .idx files with GPT-NeoX's MMapIndexedDataset:

```python
from dataset import MMapIndexedDataset

dataset = MMapIndexedDataset(path_prefix, skip_warmup=True)
```
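Once the index files are in place, a hedged sketch of spot-checking the ordering (assuming each dataset item is an array of token IDs, and using a hypothetical path prefix that points at your tokenized Pile paired with the seed3 .idx file):

```python
from transformers import AutoTokenizer
from dataset import MMapIndexedDataset  # GPT-NeoX / Pythia utility

# Hypothetical prefix: point this at your tokenized Pile .bin plus the seed3 .idx file.
path_prefix = "./pile-seeds/seed3/pile"
dataset = MMapIndexedDataset(path_prefix, skip_warmup=True)

# Decode the first training sequence to spot-check the data order for this seed.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
first_sequence = dataset[0]  # assumed to be an array of token IDs
print(tokenizer.decode(first_sequence[:64]))
```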
For complete training reproduction instructions, see the Pythia GitHub repository.
All PolyPythias Models
The complete collection is available at: EleutherAI/polypythias
14M Parameter Models
- pythia-14m-seed1 through pythia-14m-seed9
31M Parameter Models
- pythia-31m-seed1 through pythia-31m-seed9
70M Parameter Models
- pythia-70m-seed1 through pythia-70m-seed9
160M Parameter Models
- pythia-160m-seed1 through pythia-160m-seed9
- pythia-160m-data-seed1 through pythia-160m-data-seed3
- pythia-160m-weight-seed1 through pythia-160m-weight-seed3
410M Parameter Models
- pythia-410m-seed1 through pythia-410m-seed9
Evaluation Results
Evaluation results for all models are available in the polypythias-evals dataset.
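For offline analysis, the raw result files can be fetched with huggingface_hub; a sketch assuming the dataset lives at EleutherAI/polypythias-evals (per the link above):

```python
from huggingface_hub import snapshot_download

# Download the evaluation results; the repo ID is assumed from the link above.
local_path = snapshot_download(
    repo_id="EleutherAI/polypythias-evals",
    repo_type="dataset",
    local_dir="./polypythias-evals",
)
print("Downloaded to:", local_path)
```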
Limitations
These models are released for research purposes only. They are not intended for deployment in production systems.
- Not instruction-tuned: These are base language models that predict the next token; they will not follow instructions the way instruction-tuned assistants such as ChatGPT do
- May generate harmful content: The Pile contains diverse internet text that includes biased, offensive, and factually incorrect content
- English only: Models were trained primarily on English text
- No safety filtering: Outputs are not filtered for safety or accuracy
License
Apache 2.0
Contact
For questions about these models, please use:
- EleutherAI Discord - #release-discussion channel
- GitHub Issues
Citation
If you use these models, please cite:
```bibtex
@inproceedings{vanderwal2025polypythias,
  title={PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs},
  author={van der Wal, Oskar and Lesci, Pietro and Muller-Eberstein, Max and Saphra, Naomi and Schoelkopf, Hailey and Zuidema, Willem and Biderman, Stella},
  booktitle={International Conference on Learning Representations},
  year={2025},
  url={https://arxiv.org/abs/2503.09543}
}
```