PolyPythias

This model is part of the PolyPythias suite, an extension of the Pythia project that adds 45 training runs: each of the 5 model sizes is trained from 9 additional random seeds. Together with the original Pythia runs, these models enable systematic study of training stability and reproducibility in language models.

Paper

PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs

Oskar van der Wal, Pietro Lesci, Max Muller-Eberstein, Naomi Saphra, Hailey Schoelkopf, Willem Zuidema, and Stella Biderman. ICLR 2025.

Model Details

| Size | Parameters | Layers | Model Dim | Heads | Original Model |
|------|------------|--------|-----------|-------|----------------|
| 14M  | 14M        | 6      | 128       | 4     | pythia-14m     |
| 31M  | 31M        | 6      | 256       | 8     | pythia-31m     |
| 70M  | 70M        | 6      | 512       | 8     | pythia-70m     |
| 160M | 160M       | 12     | 768       | 12    | pythia-160m    |
| 410M | 410M       | 24     | 1024      | 16    | pythia-410m    |

All models were trained on 300B tokens from The Pile.

Naming Convention

  • pythia-{size}m - Original Pythia model (seed 1234)
  • pythia-{size}m-seed{1-9} - PolyPythias variants with different random seeds
  • pythia-160m-data-seed{1-3} - 160M models with only data ordering varied (weight init fixed)
  • pythia-160m-weight-seed{1-3} - 160M models with only weight initialization varied (data order fixed)

The decoupled seed variants (data-seed and weight-seed) allow researchers to separately study the effects of data ordering vs. weight initialization.
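
As a small illustration (not an exhaustive list; the 160M decoupled variants follow the same pattern), the repository IDs for one size can be built directly from this convention:

# Build the repository IDs for the 70M runs from the naming convention above.
size = "70m"
repo_ids = ["EleutherAI/pythia-" + size] + [
    f"EleutherAI/pythia-{size}-seed{s}" for s in range(1, 10)
]
print(repo_ids)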

Quick Start

from transformers import GPTNeoXForCausalLM, AutoTokenizer

# Load the final checkpoint
model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-70m-seed3")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m-seed3")

# Generate text
inputs = tokenizer("The quick brown fox", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))

Available Checkpoints

Each model repository provides 154 checkpoints (from initialization through the final step) saved as Git branches:

| Checkpoint | Training Tokens | Description |
|------------|-----------------|-------------|
| step0 | 0 | Initialization (before training) |
| step1, step2, step4, ..., step512 | 2M - 1B | 10 log-spaced early checkpoints |
| step1000, step2000, ..., step143000 | 2B - 300B | 143 evenly-spaced checkpoints |
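
The schedule in the table can also be enumerated programmatically (a small sketch; it assumes roughly 2M tokens per step, i.e. 1024 sequences of 2048 tokens, as in the original Pythia runs):

# Enumerate all 154 checkpoint branch names described in the table above.
log_spaced = [2 ** i for i in range(10)]          # step1 ... step512
evenly_spaced = list(range(1000, 144000, 1000))   # step1000 ... step143000
steps = [0] + log_spaced + evenly_spaced
revisions = [f"step{s}" for s in steps]
assert len(revisions) == 154
print(revisions[:12], "...", revisions[-1])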

To load a specific checkpoint:

model = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m-seed3",
    revision="step50000",  # Any checkpoint step
)
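
To check which checkpoint branches a given repository actually exposes before pinning a revision, the Git refs can be listed with huggingface_hub (a sketch using list_repo_refs):

from huggingface_hub import list_repo_refs

# Each checkpoint is a Git branch; list them for one PolyPythias model.
refs = list_repo_refs("EleutherAI/pythia-70m-seed3")
branches = sorted(ref.name for ref in refs.branches)
print(len(branches), branches[:5])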

Training Data

All models were trained on The Pile using pre-shuffled data orderings. The shuffled index files for each seed are available at:

EleutherAI/pile-preshuffled-seeds

This dataset contains the .idx files for seeds 0-9; together with MMapIndexedDataset, they load the memory-mapped Pile data in the correct order for each seed.

Reproducing Training Data Order

To reproduce the exact data ordering used for a specific seed:

  1. Download the Pile dataset and tokenize it using the Pythia tokenizer
  2. Download the corresponding seed folder from pile-preshuffled-seeds:
    # Using huggingface_hub
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id="EleutherAI/pile-preshuffled-seeds",
        repo_type="dataset",
        allow_patterns="seed3/*",  # Download only seed3
        local_dir="./pile-seeds"
    )
    
  3. Use the idx files with GPT-NeoX's MMapIndexedDataset (see the fuller sketch below):
    # MMapIndexedDataset is provided by the GPT-NeoX codebase
    from megatron.data.indexed_dataset import MMapIndexedDataset

    # path_prefix is the shared prefix of the matching .bin/.idx pair
    dataset = MMapIndexedDataset(path_prefix, skip_warmup=True)
    

For complete training reproduction instructions, see the Pythia GitHub repository.
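
As a rough sketch of how the steps above fit together: since Pythia trains with batches of 1024 sequences of 2048 tokens, the data seen at a given training step corresponds to a contiguous slice of the pre-shuffled dataset. The path prefix below is hypothetical and should point at the matching .bin/.idx pair produced in steps 1-2:

from megatron.data.indexed_dataset import MMapIndexedDataset  # GPT-NeoX codebase

# Hypothetical prefix of the tokenized Pile .bin plus the seed3 .idx downloaded above.
path_prefix = "./pile-seeds/seed3/pile"
dataset = MMapIndexedDataset(path_prefix, skip_warmup=True)

# The i-th optimizer step consumed rows [i * 1024, (i + 1) * 1024) of the shuffled
# dataset; each row is an array of token ids for one training sequence.
i = 50000  # training step of interest
batch = [dataset[j] for j in range(i * 1024, (i + 1) * 1024)]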

All PolyPythias Models

The complete collection is available at: EleutherAI/polypythias


Evaluation Results

Evaluation results for all models are available in the polypythias-evals dataset.
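
A minimal way to pull these results locally (a sketch, assuming the evaluations are hosted as a Hugging Face dataset repository named EleutherAI/polypythias-evals):

from huggingface_hub import snapshot_download

# Download the raw evaluation files; the repository id is assumed from the name above.
local_path = snapshot_download(
    repo_id="EleutherAI/polypythias-evals",
    repo_type="dataset",
    local_dir="./polypythias-evals",
)
print(local_path)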

Limitations

These models are released for research purposes only. They are not intended for deployment in production systems.

  • Not instruction-tuned: These are base language models that predict the next token; they will not follow instructions like ChatGPT
  • May generate harmful content: The Pile contains diverse internet text that includes biased, offensive, and factually incorrect content
  • English only: Models were trained primarily on English text
  • No safety filtering: Outputs are not filtered for safety or accuracy

License

Apache 2.0

Contact

For questions about these models, please use:

Citation

If you use these models, please cite:

@inproceedings{vanderwal2025polypythias,
    title={PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs},
    author={van der Wal, Oskar and Lesci, Pietro and Muller-Eberstein, Max and Saphra, Naomi and Schoelkopf, Hailey and Zuidema, Willem and Biderman, Stella},
    booktitle={International Conference on Learning Representations},
    year={2025},
    url={https://arxiv.org/abs/2503.09543}
}