GPT-2 from Scratch

This model implements the GPT-2 architecture (125M parameters) and was trained from scratch.

Model Description

  • Model type: GPT-2 (125M parameters)
  • Architecture: Transformer-based autoregressive language model following the original GPT-2 design
  • Training data: multiple datasets (see the dataset tags on this page), roughly 18 billion tokens
  • Language: English
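
Below is a minimal usage sketch with the Hugging Face transformers library. The repo id comes from this card; whether the repo ships its own tokenizer files is an assumption (training used the GPT-2 BPE via tiktoken, so the stock `gpt2` tokenizer would be an equivalent fallback).

```python
# Hypothetical loading/generation sketch; assumes the checkpoint is stored in a
# transformers-compatible GPT-2 layout.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "thecr7guy/gpt2-pretrain"
tokenizer = AutoTokenizer.from_pretrained(model_id)   # fall back to "gpt2" if no tokenizer is shipped
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

inputs = tokenizer("The history of language modeling", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_k=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```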

Performance and Evaluation

| Dataset               | Metric | thecr7guy/gpt2-pretrain | GPT-2 (baseline) |
|-----------------------|--------|-------------------------|------------------|
| HellaSwag             | acc    | 0.291                   | 0.289            |
| SciQ                  | acc    | 0.754                   | 0.752            |
| Winogrande            | acc    | 0.491                   | 0.516            |
| TruthfulQA MC1        | acc    | 0.236                   | 0.228            |
| MMLU (overall)        | acc    | 0.230                   | 0.229            |
| MMLU: Humanities      | acc    | 0.242                   | 0.242            |
| MMLU: Social Sciences | acc    | 0.217                   | 0.217            |
| MMLU: STEM            | acc    | 0.213                   | 0.213            |
| MMLU: Other           | acc    | 0.239                   | 0.238            |
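
The evaluation harness used to produce these numbers is not named in the card. The sketch below shows how comparable accuracies could be obtained with EleutherAI's lm-evaluation-harness; the choice of harness, the task names, and the batch size are all assumptions.

```python
# Hedged reproduction sketch using lm-evaluation-harness (pip install lm-eval).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=thecr7guy/gpt2-pretrain",
    tasks=["hellaswag", "sciq", "winogrande", "truthfulqa_mc1", "mmlu"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```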

Training Details

  • Training corpus: Approximately 18B tokens (120GB)
  • Training duration: 1 epoch (approximately 8 hours total)
  • Hardware: 8× NVIDIA A100 PCIe GPUs via runpod.io
  • Estimated cost: approximately $108 (8 × $13.52) for the complete training run
  • Token context: 1024 tokens
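
Rough arithmetic implied by these figures together with the `total_batch_size`, `batch_size`, and `context_len` values listed under Hyperparameters below (the exact step count in train.py may differ):

```python
# Back-of-the-envelope step count: 18B tokens at 524,288 tokens per optimizer step.
tokens_total = 18_000_000_000
tokens_per_step = 524_288                      # total_batch_size (tokens per optimizer step)
micro_batch_tokens = 64 * 1024                 # batch_size x context_len
print(tokens_per_step // micro_batch_tokens)   # 8 micro-batches per step (one per GPU if no grad accumulation)
print(tokens_total // tokens_per_step)         # ~34,332 optimizer steps for one pass over the corpus
```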

Hyperparameters

  • context_len: 1024
  • seed: 42
  • epochs: 2
  • batch_size: 64
  • total_batch_size: 524288 tokens
  • grad_clip: 1.0
  • optimizer: "adamw"
  • max_lr: 6.0e-4
  • min_lr: 6.0e-5
  • beta1: 0.9
  • beta2: 0.95
  • weight_decay: 0.1
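
A minimal sketch of an optimizer and learning-rate setup consistent with the values above (AdamW, max_lr 6e-4 decaying to min_lr 6e-5, betas (0.9, 0.95), weight decay 0.1, gradient clipping at 1.0). The warmup length and the use of a cosine schedule are assumptions; train.py may implement this differently.

```python
import math
import torch

max_lr, min_lr = 6.0e-4, 6.0e-5
warmup_steps, max_steps = 700, 34_332   # warmup length is assumed; max_steps from the arithmetic above

def lr_at(step: int) -> float:
    if step < warmup_steps:                           # linear warmup
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:                             # hold at the floor after decay
        return min_lr
    ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))   # cosine from 1 down to 0
    return min_lr + coeff * (max_lr - min_lr)

model = torch.nn.Linear(8, 8)                         # stand-in for the GPT-2 model
optimizer = torch.optim.AdamW(
    model.parameters(), lr=max_lr, betas=(0.9, 0.95), weight_decay=0.1
)

for step in range(5):                                 # toy loop showing the per-step order of operations
    loss = model(torch.randn(4, 8)).pow(2).mean()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # grad_clip: 1.0
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```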


Commands used during setup and training

  • pip install wandb
  • pip install tiktoken
  • pip install --upgrade huggingface_hub
  • pip install torchinfo
  • pip install datasets
  • sudo apt update && sudo apt install tmux
  • tmux new -s training
  • wandb login
  • CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NCCL_P2P_DISABLE=1 torchrun --standalone --nproc_per_node=8 train.py
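
The torchrun launch above implies the standard PyTorch DistributedDataParallel pattern inside train.py. How the script actually sets this up is not shown in this card; the following is only an illustration of what a torchrun launch with 8 processes expects.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")            # torchrun provides RANK / WORLD_SIZE / LOCAL_RANK
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(8, 8).to(local_rank)       # stand-in for the GPT-2 model
model = DDP(model, device_ids=[local_rank])        # gradients are all-reduced across the 8 GPUs

# ... training loop runs here in each of the 8 processes ...

dist.destroy_process_group()
```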

Contact

GitHub: thecr7guy2
