2025-11-25 00:38:52,514 - INFO - Starting training with args: Namespace(regime='text', data_path='data/training/splits_510k/train_arrow', output_dir='outputs/production_text_ctx277_lm_20251125_003839', objective='lm', val_data_path='data/training/splits_510k/val_arrow', max_samples=None, vision_mode='small', text_context_tokens=277, hybrid_text_tokens=0, vision_prompt=None, train_encoder=False, encoder_lr=1e-05, compression_window_size=9, compression_stride=9, subsample_strategy='regular', subsample_count=None, projection_dim=None, train_projection=False, compression_target=None, conv_kernel=5, timestamp='20251125_003839', batch_size=12, gradient_accumulation_steps=4, learning_rate=0.0001, weight_decay=0.01, num_epochs=1, warmup_ratio=0.1, max_grad_norm=1.0, log_steps=10, save_steps=0, eval_steps=2000, initial_validation=True, validation_only=False, no_checkpoints=False, num_qualitative_samples=0, max_generation_tokens=200, use_wandb=True, wandb_project='vision-compression-2', wandb_run_name='production_text_ctx277_lm_20251125_003839', resume_from_checkpoint=None, resume=None, init_from_checkpoint=None, allow_objective_switch=False, aux_loss_weight=0.5, num_workers=16, prefetch_factor=4, seed=42, eval_seed=42, debug_log_sample_ids=False, device='cuda', compile=False, compile_mode='default', use_optimized_model=True, use_encoder_checkpointing=True, use_decoder_checkpointing=True, use_8bit_optimizer=True)
2025-11-25 00:38:52,514 - INFO - Setting random seed: 42
2025-11-25 00:38:54,008 - INFO - Initialized W&B run: vision-compression-2/production_text_ctx277_lm_20251125_003839 (ID: y619ou6b)
2025-11-25 00:38:54,008 - INFO - Loading model and tokenizer...
2025-11-25 00:39:01,942 - INFO - Enabling decoder gradient checkpointing...
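The decoder gradient checkpointing enabled above trades recomputation for activation memory. A minimal sketch of the mechanism with PyTorch's generic `torch.utils.checkpoint` (a toy two-layer block standing in for a transformer layer, not this codebase's actual trainer):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Hypothetical toy block standing in for one transformer layer.
layer = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())
x = torch.randn(4, 16, requires_grad=True)

# Plain forward: intermediate activations are kept for backward.
y_plain = layer(x).sum()

# Checkpointed forward: activations are dropped and recomputed during
# backward, saving memory at the cost of extra forward compute.
y_ckpt = checkpoint(layer, x, use_reentrant=False).sum()

# The gradients are identical either way.
g_plain, = torch.autograd.grad(y_plain, x)
g_ckpt, = torch.autograd.grad(y_ckpt, x)
assert torch.allclose(g_plain, g_ckpt)
```

The "~30-50% activation memory reduction, ~15-20% compute overhead" figures below are the usual rule of thumb for checkpointing every transformer layer.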
2025-11-25 00:39:01,948 - INFO - ✓ Decoder checkpointing enabled for 12 transformer layers
2025-11-25 00:39:01,948 - INFO - Expected: ~30-50% activation memory reduction, ~15-20% compute overhead
2025-11-25 00:39:01,974 - INFO - Created Text Baseline trainer
2025-11-25 00:39:01,974 - INFO - Training objective: lm
2025-11-25 00:39:01,999 - INFO - Logged parameter counts to W&B: total=2,934,734,080, trainable=2,934,734,080, encoder=0, decoder=2,934,734,080
2025-11-25 00:39:02,000 - INFO - Loading training data from data/training/splits_510k/train_arrow
2025-11-25 00:39:02,000 - INFO - Detected Arrow format: data/training/splits_510k/train_arrow
2025-11-25 00:39:02,000 - INFO - Loading Arrow dataset from data/training/splits_510k/train_arrow (memory-mapped)
2025-11-25 00:39:02,045 - INFO - Loaded 500,000 samples from data/training/splits_510k/train_arrow (memory-mapped)
2025-11-25 00:39:02,046 - INFO - Text baseline context tokens per sample: 277
2025-11-25 00:39:02,077 - INFO - Loading validation data from data/training/splits_510k/val_arrow
2025-11-25 00:39:02,077 - INFO - Detected Arrow format: data/training/splits_510k/val_arrow
2025-11-25 00:39:02,077 - INFO - Loading Arrow dataset from data/training/splits_510k/val_arrow (memory-mapped)
2025-11-25 00:39:02,084 - INFO - Loaded 10,000 samples from data/training/splits_510k/val_arrow (memory-mapped)
2025-11-25 00:39:02,085 - INFO - Validation text context tokens per sample: 277
2025-11-25 00:39:04,156 - INFO - Created 8-bit AdamW optimizer (bitsandbytes): Learning rate: 0.0001 | Memory savings: ~75% optimizer state (16.8GB for 2.8B params) | Expected overhead: ~2-5%
2025-11-25 00:39:04,157 - INFO - Created scheduler with warmup_steps=1041, total_steps=10417
2025-11-25 00:39:04,164 - INFO - Logged optimizer config to W&B: type=adamw_8bit, memory=5.47GB
2025-11-25 00:39:04,164 - INFO - Starting training loop...
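The scheduler and optimizer-memory figures above follow directly from the run config. A back-of-the-envelope check (assuming one optimizer step per `batch_size * gradient_accumulation_steps` samples with ceiling division, and two 8-bit AdamW state tensors per parameter):

```python
import math

# Figures taken from the run config and parameter-count lines above.
num_samples = 500_000
batch_size = 12
grad_accum = 4
warmup_ratio = 0.1
params = 2_934_734_080

# One optimizer step consumes batch_size * grad_accum = 48 samples.
total_steps = math.ceil(num_samples / (batch_size * grad_accum))
warmup_steps = int(warmup_ratio * total_steps)
print(total_steps, warmup_steps)  # 10417 1041, matching the logged scheduler

# AdamW keeps two state tensors per parameter (exp_avg, exp_avg_sq);
# bitsandbytes stores them in 1 byte each instead of 4.
int8_state_gib = params * 2 * 1 / 2**30
print(round(int8_state_gib, 2))   # 5.47, matching the logged optimizer memory
```

The "16.8GB" savings figure in the log is consistent with the rounded 2.8B parameter count: 6 bytes saved per parameter times 2.8e9 parameters.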
2025-11-25 00:39:04,164 - INFO - ======================================================================
2025-11-25 00:39:04,165 - INFO - Running initial validation (before any training)...
2025-11-25 00:39:04,165 - INFO - ======================================================================
2025-11-25 00:43:06,719 - INFO - Validation loss: 1.8097, perplexity: 6.11
2025-11-25 00:43:06,735 - INFO - Initial validation - Loss: 1.8097, Perplexity: 6.11
2025-11-25 00:43:06,735 - INFO - ======================================================================
2025-11-25 00:43:07,919 - INFO - Cleared GPU memory cache after initial validation
2025-11-25 00:43:07,920 - INFO - ======================================================================
2025-11-25 00:43:07,920 - INFO - Epoch 1/1
2025-11-25 00:43:07,921 - INFO - ======================================================================
2025-11-25 00:43:08,813 - WARNING - transformers: `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
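The perplexity values reported throughout this log are simply the exponential of the mean cross-entropy loss, which the initial validation numbers confirm:

```python
import math

# Initial validation loss from the log above.
val_loss = 1.8097

# Perplexity is exp(mean cross-entropy loss).
perplexity = math.exp(val_loss)
print(round(perplexity, 2))  # 6.11, as logged
```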
2025-11-25 00:43:09,505 - INFO - Effective context tokens (per-sample): 278 | Compression ratio: 3.60x
2025-11-25 00:43:09,505 - INFO - Target tokens per sample: 1000
2025-11-25 00:44:37,147 - INFO - Epoch 1 Step 10 (Global: 10): loss=1.7902, ppl=5.99, grad_norm=1.21, lr=1.09e-05, throughput=5381 tok/s
2025-11-25 00:46:04,484 - INFO - Epoch 1 Step 20 (Global: 20): loss=1.6053, ppl=4.98, grad_norm=1.09, lr=1.17e-05, throughput=5496 tok/s
2025-11-25 00:47:30,978 - INFO - Epoch 1 Step 30 (Global: 30): loss=1.8604, ppl=6.43, grad_norm=1.34, lr=1.26e-05, throughput=5550 tok/s
2025-11-25 00:48:57,782 - INFO - Epoch 1 Step 40 (Global: 40): loss=1.8850, ppl=6.59, grad_norm=1.20, lr=1.35e-05, throughput=5530 tok/s
2025-11-25 00:50:24,618 - INFO - Epoch 1 Step 50 (Global: 50): loss=1.8463, ppl=6.34, grad_norm=1.20, lr=1.43e-05, throughput=5528 tok/s
2025-11-25 00:51:50,983 - INFO - Epoch 1 Step 60 (Global: 60): loss=1.7529, ppl=5.77, grad_norm=1.27, lr=1.52e-05, throughput=5558 tok/s
2025-11-25 00:53:17,404 - INFO - Epoch 1 Step 70 (Global: 70): loss=1.7752, ppl=5.90, grad_norm=1.26, lr=1.61e-05, throughput=5554 tok/s
2025-11-25 00:54:43,613 - INFO - Epoch 1 Step 80 (Global: 80): loss=1.8069, ppl=6.09, grad_norm=1.13, lr=1.69e-05, throughput=5568 tok/s
2025-11-25 00:56:09,715 - INFO - Epoch 1 Step 90 (Global: 90): loss=1.5943, ppl=4.93, grad_norm=1.48, lr=1.78e-05, throughput=5575 tok/s
2025-11-25 00:57:36,237 - INFO - Epoch 1 Step 100 (Global: 100): loss=2.0173, ppl=7.52, grad_norm=1.11, lr=1.86e-05, throughput=5548 tok/s
2025-11-25 00:59:02,422 - INFO - Epoch 1 Step 110 (Global: 110): loss=1.8306, ppl=6.24, grad_norm=1.23, lr=1.95e-05, throughput=5569 tok/s
2025-11-25 01:00:28,439 - INFO - Epoch 1 Step 120 (Global: 120): loss=1.8391, ppl=6.29, grad_norm=1.09, lr=2.04e-05, throughput=5580 tok/s
2025-11-25 01:01:55,114 - INFO - Epoch 1 Step 130 (Global: 130): loss=1.7698, ppl=5.87, grad_norm=1.25, lr=2.12e-05, throughput=5538 tok/s
2025-11-25 01:03:21,175 - INFO - Epoch 1 Step 140 (Global: 140): loss=1.6389, ppl=5.15, grad_norm=1.23, lr=2.21e-05, throughput=5577 tok/s
2025-11-25 01:04:47,072 - INFO - Epoch 1 Step 150 (Global: 150): loss=1.7033, ppl=5.49, grad_norm=1.12, lr=2.30e-05, throughput=5588 tok/s
2025-11-25 01:06:13,165 - INFO - Epoch 1 Step 160 (Global: 160): loss=1.6680, ppl=5.30, grad_norm=1.09, lr=2.38e-05, throughput=5575 tok/s
2025-11-25 01:07:39,275 - INFO - Epoch 1 Step 170 (Global: 170): loss=1.7284, ppl=5.63, grad_norm=1.25, lr=2.47e-05, throughput=5574 tok/s
2025-11-25 01:09:05,652 - INFO - Epoch 1 Step 180 (Global: 180): loss=1.6441, ppl=5.18, grad_norm=1.25, lr=2.56e-05, throughput=5557 tok/s
2025-11-25 01:10:32,072 - INFO - Epoch 1 Step 190 (Global: 190): loss=1.6627, ppl=5.27, grad_norm=1.38, lr=2.64e-05, throughput=5554 tok/s
2025-11-25 01:11:58,627 - INFO - Epoch 1 Step 200 (Global: 200): loss=1.5315, ppl=4.63, grad_norm=1.26, lr=2.73e-05, throughput=5546 tok/s
2025-11-25 01:13:25,322 - INFO - Epoch 1 Step 210 (Global: 210): loss=1.6755, ppl=5.34, grad_norm=1.16, lr=2.82e-05, throughput=5537 tok/s
2025-11-25 01:14:51,569 - INFO - Epoch 1 Step 220 (Global: 220): loss=1.5644, ppl=4.78, grad_norm=1.16, lr=2.90e-05, throughput=5565 tok/s
2025-11-25 01:16:17,893 - INFO - Epoch 1 Step 230 (Global: 230): loss=1.8699, ppl=6.49, grad_norm=1.19, lr=2.99e-05, throughput=5560 tok/s
2025-11-25 01:17:44,488 - INFO - Epoch 1 Step 240 (Global: 240): loss=1.7021, ppl=5.49, grad_norm=1.18, lr=3.07e-05, throughput=5543 tok/s
2025-11-25 01:19:10,775 - INFO - Epoch 1 Step 250 (Global: 250): loss=1.7297, ppl=5.64, grad_norm=1.33, lr=3.16e-05, throughput=5563 tok/s
2025-11-25 01:20:37,539 - INFO - Epoch 1 Step 260 (Global: 260): loss=1.6451, ppl=5.18, grad_norm=1.17, lr=3.25e-05, throughput=5532 tok/s
2025-11-25 01:22:04,298 - INFO - Epoch 1 Step 270 (Global: 270): loss=1.5299, ppl=4.62, grad_norm=1.23, lr=3.33e-05, throughput=5533 tok/s
2025-11-25 01:23:30,699 - INFO - Epoch 1 Step 280 (Global: 280): loss=1.6722, ppl=5.32, grad_norm=1.19, lr=3.42e-05, throughput=5556 tok/s
2025-11-25 01:24:56,639 - INFO - Epoch 1 Step 290 (Global: 290): loss=1.9722, ppl=7.19, grad_norm=1.27, lr=3.51e-05, throughput=5585 tok/s
2025-11-25 01:26:23,158 - INFO - Epoch 1 Step 300 (Global: 300): loss=1.7945, ppl=6.02, grad_norm=1.25, lr=3.59e-05, throughput=5548 tok/s
2025-11-25 01:27:49,594 - INFO - Epoch 1 Step 310 (Global: 310): loss=1.6963, ppl=5.45, grad_norm=1.19, lr=3.68e-05, throughput=5553 tok/s
2025-11-25 01:29:15,983 - INFO - Epoch 1 Step 320 (Global: 320): loss=1.7740, ppl=5.89, grad_norm=1.14, lr=3.77e-05, throughput=5556 tok/s
2025-11-25 01:30:42,060 - INFO - Epoch 1 Step 330 (Global: 330): loss=1.5410, ppl=4.67, grad_norm=1.17, lr=3.85e-05, throughput=5576 tok/s
2025-11-25 01:32:08,285 - INFO - Epoch 1 Step 340 (Global: 340): loss=1.5653, ppl=4.78, grad_norm=1.17, lr=3.94e-05, throughput=5567 tok/s
2025-11-25 01:33:34,647 - INFO - Epoch 1 Step 350 (Global: 350): loss=1.6160, ppl=5.03, grad_norm=1.30, lr=4.03e-05, throughput=5558 tok/s
2025-11-25 01:35:00,352 - INFO - Epoch 1 Step 360 (Global: 360): loss=1.5804, ppl=4.86, grad_norm=1.39, lr=4.11e-05, throughput=5601 tok/s
2025-11-25 01:36:26,645 - INFO - Epoch 1 Step 370 (Global: 370): loss=1.6286, ppl=5.10, grad_norm=1.55, lr=4.20e-05, throughput=5563 tok/s
2025-11-25 01:37:52,752 - INFO - Epoch 1 Step 380 (Global: 380): loss=1.5932, ppl=4.92, grad_norm=1.39, lr=4.29e-05, throughput=5575 tok/s
2025-11-25 01:39:18,962 - INFO - Epoch 1 Step 390 (Global: 390): loss=1.7064, ppl=5.51, grad_norm=1.05, lr=4.37e-05, throughput=5568 tok/s
2025-11-25 01:40:45,135 - INFO - Epoch 1 Step 400 (Global: 400): loss=1.8765, ppl=6.53, grad_norm=1.24, lr=4.46e-05, throughput=5570 tok/s
2025-11-25 01:42:11,277 - INFO - Epoch 1 Step 410 (Global: 410): loss=1.7458, ppl=5.73, grad_norm=1.28, lr=4.54e-05, throughput=5572 tok/s
2025-11-25 01:43:37,501 - INFO - Epoch 1 Step 420 (Global: 420): loss=1.7279, ppl=5.63, grad_norm=1.12, lr=4.63e-05, throughput=5567 tok/s
2025-11-25 01:45:03,521 - INFO - Epoch 1 Step 430 (Global: 430): loss=1.7310, ppl=5.65, grad_norm=1.20, lr=4.72e-05, throughput=5580 tok/s
2025-11-25 01:46:30,408 - INFO - Epoch 1 Step 440 (Global: 440): loss=1.7508, ppl=5.76, grad_norm=1.17, lr=4.80e-05, throughput=5524 tok/s
2025-11-25 01:47:56,824 - INFO - Epoch 1 Step 450 (Global: 450): loss=1.8415, ppl=6.31, grad_norm=1.34, lr=4.89e-05, throughput=5555 tok/s
2025-11-25 01:49:23,320 - INFO - Epoch 1 Step 460 (Global: 460): loss=1.5765, ppl=4.84, grad_norm=1.30, lr=4.98e-05, throughput=5549 tok/s
2025-11-25 01:50:49,263 - INFO - Epoch 1 Step 470 (Global: 470): loss=1.7627, ppl=5.83, grad_norm=1.53, lr=5.06e-05, throughput=5585 tok/s
2025-11-25 01:52:15,895 - INFO - Epoch 1 Step 480 (Global: 480): loss=1.6384, ppl=5.15, grad_norm=1.14, lr=5.15e-05, throughput=5541 tok/s
2025-11-25 01:53:41,917 - INFO - Epoch 1 Step 490 (Global: 490): loss=1.6619, ppl=5.27, grad_norm=1.27, lr=5.24e-05, throughput=5580 tok/s
2025-11-25 01:55:08,421 - INFO - Epoch 1 Step 500 (Global: 500): loss=1.7970, ppl=6.03, grad_norm=1.41, lr=5.32e-05, throughput=5549 tok/s
2025-11-25 01:56:34,451 - INFO - Epoch 1 Step 510 (Global: 510): loss=1.5791, ppl=4.85, grad_norm=1.16, lr=5.41e-05, throughput=5580 tok/s
2025-11-25 01:58:00,414 - INFO - Epoch 1 Step 520 (Global: 520): loss=1.5442, ppl=4.68, grad_norm=1.44, lr=5.50e-05, throughput=5584 tok/s
2025-11-25 01:59:26,534 - INFO - Epoch 1 Step 530 (Global: 530): loss=1.5917, ppl=4.91, grad_norm=1.21, lr=5.58e-05, throughput=5574 tok/s
2025-11-25 02:00:52,513 - INFO - Epoch 1 Step 540 (Global: 540): loss=1.8955, ppl=6.66, grad_norm=1.38, lr=5.67e-05, throughput=5583 tok/s
2025-11-25 02:02:18,821 - INFO - Epoch 1 Step 550 (Global: 550): loss=1.6600, ppl=5.26, grad_norm=1.31, lr=5.76e-05, throughput=5562 tok/s
2025-11-25 02:03:45,181 - INFO - Epoch 1 Step 560 (Global: 560): loss=1.7742, ppl=5.90, grad_norm=1.12, lr=5.84e-05, throughput=5558 tok/s
2025-11-25 02:05:12,557 - INFO - Epoch 1 Step 570 (Global: 570): loss=1.5940, ppl=4.92, grad_norm=1.11, lr=5.93e-05, throughput=5494 tok/s
2025-11-25 02:06:38,922 - INFO - Epoch 1 Step 580 (Global: 580): loss=1.5840, ppl=4.87, grad_norm=1.12, lr=6.01e-05, throughput=5558 tok/s
2025-11-25 02:08:05,395 - INFO - Epoch 1 Step 590 (Global: 590): loss=1.7377, ppl=5.68, grad_norm=1.28, lr=6.10e-05, throughput=5551 tok/s
2025-11-25 02:09:31,667 - INFO - Epoch 1 Step 600 (Global: 600): loss=1.8257, ppl=6.21, grad_norm=1.28, lr=6.19e-05, throughput=5564 tok/s
2025-11-25 02:10:57,954 - INFO - Epoch 1 Step 610 (Global: 610): loss=1.8477, ppl=6.35, grad_norm=1.13, lr=6.27e-05, throughput=5563 tok/s
2025-11-25 02:12:24,071 - INFO - Epoch 1 Step 620 (Global: 620): loss=1.7262, ppl=5.62, grad_norm=1.13, lr=6.36e-05, throughput=5574 tok/s
2025-11-25 02:13:50,256 - INFO - Epoch 1 Step 630 (Global: 630): loss=1.8508, ppl=6.37, grad_norm=1.12, lr=6.45e-05, throughput=5569 tok/s
2025-11-25 02:15:16,334 - INFO - Epoch 1 Step 640 (Global: 640): loss=1.8857, ppl=6.59, grad_norm=1.38, lr=6.53e-05, throughput=5576 tok/s
2025-11-25 02:16:42,349 - INFO - Epoch 1 Step 650 (Global: 650): loss=1.7458, ppl=5.73, grad_norm=1.12, lr=6.62e-05, throughput=5581 tok/s
2025-11-25 02:18:08,355 - INFO - Epoch 1 Step 660 (Global: 660): loss=1.6821, ppl=5.38, grad_norm=1.30, lr=6.71e-05, throughput=5581 tok/s
2025-11-25 02:19:34,513 - INFO - Epoch 1 Step 670 (Global: 670): loss=1.8563, ppl=6.40, grad_norm=1.24, lr=6.79e-05, throughput=5571 tok/s
2025-11-25 02:21:00,933 - INFO - Epoch 1 Step 680 (Global: 680): loss=1.5897, ppl=4.90, grad_norm=1.50, lr=6.88e-05, throughput=5554 tok/s
2025-11-25 02:22:27,073 - INFO - Epoch 1 Step 690 (Global: 690): loss=1.5501, ppl=4.71, grad_norm=1.10, lr=6.97e-05, throughput=5572 tok/s
2025-11-25 02:23:53,407 - INFO - Epoch 1 Step 700 (Global: 700): loss=1.6137, ppl=5.02, grad_norm=1.13, lr=7.05e-05, throughput=5560 tok/s
2025-11-25 02:25:19,839 - INFO - Epoch 1 Step 710 (Global: 710): loss=1.7611, ppl=5.82, grad_norm=1.05, lr=7.14e-05, throughput=5554 tok/s
2025-11-25 02:26:46,152 - INFO - Epoch 1 Step 720 (Global: 720): loss=1.9502, ppl=7.03, grad_norm=1.16, lr=7.22e-05, throughput=5561 tok/s
2025-11-25 02:28:12,439 - INFO - Epoch 1 Step 730 (Global: 730): loss=1.8214, ppl=6.18, grad_norm=1.21, lr=7.31e-05, throughput=5563 tok/s
2025-11-25 02:29:38,851 - INFO - Epoch 1 Step 740 (Global: 740): loss=1.6712, ppl=5.32, grad_norm=1.17, lr=7.40e-05, throughput=5555 tok/s
2025-11-25 02:31:05,477 - INFO - Epoch 1 Step 750 (Global: 750): loss=1.9008, ppl=6.69, grad_norm=1.16, lr=7.48e-05, throughput=5541 tok/s
2025-11-25 02:32:33,152 - INFO - Epoch 1 Step 760 (Global: 760): loss=1.4862, ppl=4.42, grad_norm=1.06, lr=7.57e-05, throughput=5475 tok/s
2025-11-25 02:33:59,346 - INFO - Epoch 1 Step 770 (Global: 770): loss=1.5782, ppl=4.85, grad_norm=1.07, lr=7.66e-05, throughput=5569 tok/s
2025-11-25 02:35:25,477 - INFO - Epoch 1 Step 780 (Global: 780): loss=1.5149, ppl=4.55, grad_norm=1.13, lr=7.74e-05, throughput=5573 tok/s
2025-11-25 02:36:51,435 - INFO - Epoch 1 Step 790 (Global: 790): loss=1.6797, ppl=5.36, grad_norm=1.27, lr=7.83e-05, throughput=5584 tok/s
2025-11-25 02:38:17,523 - INFO - Epoch 1 Step 800 (Global: 800): loss=1.7550, ppl=5.78, grad_norm=1.07, lr=7.92e-05, throughput=5576 tok/s
2025-11-25 02:39:43,757 - INFO - Epoch 1 Step 810 (Global: 810): loss=1.5457, ppl=4.69, grad_norm=1.04, lr=8.00e-05, throughput=5566 tok/s
2025-11-25 02:41:09,851 - INFO - Epoch 1 Step 820 (Global: 820): loss=1.5157, ppl=4.55, grad_norm=1.07, lr=8.09e-05, throughput=5575 tok/s
2025-11-25 02:42:36,079 - INFO - Epoch 1 Step 830 (Global: 830): loss=1.4899, ppl=4.44, grad_norm=1.23, lr=8.18e-05, throughput=5567 tok/s
2025-11-25 02:44:02,140 - INFO - Epoch 1 Step 840 (Global: 840): loss=1.6916, ppl=5.43, grad_norm=1.05, lr=8.26e-05, throughput=5578 tok/s
2025-11-25 02:45:28,223 - INFO - Epoch 1 Step 850 (Global: 850): loss=1.8418, ppl=6.31, grad_norm=1.10, lr=8.35e-05, throughput=5576 tok/s
2025-11-25 02:46:54,353 - INFO - Epoch 1 Step 860 (Global: 860): loss=1.7330, ppl=5.66, grad_norm=1.05, lr=8.44e-05, throughput=5573 tok/s
2025-11-25 02:48:20,362 - INFO - Epoch 1 Step 870 (Global: 870): loss=1.7655, ppl=5.84, grad_norm=1.09, lr=8.52e-05, throughput=5581 tok/s
2025-11-25 02:49:46,341 - INFO - Epoch 1 Step 880 (Global: 880): loss=1.7401, ppl=5.70, grad_norm=1.10, lr=8.61e-05, throughput=5583 tok/s
2025-11-25 02:51:12,388 - INFO - Epoch 1 Step 890 (Global: 890): loss=1.5810, ppl=4.86, grad_norm=1.09, lr=8.69e-05, throughput=5578 tok/s
2025-11-25 02:52:38,497 - INFO - Epoch 1 Step 900 (Global: 900): loss=1.9887, ppl=7.31, grad_norm=1.10, lr=8.78e-05, throughput=5574 tok/s
2025-11-25 02:54:04,458 - INFO - Epoch 1 Step 910 (Global: 910): loss=1.7685, ppl=5.86, grad_norm=1.10, lr=8.87e-05, throughput=5584 tok/s
2025-11-25 02:55:30,523 - INFO - Epoch 1 Step 920 (Global: 920): loss=2.0049, ppl=7.43, grad_norm=1.17, lr=8.95e-05, throughput=5577 tok/s
2025-11-25 02:56:56,636 - INFO - Epoch 1 Step 930 (Global: 930): loss=1.6376, ppl=5.14, grad_norm=1.02, lr=9.04e-05, throughput=5574 tok/s
2025-11-25 02:58:22,610 - INFO - Epoch 1 Step 940 (Global: 940): loss=1.8992, ppl=6.68, grad_norm=1.05, lr=9.13e-05, throughput=5583 tok/s
2025-11-25 02:59:48,699 - INFO - Epoch 1 Step 950 (Global: 950): loss=1.6492, ppl=5.20, grad_norm=1.07, lr=9.21e-05, throughput=5576 tok/s
2025-11-25 03:01:14,879 - INFO - Epoch 1 Step 960 (Global: 960): loss=1.8470, ppl=6.34, grad_norm=1.04, lr=9.30e-05, throughput=5570 tok/s
2025-11-25 03:02:41,299 - INFO - Epoch 1 Step 970 (Global: 970): loss=1.4249, ppl=4.16, grad_norm=1.05, lr=9.39e-05, throughput=5554 tok/s
2025-11-25 03:04:07,446 - INFO - Epoch 1 Step 980 (Global: 980): loss=1.8178, ppl=6.16, grad_norm=1.19, lr=9.47e-05, throughput=5572 tok/s
2025-11-25 03:05:33,603 - INFO - Epoch 1 Step 990 (Global: 990): loss=1.7402, ppl=5.70, grad_norm=1.04, lr=9.56e-05, throughput=5571 tok/s
2025-11-25 03:06:59,882 - INFO - Epoch 1 Step 1000 (Global: 1000): loss=2.0780, ppl=7.99, grad_norm=1.28, lr=9.65e-05, throughput=5563 tok/s
2025-11-25 03:08:26,062 - INFO - Epoch 1 Step 1010 (Global: 1010): loss=1.6899, ppl=5.42, grad_norm=1.01, lr=9.73e-05, throughput=5570 tok/s
2025-11-25 03:09:52,326 - INFO - Epoch 1 Step 1020 (Global: 1020): loss=1.8750, ppl=6.52, grad_norm=1.20, lr=9.82e-05, throughput=5564 tok/s
2025-11-25 03:11:18,676 - INFO - Epoch 1 Step 1030 (Global: 1030): loss=1.7874, ppl=5.97, grad_norm=1.06, lr=9.90e-05, throughput=5559 tok/s
2025-11-25 03:12:44,998 - INFO - Epoch 1 Step 1040 (Global: 1040): loss=1.8869, ppl=6.60, grad_norm=1.24, lr=9.99e-05, throughput=5561 tok/s
2025-11-25 03:14:11,446 - INFO - Epoch 1 Step 1050 (Global: 1050): loss=1.8306, ppl=6.24, grad_norm=1.02, lr=1.00e-04, throughput=5553 tok/s
2025-11-25 03:15:37,956 - INFO - Epoch 1 Step 1060 (Global: 1060): loss=1.6307, ppl=5.11, grad_norm=0.94, lr=1.00e-04, throughput=5549 tok/s
2025-11-25 03:17:04,178 - INFO - Epoch 1 Step 1070 (Global: 1070): loss=1.8278, ppl=6.22, grad_norm=1.04, lr=1.00e-04, throughput=5567 tok/s
2025-11-25 03:18:30,284 - INFO - Epoch 1 Step 1080 (Global: 1080): loss=1.8660, ppl=6.46, grad_norm=0.98, lr=1.00e-04, throughput=5575 tok/s
2025-11-25 03:19:56,532 - INFO - Epoch 1 Step 1090 (Global: 1090): loss=1.6583, ppl=5.25, grad_norm=0.97, lr=1.00e-04, throughput=5565 tok/s
2025-11-25 03:21:22,990 - INFO - Epoch 1 Step 1100 (Global: 1100): loss=1.7084, ppl=5.52, grad_norm=0.96, lr=1.00e-04, throughput=5552 tok/s
2025-11-25 03:22:49,087 - INFO - Epoch 1 Step 1110 (Global: 1110): loss=1.9141, ppl=6.78, grad_norm=1.06, lr=1.00e-04, throughput=5575 tok/s
2025-11-25 03:24:15,149 - INFO - Epoch 1 Step 1120 (Global: 1120): loss=1.8385, ppl=6.29, grad_norm=1.04, lr=1.00e-04, throughput=5577 tok/s
2025-11-25 03:25:41,397 - INFO - Epoch 1 Step 1130 (Global: 1130): loss=1.7798, ppl=5.93, grad_norm=1.01, lr=1.00e-04, throughput=5565 tok/s
2025-11-25 03:27:07,422 - INFO - Epoch 1 Step 1140 (Global: 1140): loss=1.8785, ppl=6.54, grad_norm=1.04, lr=1.00e-04, throughput=5580 tok/s
2025-11-25 03:28:33,429 - INFO - Epoch 1 Step 1150 (Global: 1150): loss=1.8441, ppl=6.32, grad_norm=1.15, lr=1.00e-04, throughput=5581 tok/s
2025-11-25 03:29:59,240 - INFO - Epoch 1 Step 1160 (Global: 1160): loss=1.6271, ppl=5.09, grad_norm=1.02, lr=1.00e-04, throughput=5594 tok/s
2025-11-25 03:31:25,392 - INFO - Epoch 1 Step 1170 (Global: 1170): loss=1.6733, ppl=5.33, grad_norm=0.98, lr=1.00e-04, throughput=5572 tok/s
2025-11-25 03:32:51,355 - INFO - Epoch 1 Step 1180 (Global: 1180): loss=1.6908, ppl=5.42, grad_norm=1.02, lr=9.99e-05, throughput=5584 tok/s
2025-11-25 03:34:17,350 - INFO - Epoch 1 Step 1190 (Global: 1190): loss=1.9826, ppl=7.26, grad_norm=1.11, lr=9.99e-05, throughput=5582 tok/s
2025-11-25 03:35:43,171 - INFO - Epoch 1 Step 1200 (Global: 1200): loss=1.6715, ppl=5.32, grad_norm=0.91, lr=9.99e-05, throughput=5593 tok/s
2025-11-25 03:37:09,240 - INFO - Epoch 1 Step 1210 (Global: 1210): loss=1.8184, ppl=6.16, grad_norm=1.02, lr=9.99e-05, throughput=5577 tok/s
2025-11-25 03:38:35,147 - INFO - Epoch 1 Step 1220 (Global: 1220): loss=1.8818, ppl=6.57, grad_norm=0.90, lr=9.99e-05, throughput=5587 tok/s
2025-11-25 03:40:01,470 - INFO - Epoch 1 Step 1230 (Global: 1230): loss=1.7648, ppl=5.84, grad_norm=0.98, lr=9.99e-05, throughput=5561 tok/s
2025-11-25 03:41:27,396 - INFO - Epoch 1 Step 1240 (Global: 1240): loss=1.7628, ppl=5.83, grad_norm=0.92, lr=9.99e-05, throughput=5586 tok/s
2025-11-25 03:42:53,550 - INFO - Epoch 1 Step 1250 (Global: 1250): loss=1.8610, ppl=6.43, grad_norm=0.98, lr=9.99e-05, throughput=5571 tok/s
2025-11-25 03:44:19,626 - INFO - Epoch 1 Step 1260 (Global: 1260): loss=1.6604, ppl=5.26, grad_norm=0.86, lr=9.99e-05, throughput=5577 tok/s
2025-11-25 03:45:45,431 - INFO - Epoch 1 Step 1270 (Global: 1270): loss=1.6227, ppl=5.07, grad_norm=0.94, lr=9.99e-05, throughput=5594 tok/s
2025-11-25 03:47:11,196 - INFO - Epoch 1 Step 1280 (Global: 1280): loss=1.7837, ppl=5.95, grad_norm=0.98, lr=9.98e-05, throughput=5597 tok/s
2025-11-25 03:48:37,312 - INFO - Epoch 1 Step 1290 (Global: 1290): loss=1.7335, ppl=5.66, grad_norm=0.98, lr=9.98e-05, throughput=5574 tok/s
2025-11-25 03:50:03,186 - INFO - Epoch 1 Step 1300 (Global: 1300): loss=1.5673, ppl=4.79, grad_norm=0.89, lr=9.98e-05, throughput=5590 tok/s
2025-11-25 03:51:28,931 - INFO - Epoch 1 Step 1310 (Global: 1310): loss=1.7276, ppl=5.63, grad_norm=1.10, lr=9.98e-05, throughput=5598 tok/s
2025-11-25 03:52:55,334 - INFO - Epoch 1 Step 1320 (Global: 1320): loss=1.7981, ppl=6.04, grad_norm=1.02, lr=9.98e-05, throughput=5555 tok/s
2025-11-25 03:54:21,436 - INFO - Epoch 1 Step 1330 (Global: 1330): loss=1.5918, ppl=4.91, grad_norm=0.89, lr=9.98e-05, throughput=5575 tok/s
2025-11-25 03:55:47,512 - INFO - Epoch 1 Step 1340 (Global: 1340): loss=1.7310, ppl=5.65, grad_norm=0.97, lr=9.97e-05, throughput=5577 tok/s
2025-11-25 03:57:13,329 - INFO - Epoch 1 Step 1350 (Global: 1350): loss=1.7834, ppl=5.95, grad_norm=0.98, lr=9.97e-05, throughput=5593 tok/s
2025-11-25 03:58:39,091 - INFO - Epoch 1 Step 1360 (Global: 1360): loss=1.7971, ppl=6.03, grad_norm=0.93, lr=9.97e-05, throughput=5597 tok/s
2025-11-25 04:00:04,582 - INFO - Epoch 1 Step 1370 (Global: 1370): loss=1.7103, ppl=5.53, grad_norm=0.87, lr=9.97e-05, throughput=5615 tok/s
2025-11-25 04:01:30,782 - INFO - Epoch 1 Step 1380 (Global: 1380): loss=1.6225, ppl=5.07, grad_norm=1.04, lr=9.97e-05, throughput=5569 tok/s
2025-11-25 04:02:56,378 - INFO - Epoch 1 Step 1390 (Global: 1390): loss=1.8073, ppl=6.09, grad_norm=1.09, lr=9.97e-05, throughput=5608 tok/s
2025-11-25 04:04:22,319 - INFO - Epoch 1 Step 1400 (Global: 1400): loss=1.7365, ppl=5.68, grad_norm=0.97, lr=9.96e-05, throughput=5585 tok/s
2025-11-25 04:05:48,418 - INFO - Epoch 1 Step 1410 (Global: 1410): loss=1.8600, ppl=6.42, grad_norm=1.12, lr=9.96e-05, throughput=5575 tok/s
2025-11-25 04:07:14,454 - INFO - Epoch 1 Step 1420 (Global: 1420): loss=1.6961, ppl=5.45, grad_norm=0.84, lr=9.96e-05, throughput=5579 tok/s
2025-11-25 04:08:40,475 - INFO - Epoch 1 Step 1430 (Global: 1430): loss=1.8916, ppl=6.63, grad_norm=0.95, lr=9.96e-05, throughput=5580 tok/s
2025-11-25 04:10:06,501 - INFO - Epoch 1 Step 1440 (Global: 1440): loss=1.7616, ppl=5.82, grad_norm=0.93, lr=9.96e-05, throughput=5580 tok/s
2025-11-25 04:11:32,503 - INFO - Epoch 1 Step 1450 (Global: 1450): loss=1.8679, ppl=6.47, grad_norm=0.97, lr=9.95e-05, throughput=5581 tok/s
2025-11-25 04:12:58,782 - INFO - Epoch 1 Step 1460 (Global: 1460): loss=2.0044, ppl=7.42, grad_norm=1.01, lr=9.95e-05, throughput=5563 tok/s
2025-11-25 04:14:25,093 - INFO - Epoch 1 Step 1470 (Global: 1470): loss=1.6869, ppl=5.40, grad_norm=0.90, lr=9.95e-05, throughput=5561 tok/s
2025-11-25 04:15:51,311 - INFO - Epoch 1 Step 1480 (Global: 1480): loss=1.6269, ppl=5.09, grad_norm=0.87, lr=9.95e-05, throughput=5567 tok/s
2025-11-25 04:17:17,180 - INFO - Epoch 1 Step 1490 (Global: 1490): loss=1.6171, ppl=5.04, grad_norm=1.11, lr=9.94e-05, throughput=5590 tok/s
2025-11-25 04:18:42,972 - INFO - Epoch 1 Step 1500 (Global: 1500): loss=1.8481, ppl=6.35, grad_norm=0.97, lr=9.94e-05, throughput=5595 tok/s
2025-11-25 04:20:09,119 - INFO - Epoch 1 Step 1510 (Global: 1510): loss=1.8423, ppl=6.31, grad_norm=0.90, lr=9.94e-05, throughput=5572 tok/s
2025-11-25 04:21:35,175 - INFO - Epoch 1 Step 1520 (Global: 1520): loss=1.8560, ppl=6.40, grad_norm=0.92, lr=9.94e-05, throughput=5578 tok/s
2025-11-25 04:23:01,266 - INFO - Epoch 1 Step 1530 (Global: 1530): loss=1.8895, ppl=6.62, grad_norm=0.98, lr=9.93e-05, throughput=5576 tok/s
2025-11-25 04:24:27,320 - INFO - Epoch 1 Step 1540 (Global: 1540): loss=1.9558, ppl=7.07, grad_norm=1.00, lr=9.93e-05, throughput=5578 tok/s
2025-11-25 04:25:53,264 - INFO - Epoch 1 Step 1550 (Global: 1550): loss=1.7220, ppl=5.60, grad_norm=1.07, lr=9.93e-05, throughput=5585 tok/s
2025-11-25 04:27:19,326 - INFO - Epoch 1 Step 1560 (Global: 1560): loss=1.7270, ppl=5.62, grad_norm=1.01, lr=9.92e-05, throughput=5577 tok/s
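The "Compression ratio: 3.60x" line and the steady ~5,550-5,600 tok/s throughput are both consistent with the run config. A quick sanity check (the throughput interpretation, target tokens across the effective batch per optimizer step, is an assumption; the ~8.6 s step time comes from the ~86 s spacing of the 10-step log lines):

```python
# Figures from the log: effective context and target tokens per sample.
context_tokens = 278
target_tokens = 1000
ratio = target_tokens / context_tokens
print(f"Compression ratio: {ratio:.2f}x")  # 3.60x, as logged

# batch_size=12 * grad_accum=4 = 48 samples per optimizer step,
# and roughly 8.6 s per step from the log timestamps.
tokens_per_step = 48 * target_tokens
print(round(tokens_per_step / 8.6))        # ≈ 5581, near the logged tok/s
```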
2025-11-25 04:28:45,354 - INFO - Epoch 1 Step 1570 (Global: 1570): loss=1.5964, ppl=4.94, grad_norm=0.86, lr=9.92e-05, throughput=5580 tok/s
2025-11-25 04:30:11,373 - INFO - Epoch 1 Step 1580 (Global: 1580): loss=1.8035, ppl=6.07, grad_norm=0.90, lr=9.92e-05, throughput=5580 tok/s
2025-11-25 04:31:37,538 - INFO - Epoch 1 Step 1590 (Global: 1590): loss=1.7871, ppl=5.97, grad_norm=0.86, lr=9.92e-05, throughput=5571 tok/s
2025-11-25 04:33:03,676 - INFO - Epoch 1 Step 1600 (Global: 1600): loss=1.6350, ppl=5.13, grad_norm=0.89, lr=9.91e-05, throughput=5573 tok/s
2025-11-25 04:34:29,400 - INFO - Epoch 1 Step 1610 (Global: 1610): loss=1.8112, ppl=6.12, grad_norm=0.90, lr=9.91e-05, throughput=5599 tok/s
2025-11-25 04:35:55,407 - INFO - Epoch 1 Step 1620 (Global: 1620): loss=1.6600, ppl=5.26, grad_norm=0.86, lr=9.91e-05, throughput=5581 tok/s
2025-11-25 04:37:21,211 - INFO - Epoch 1 Step 1630 (Global: 1630): loss=1.9770, ppl=7.22, grad_norm=1.59, lr=9.90e-05, throughput=5594 tok/s
2025-11-25 04:38:47,066 - INFO - Epoch 1 Step 1640 (Global: 1640): loss=1.6354, ppl=5.13, grad_norm=0.83, lr=9.90e-05, throughput=5591 tok/s
2025-11-25 04:40:13,133 - INFO - Epoch 1 Step 1650 (Global: 1650): loss=1.6461, ppl=5.19, grad_norm=0.95, lr=9.90e-05, throughput=5577 tok/s
2025-11-25 04:41:38,852 - INFO - Epoch 1 Step 1660 (Global: 1660): loss=1.7078, ppl=5.52, grad_norm=0.96, lr=9.89e-05, throughput=5600 tok/s
2025-11-25 04:43:05,200 - INFO - Epoch 1 Step 1670 (Global: 1670): loss=2.0459, ppl=7.74, grad_norm=1.05, lr=9.89e-05, throughput=5559 tok/s
2025-11-25 04:44:31,037 - INFO - Epoch 1 Step 1680 (Global: 1680): loss=1.6376, ppl=5.14, grad_norm=0.88, lr=9.89e-05, throughput=5592 tok/s
2025-11-25 04:45:56,980 - INFO - Epoch 1 Step 1690 (Global: 1690): loss=1.8018, ppl=6.06, grad_norm=1.02, lr=9.88e-05, throughput=5585 tok/s
2025-11-25 04:47:23,098 - INFO - Epoch 1 Step 1700 (Global: 1700): loss=1.6658, ppl=5.29, grad_norm=1.16, lr=9.88e-05, throughput=5574 tok/s
2025-11-25 04:48:49,121 - INFO - Epoch 1 Step 1710 (Global: 1710): loss=1.8767, ppl=6.53, grad_norm=0.97, lr=9.87e-05, throughput=5580 tok/s
2025-11-25 04:50:15,098 - INFO - Epoch 1 Step 1720 (Global: 1720): loss=1.7748, ppl=5.90, grad_norm=0.98, lr=9.87e-05, throughput=5583 tok/s
2025-11-25 04:51:40,890 - INFO - Epoch 1 Step 1730 (Global: 1730): loss=1.6233, ppl=5.07, grad_norm=1.05, lr=9.87e-05, throughput=5595 tok/s
2025-11-25 04:53:06,709 - INFO - Epoch 1 Step 1740 (Global: 1740): loss=1.6121, ppl=5.01, grad_norm=0.83, lr=9.86e-05, throughput=5593 tok/s
2025-11-25 04:54:32,485 - INFO - Epoch 1 Step 1750 (Global: 1750): loss=1.6832, ppl=5.38, grad_norm=0.84, lr=9.86e-05, throughput=5596 tok/s
2025-11-25 04:55:58,445 - INFO - Epoch 1 Step 1760 (Global: 1760): loss=1.7689, ppl=5.86, grad_norm=0.86, lr=9.86e-05, throughput=5584 tok/s
2025-11-25 04:57:24,339 - INFO - Epoch 1 Step 1770 (Global: 1770): loss=2.0391, ppl=7.68, grad_norm=0.98, lr=9.85e-05, throughput=5588 tok/s
2025-11-25 04:58:50,288 - INFO - Epoch 1 Step 1780 (Global: 1780): loss=1.9265, ppl=6.87, grad_norm=0.95, lr=9.85e-05, throughput=5585 tok/s
2025-11-25 05:00:16,261 - INFO - Epoch 1 Step 1790 (Global: 1790): loss=1.5611, ppl=4.76, grad_norm=0.79, lr=9.84e-05, throughput=5583 tok/s
2025-11-25 05:01:42,314 - INFO - Epoch 1 Step 1800 (Global: 1800): loss=1.7675, ppl=5.86, grad_norm=0.81, lr=9.84e-05, throughput=5578 tok/s
2025-11-25 05:03:08,192 - INFO - Epoch 1 Step 1810 (Global: 1810): loss=1.7088, ppl=5.52, grad_norm=0.94, lr=9.83e-05, throughput=5589 tok/s
2025-11-25 05:04:34,389 - INFO - Epoch 1 Step 1820 (Global: 1820): loss=1.8298, ppl=6.23, grad_norm=0.94, lr=9.83e-05, throughput=5569 tok/s
2025-11-25 05:06:00,408 - INFO - Epoch 1 Step 1830 (Global: 1830): loss=1.7902, ppl=5.99, grad_norm=0.86, lr=9.83e-05, throughput=5580 tok/s
2025-11-25 05:07:26,274 - INFO - Epoch 1 Step 1840 (Global: 1840): loss=1.6018, ppl=4.96, grad_norm=0.80, lr=9.82e-05, throughput=5590 tok/s
2025-11-25 05:08:52,228 - INFO - Epoch 1 Step 1850 (Global: 1850): loss=1.6075, ppl=4.99, grad_norm=0.88, lr=9.82e-05, throughput=5584 tok/s
2025-11-25 05:10:18,290 - INFO - Epoch 1 Step 1860 (Global: 1860): loss=1.7889, ppl=5.98, grad_norm=0.85, lr=9.81e-05, throughput=5577 tok/s
2025-11-25 05:11:44,815 - INFO - Epoch 1 Step 1870 (Global: 1870): loss=1.9555, ppl=7.07, grad_norm=0.91, lr=9.81e-05, throughput=5548 tok/s
2025-11-25 05:13:11,203 - INFO - Epoch 1 Step 1880 (Global: 1880): loss=1.7974, ppl=6.03, grad_norm=0.84, lr=9.80e-05, throughput=5556 tok/s
2025-11-25 05:14:37,220 - INFO - Epoch 1 Step 1890 (Global: 1890): loss=1.7681, ppl=5.86, grad_norm=0.92, lr=9.80e-05, throughput=5580 tok/s
2025-11-25 05:16:03,312 - INFO - Epoch 1 Step 1900 (Global: 1900): loss=1.7490, ppl=5.75, grad_norm=0.93, lr=9.79e-05, throughput=5575 tok/s
2025-11-25 05:17:29,197 - INFO - Epoch 1 Step 1910 (Global: 1910): loss=1.6624, ppl=5.27, grad_norm=0.83, lr=9.79e-05, throughput=5589 tok/s
2025-11-25 05:18:55,352 - INFO - Epoch 1 Step 1920 (Global: 1920): loss=1.6963, ppl=5.45, grad_norm=0.83, lr=9.78e-05, throughput=5571 tok/s
2025-11-25 05:20:21,415 - INFO - Epoch 1 Step 1930 (Global: 1930): loss=1.6236, ppl=5.07, grad_norm=0.88, lr=9.78e-05, throughput=5577 tok/s
2025-11-25 05:21:47,645 - INFO - Epoch 1 Step 1940 (Global: 1940): loss=1.6723, ppl=5.32, grad_norm=1.01, lr=9.77e-05, throughput=5567 tok/s
2025-11-25 05:23:13,683 - INFO - Epoch 1 Step 1950 (Global: 1950): loss=1.7777, ppl=5.92, grad_norm=0.90, lr=9.77e-05, throughput=5579 tok/s
2025-11-25 05:24:39,884 - INFO - Epoch 1 Step 1960 (Global: 1960): loss=1.6719, ppl=5.32, grad_norm=0.89, lr=9.76e-05, throughput=5568 tok/s
2025-11-25 05:26:05,850 - INFO - Epoch 1 Step 1970 (Global: 1970): loss=1.5778, ppl=4.84, grad_norm=0.94, lr=9.76e-05, throughput=5584 tok/s
2025-11-25 05:27:31,359 - INFO - Epoch 1 Step 1980 (Global: 1980): loss=1.8810, ppl=6.56, grad_norm=0.86, lr=9.75e-05, throughput=5614 tok/s
2025-11-25 05:28:57,360 - INFO - Epoch 1 Step 1990 (Global: 1990): loss=1.7558, ppl=5.79, grad_norm=0.85, lr=9.75e-05, throughput=5581 tok/s
2025-11-25 05:30:23,428 - INFO - Epoch 1 Step 2000 (Global: 2000): loss=1.4542, ppl=4.28, grad_norm=0.83, lr=9.74e-05, throughput=5577 tok/s
2025-11-25 05:30:23,429 - INFO - Running validation at step 2000...
2025-11-25 05:34:27,597 - INFO - Validation loss: 1.7708, perplexity: 5.88
2025-11-25 05:34:49,671 - INFO - Saved checkpoint to outputs/production_text_ctx277_lm_20251125_003839/best_checkpoint.pt
2025-11-25 05:34:49,676 - INFO - New best validation loss: 1.7708, perplexity: 5.88
2025-11-25 05:36:15,179 - INFO - Epoch 1 Step 2010 (Global: 2010): loss=1.8014, ppl=6.06, grad_norm=0.89, lr=9.74e-05, throughput=5615 tok/s
2025-11-25 05:37:40,487 - INFO - Epoch 1 Step 2020 (Global: 2020): loss=1.7862, ppl=5.97, grad_norm=0.81, lr=9.73e-05, throughput=5627 tok/s
2025-11-25 05:39:05,717 - INFO - Epoch 1 Step 2030 (Global: 2030): loss=1.6233, ppl=5.07, grad_norm=0.81, lr=9.73e-05, throughput=5632 tok/s
2025-11-25 05:40:31,225 - INFO - Epoch 1 Step 2040 (Global: 2040): loss=1.9074, ppl=6.74, grad_norm=0.89, lr=9.72e-05, throughput=5614 tok/s
2025-11-25 05:41:56,823 - INFO - Epoch 1 Step 2050 (Global: 2050): loss=1.9198, ppl=6.82, grad_norm=0.93, lr=9.72e-05, throughput=5608 tok/s
2025-11-25 05:43:22,357 - INFO - Epoch 1 Step 2060 (Global: 2060): loss=2.1437, ppl=8.53, grad_norm=0.98, lr=9.71e-05, throughput=5612 tok/s
2025-11-25 05:44:47,824 - INFO - Epoch 1 Step 2070 (Global: 2070): loss=1.7525, ppl=5.77, grad_norm=0.90, lr=9.71e-05, throughput=5616 tok/s
2025-11-25 05:46:13,183 - INFO - Epoch 1 Step 2080 (Global: 2080): loss=1.7639, ppl=5.84, grad_norm=0.91, lr=9.70e-05, throughput=5623 tok/s
2025-11-25 05:47:38,550 - INFO - Epoch 1 Step 2090 (Global: 2090): loss=1.7961, ppl=6.03, grad_norm=0.88, lr=9.69e-05, throughput=5623 tok/s
2025-11-25 05:49:04,002 - INFO - Epoch 1 Step 2100 (Global: 2100): loss=1.7997, ppl=6.05, grad_norm=0.84, lr=9.69e-05, throughput=5617 tok/s
2025-11-25 05:50:29,616 -
INFO - Epoch 1 Step 2110 (Global: 2110): loss=1.7276, ppl=5.63, grad_norm=0.81, lr=9.68e-05, throughput=5607 tok/s 2025-11-25 05:51:55,139 - INFO - Epoch 1 Step 2120 (Global: 2120): loss=2.0304, ppl=7.62, grad_norm=0.92, lr=9.68e-05, throughput=5613 tok/s 2025-11-25 05:53:20,652 - INFO - Epoch 1 Step 2130 (Global: 2130): loss=1.9752, ppl=7.21, grad_norm=0.89, lr=9.67e-05, throughput=5613 tok/s 2025-11-25 05:54:46,090 - INFO - Epoch 1 Step 2140 (Global: 2140): loss=1.7610, ppl=5.82, grad_norm=0.87, lr=9.66e-05, throughput=5618 tok/s 2025-11-25 05:56:11,564 - INFO - Epoch 1 Step 2150 (Global: 2150): loss=1.8456, ppl=6.33, grad_norm=0.86, lr=9.66e-05, throughput=5616 tok/s 2025-11-25 05:57:37,061 - INFO - Epoch 1 Step 2160 (Global: 2160): loss=1.6300, ppl=5.10, grad_norm=0.79, lr=9.65e-05, throughput=5614 tok/s 2025-11-25 05:59:02,503 - INFO - Epoch 1 Step 2170 (Global: 2170): loss=1.7589, ppl=5.81, grad_norm=0.86, lr=9.65e-05, throughput=5618 tok/s 2025-11-25 06:00:28,004 - INFO - Epoch 1 Step 2180 (Global: 2180): loss=1.7565, ppl=5.79, grad_norm=1.16, lr=9.64e-05, throughput=5614 tok/s 2025-11-25 06:01:53,899 - INFO - Epoch 1 Step 2190 (Global: 2190): loss=1.8956, ppl=6.66, grad_norm=0.88, lr=9.63e-05, throughput=5588 tok/s 2025-11-25 06:03:19,352 - INFO - Epoch 1 Step 2200 (Global: 2200): loss=1.9837, ppl=7.27, grad_norm=0.85, lr=9.63e-05, throughput=5617 tok/s 2025-11-25 06:04:44,426 - INFO - Epoch 1 Step 2210 (Global: 2210): loss=1.7161, ppl=5.56, grad_norm=0.82, lr=9.62e-05, throughput=5642 tok/s 2025-11-25 06:06:09,941 - INFO - Epoch 1 Step 2220 (Global: 2220): loss=1.8068, ppl=6.09, grad_norm=0.86, lr=9.61e-05, throughput=5613 tok/s 2025-11-25 06:07:34,858 - INFO - Epoch 1 Step 2230 (Global: 2230): loss=1.8079, ppl=6.10, grad_norm=0.90, lr=9.61e-05, throughput=5653 tok/s 2025-11-25 06:09:00,103 - INFO - Epoch 1 Step 2240 (Global: 2240): loss=1.8680, ppl=6.48, grad_norm=0.90, lr=9.60e-05, throughput=5631 tok/s 2025-11-25 06:10:25,312 - INFO - Epoch 1 Step 2250 
(Global: 2250): loss=1.9221, ppl=6.84, grad_norm=0.98, lr=9.60e-05, throughput=5633 tok/s 2025-11-25 06:11:50,533 - INFO - Epoch 1 Step 2260 (Global: 2260): loss=1.6242, ppl=5.07, grad_norm=0.86, lr=9.59e-05, throughput=5632 tok/s 2025-11-25 06:13:15,772 - INFO - Epoch 1 Step 2270 (Global: 2270): loss=1.9041, ppl=6.71, grad_norm=0.89, lr=9.58e-05, throughput=5631 tok/s 2025-11-25 06:14:41,654 - INFO - Epoch 1 Step 2280 (Global: 2280): loss=1.9398, ppl=6.96, grad_norm=0.87, lr=9.58e-05, throughput=5589 tok/s 2025-11-25 06:16:07,047 - INFO - Epoch 1 Step 2290 (Global: 2290): loss=1.6327, ppl=5.12, grad_norm=0.79, lr=9.57e-05, throughput=5621 tok/s 2025-11-25 06:17:32,933 - INFO - Epoch 1 Step 2300 (Global: 2300): loss=1.7472, ppl=5.74, grad_norm=0.91, lr=9.56e-05, throughput=5589 tok/s 2025-11-25 06:18:58,429 - INFO - Epoch 1 Step 2310 (Global: 2310): loss=1.8766, ppl=6.53, grad_norm=0.89, lr=9.55e-05, throughput=5614 tok/s 2025-11-25 06:20:23,881 - INFO - Epoch 1 Step 2320 (Global: 2320): loss=1.7727, ppl=5.89, grad_norm=0.84, lr=9.55e-05, throughput=5617 tok/s 2025-11-25 06:21:49,773 - INFO - Epoch 1 Step 2330 (Global: 2330): loss=1.8637, ppl=6.45, grad_norm=0.91, lr=9.54e-05, throughput=5588 tok/s 2025-11-25 06:23:15,234 - INFO - Epoch 1 Step 2340 (Global: 2340): loss=1.7046, ppl=5.50, grad_norm=0.85, lr=9.53e-05, throughput=5617 tok/s 2025-11-25 06:24:40,723 - INFO - Epoch 1 Step 2350 (Global: 2350): loss=1.7105, ppl=5.53, grad_norm=0.82, lr=9.53e-05, throughput=5615 tok/s 2025-11-25 06:26:06,346 - INFO - Epoch 1 Step 2360 (Global: 2360): loss=1.6439, ppl=5.18, grad_norm=0.88, lr=9.52e-05, throughput=5606 tok/s 2025-11-25 06:27:31,962 - INFO - Epoch 1 Step 2370 (Global: 2370): loss=1.8425, ppl=6.31, grad_norm=0.89, lr=9.51e-05, throughput=5606 tok/s 2025-11-25 06:28:56,995 - INFO - Epoch 1 Step 2380 (Global: 2380): loss=1.8300, ppl=6.23, grad_norm=0.88, lr=9.51e-05, throughput=5645 tok/s 2025-11-25 06:30:22,544 - INFO - Epoch 1 Step 2390 (Global: 2390): 
loss=1.9113, ppl=6.76, grad_norm=0.85, lr=9.50e-05, throughput=5611 tok/s 2025-11-25 06:31:48,326 - INFO - Epoch 1 Step 2400 (Global: 2400): loss=1.6540, ppl=5.23, grad_norm=0.82, lr=9.49e-05, throughput=5596 tok/s 2025-11-25 06:33:13,983 - INFO - Epoch 1 Step 2410 (Global: 2410): loss=1.7148, ppl=5.56, grad_norm=0.86, lr=9.48e-05, throughput=5604 tok/s 2025-11-25 06:34:39,446 - INFO - Epoch 1 Step 2420 (Global: 2420): loss=1.6606, ppl=5.26, grad_norm=0.83, lr=9.48e-05, throughput=5616 tok/s 2025-11-25 06:36:04,871 - INFO - Epoch 1 Step 2430 (Global: 2430): loss=1.7477, ppl=5.74, grad_norm=0.82, lr=9.47e-05, throughput=5619 tok/s 2025-11-25 06:37:30,379 - INFO - Epoch 1 Step 2440 (Global: 2440): loss=1.5240, ppl=4.59, grad_norm=0.79, lr=9.46e-05, throughput=5614 tok/s 2025-11-25 06:38:55,936 - INFO - Epoch 1 Step 2450 (Global: 2450): loss=1.7176, ppl=5.57, grad_norm=0.79, lr=9.45e-05, throughput=5610 tok/s 2025-11-25 06:40:21,467 - INFO - Epoch 1 Step 2460 (Global: 2460): loss=1.7681, ppl=5.86, grad_norm=0.85, lr=9.45e-05, throughput=5612 tok/s 2025-11-25 06:41:47,099 - INFO - Epoch 1 Step 2470 (Global: 2470): loss=1.9410, ppl=6.97, grad_norm=0.88, lr=9.44e-05, throughput=5605 tok/s 2025-11-25 06:43:12,378 - INFO - Epoch 1 Step 2480 (Global: 2480): loss=1.8836, ppl=6.58, grad_norm=0.87, lr=9.43e-05, throughput=5629 tok/s 2025-11-25 06:44:37,905 - INFO - Epoch 1 Step 2490 (Global: 2490): loss=1.5592, ppl=4.75, grad_norm=0.80, lr=9.42e-05, throughput=5612 tok/s 2025-11-25 06:46:03,311 - INFO - Epoch 1 Step 2500 (Global: 2500): loss=1.7732, ppl=5.89, grad_norm=0.85, lr=9.41e-05, throughput=5620 tok/s 2025-11-25 06:47:28,870 - INFO - Epoch 1 Step 2510 (Global: 2510): loss=1.5417, ppl=4.67, grad_norm=0.85, lr=9.41e-05, throughput=5610 tok/s 2025-11-25 06:48:54,076 - INFO - Epoch 1 Step 2520 (Global: 2520): loss=1.7764, ppl=5.91, grad_norm=0.83, lr=9.40e-05, throughput=5633 tok/s 2025-11-25 06:50:19,273 - INFO - Epoch 1 Step 2530 (Global: 2530): loss=1.7286, ppl=5.63, 
grad_norm=0.88, lr=9.39e-05, throughput=5634 tok/s 2025-11-25 06:51:44,487 - INFO - Epoch 1 Step 2540 (Global: 2540): loss=1.7173, ppl=5.57, grad_norm=0.86, lr=9.38e-05, throughput=5633 tok/s 2025-11-25 06:53:09,421 - INFO - Epoch 1 Step 2550 (Global: 2550): loss=1.8001, ppl=6.05, grad_norm=0.84, lr=9.37e-05, throughput=5652 tok/s 2025-11-25 06:54:34,742 - INFO - Epoch 1 Step 2560 (Global: 2560): loss=1.8055, ppl=6.08, grad_norm=0.83, lr=9.37e-05, throughput=5626 tok/s 2025-11-25 06:56:00,178 - INFO - Epoch 1 Step 2570 (Global: 2570): loss=1.5133, ppl=4.54, grad_norm=0.77, lr=9.36e-05, throughput=5618 tok/s 2025-11-25 06:57:25,677 - INFO - Epoch 1 Step 2580 (Global: 2580): loss=1.9955, ppl=7.36, grad_norm=0.86, lr=9.35e-05, throughput=5614 tok/s 2025-11-25 06:58:51,148 - INFO - Epoch 1 Step 2590 (Global: 2590): loss=1.9100, ppl=6.75, grad_norm=0.89, lr=9.34e-05, throughput=5616 tok/s 2025-11-25 07:00:16,284 - INFO - Epoch 1 Step 2600 (Global: 2600): loss=1.8010, ppl=6.06, grad_norm=1.10, lr=9.33e-05, throughput=5638 tok/s 2025-11-25 07:01:41,863 - INFO - Epoch 1 Step 2610 (Global: 2610): loss=1.8540, ppl=6.39, grad_norm=0.84, lr=9.32e-05, throughput=5609 tok/s 2025-11-25 07:03:07,453 - INFO - Epoch 1 Step 2620 (Global: 2620): loss=1.6314, ppl=5.11, grad_norm=0.97, lr=9.32e-05, throughput=5608 tok/s 2025-11-25 07:04:33,092 - INFO - Epoch 1 Step 2630 (Global: 2630): loss=1.5286, ppl=4.61, grad_norm=0.91, lr=9.31e-05, throughput=5605 tok/s 2025-11-25 07:05:58,372 - INFO - Epoch 1 Step 2640 (Global: 2640): loss=1.6432, ppl=5.17, grad_norm=0.82, lr=9.30e-05, throughput=5629 tok/s 2025-11-25 07:07:23,971 - INFO - Epoch 1 Step 2650 (Global: 2650): loss=1.7133, ppl=5.55, grad_norm=0.86, lr=9.29e-05, throughput=5608 tok/s 2025-11-25 07:08:49,377 - INFO - Epoch 1 Step 2660 (Global: 2660): loss=1.8330, ppl=6.25, grad_norm=1.21, lr=9.28e-05, throughput=5620 tok/s 2025-11-25 07:10:14,946 - INFO - Epoch 1 Step 2670 (Global: 2670): loss=1.7252, ppl=5.61, grad_norm=0.81, 
lr=9.27e-05, throughput=5610 tok/s 2025-11-25 07:11:40,518 - INFO - Epoch 1 Step 2680 (Global: 2680): loss=1.5824, ppl=4.87, grad_norm=0.88, lr=9.26e-05, throughput=5609 tok/s 2025-11-25 07:13:06,337 - INFO - Epoch 1 Step 2690 (Global: 2690): loss=1.9637, ppl=7.13, grad_norm=0.91, lr=9.26e-05, throughput=5593 tok/s 2025-11-25 07:14:32,226 - INFO - Epoch 1 Step 2700 (Global: 2700): loss=1.5870, ppl=4.89, grad_norm=0.86, lr=9.25e-05, throughput=5589 tok/s 2025-11-25 07:15:57,678 - INFO - Epoch 1 Step 2710 (Global: 2710): loss=1.6953, ppl=5.45, grad_norm=0.80, lr=9.24e-05, throughput=5617 tok/s 2025-11-25 07:17:23,051 - INFO - Epoch 1 Step 2720 (Global: 2720): loss=1.8511, ppl=6.37, grad_norm=0.88, lr=9.23e-05, throughput=5622 tok/s 2025-11-25 07:18:48,539 - INFO - Epoch 1 Step 2730 (Global: 2730): loss=1.7794, ppl=5.93, grad_norm=0.88, lr=9.22e-05, throughput=5615 tok/s 2025-11-25 07:20:13,829 - INFO - Epoch 1 Step 2740 (Global: 2740): loss=1.7116, ppl=5.54, grad_norm=0.80, lr=9.21e-05, throughput=5628 tok/s 2025-11-25 07:21:39,488 - INFO - Epoch 1 Step 2750 (Global: 2750): loss=1.7277, ppl=5.63, grad_norm=0.79, lr=9.20e-05, throughput=5604 tok/s 2025-11-25 07:23:05,118 - INFO - Epoch 1 Step 2760 (Global: 2760): loss=1.6850, ppl=5.39, grad_norm=0.78, lr=9.19e-05, throughput=5606 tok/s 2025-11-25 07:24:30,699 - INFO - Epoch 1 Step 2770 (Global: 2770): loss=1.7564, ppl=5.79, grad_norm=0.84, lr=9.18e-05, throughput=5609 tok/s 2025-11-25 07:25:56,399 - INFO - Epoch 1 Step 2780 (Global: 2780): loss=1.9123, ppl=6.77, grad_norm=0.89, lr=9.17e-05, throughput=5601 tok/s 2025-11-25 07:27:22,016 - INFO - Epoch 1 Step 2790 (Global: 2790): loss=1.9135, ppl=6.78, grad_norm=0.86, lr=9.17e-05, throughput=5606 tok/s 2025-11-25 07:28:47,492 - INFO - Epoch 1 Step 2800 (Global: 2800): loss=1.8626, ppl=6.44, grad_norm=0.87, lr=9.16e-05, throughput=5616 tok/s 2025-11-25 07:30:13,127 - INFO - Epoch 1 Step 2810 (Global: 2810): loss=1.7163, ppl=5.56, grad_norm=0.83, lr=9.15e-05, 
throughput=5605 tok/s 2025-11-25 07:31:38,438 - INFO - Epoch 1 Step 2820 (Global: 2820): loss=1.7185, ppl=5.58, grad_norm=0.86, lr=9.14e-05, throughput=5626 tok/s 2025-11-25 07:33:03,573 - INFO - Epoch 1 Step 2830 (Global: 2830): loss=1.5456, ppl=4.69, grad_norm=0.81, lr=9.13e-05, throughput=5638 tok/s 2025-11-25 07:34:29,138 - INFO - Epoch 1 Step 2840 (Global: 2840): loss=1.8886, ppl=6.61, grad_norm=0.88, lr=9.12e-05, throughput=5610 tok/s 2025-11-25 07:35:54,429 - INFO - Epoch 1 Step 2850 (Global: 2850): loss=1.6860, ppl=5.40, grad_norm=0.79, lr=9.11e-05, throughput=5628 tok/s 2025-11-25 07:37:19,802 - INFO - Epoch 1 Step 2860 (Global: 2860): loss=1.7495, ppl=5.75, grad_norm=0.83, lr=9.10e-05, throughput=5622 tok/s 2025-11-25 07:38:45,281 - INFO - Epoch 1 Step 2870 (Global: 2870): loss=1.9410, ppl=6.97, grad_norm=0.86, lr=9.09e-05, throughput=5615 tok/s 2025-11-25 07:40:10,668 - INFO - Epoch 1 Step 2880 (Global: 2880): loss=1.5765, ppl=4.84, grad_norm=0.83, lr=9.08e-05, throughput=5622 tok/s 2025-11-25 07:41:36,076 - INFO - Epoch 1 Step 2890 (Global: 2890): loss=1.8185, ppl=6.16, grad_norm=0.85, lr=9.07e-05, throughput=5620 tok/s 2025-11-25 07:43:01,370 - INFO - Epoch 1 Step 2900 (Global: 2900): loss=1.7876, ppl=5.98, grad_norm=0.78, lr=9.06e-05, throughput=5628 tok/s 2025-11-25 07:44:26,588 - INFO - Epoch 1 Step 2910 (Global: 2910): loss=1.7054, ppl=5.50, grad_norm=0.78, lr=9.05e-05, throughput=5633 tok/s 2025-11-25 07:45:51,916 - INFO - Epoch 1 Step 2920 (Global: 2920): loss=1.8243, ppl=6.20, grad_norm=0.79, lr=9.04e-05, throughput=5625 tok/s 2025-11-25 07:47:17,185 - INFO - Epoch 1 Step 2930 (Global: 2930): loss=1.7070, ppl=5.51, grad_norm=0.83, lr=9.03e-05, throughput=5629 tok/s 2025-11-25 07:48:42,544 - INFO - Epoch 1 Step 2940 (Global: 2940): loss=1.6527, ppl=5.22, grad_norm=0.79, lr=9.02e-05, throughput=5623 tok/s 2025-11-25 07:50:07,987 - INFO - Epoch 1 Step 2950 (Global: 2950): loss=1.7534, ppl=5.77, grad_norm=0.87, lr=9.01e-05, throughput=5618 tok/s 
2025-11-25 07:51:33,559 - INFO - Epoch 1 Step 2960 (Global: 2960): loss=1.7610, ppl=5.82, grad_norm=0.89, lr=9.00e-05, throughput=5609 tok/s 2025-11-25 07:52:58,939 - INFO - Epoch 1 Step 2970 (Global: 2970): loss=1.8890, ppl=6.61, grad_norm=0.87, lr=8.99e-05, throughput=5622 tok/s 2025-11-25 07:54:24,351 - INFO - Epoch 1 Step 2980 (Global: 2980): loss=1.5419, ppl=4.67, grad_norm=0.77, lr=8.98e-05, throughput=5620 tok/s 2025-11-25 07:55:49,710 - INFO - Epoch 1 Step 2990 (Global: 2990): loss=1.8618, ppl=6.44, grad_norm=0.80, lr=8.97e-05, throughput=5623 tok/s 2025-11-25 07:57:14,910 - INFO - Epoch 1 Step 3000 (Global: 3000): loss=1.5175, ppl=4.56, grad_norm=0.80, lr=8.96e-05, throughput=5634 tok/s 2025-11-25 07:58:40,254 - INFO - Epoch 1 Step 3010 (Global: 3010): loss=1.6520, ppl=5.22, grad_norm=0.80, lr=8.95e-05, throughput=5624 tok/s 2025-11-25 08:00:05,582 - INFO - Epoch 1 Step 3020 (Global: 3020): loss=1.6291, ppl=5.10, grad_norm=0.84, lr=8.94e-05, throughput=5625 tok/s 2025-11-25 08:01:30,878 - INFO - Epoch 1 Step 3030 (Global: 3030): loss=1.6958, ppl=5.45, grad_norm=0.83, lr=8.93e-05, throughput=5628 tok/s 2025-11-25 08:02:56,426 - INFO - Epoch 1 Step 3040 (Global: 3040): loss=1.8161, ppl=6.15, grad_norm=0.82, lr=8.92e-05, throughput=5611 tok/s 2025-11-25 08:04:21,721 - INFO - Epoch 1 Step 3050 (Global: 3050): loss=1.6153, ppl=5.03, grad_norm=0.76, lr=8.91e-05, throughput=5628 tok/s 2025-11-25 08:05:47,227 - INFO - Epoch 1 Step 3060 (Global: 3060): loss=1.6548, ppl=5.23, grad_norm=0.80, lr=8.90e-05, throughput=5614 tok/s 2025-11-25 08:07:12,622 - INFO - Epoch 1 Step 3070 (Global: 3070): loss=1.6244, ppl=5.08, grad_norm=0.80, lr=8.89e-05, throughput=5621 tok/s 2025-11-25 08:08:38,040 - INFO - Epoch 1 Step 3080 (Global: 3080): loss=1.6376, ppl=5.14, grad_norm=0.80, lr=8.88e-05, throughput=5620 tok/s 2025-11-25 08:10:03,555 - INFO - Epoch 1 Step 3090 (Global: 3090): loss=1.6172, ppl=5.04, grad_norm=0.79, lr=8.87e-05, throughput=5613 tok/s 2025-11-25 08:11:29,016 - 
INFO - Epoch 1 Step 3100 (Global: 3100): loss=1.9017, ppl=6.70, grad_norm=0.86, lr=8.86e-05, throughput=5617 tok/s 2025-11-25 08:12:54,510 - INFO - Epoch 1 Step 3110 (Global: 3110): loss=1.9429, ppl=6.98, grad_norm=0.95, lr=8.85e-05, throughput=5614 tok/s 2025-11-25 08:14:20,223 - INFO - Epoch 1 Step 3120 (Global: 3120): loss=1.4811, ppl=4.40, grad_norm=0.76, lr=8.84e-05, throughput=5600 tok/s 2025-11-25 08:15:45,837 - INFO - Epoch 1 Step 3130 (Global: 3130): loss=1.5052, ppl=4.51, grad_norm=0.80, lr=8.82e-05, throughput=5607 tok/s 2025-11-25 08:17:11,430 - INFO - Epoch 1 Step 3140 (Global: 3140): loss=1.5525, ppl=4.72, grad_norm=0.74, lr=8.81e-05, throughput=5608 tok/s 2025-11-25 08:18:36,960 - INFO - Epoch 1 Step 3150 (Global: 3150): loss=1.7481, ppl=5.74, grad_norm=0.80, lr=8.80e-05, throughput=5612 tok/s 2025-11-25 08:20:02,725 - INFO - Epoch 1 Step 3160 (Global: 3160): loss=1.8001, ppl=6.05, grad_norm=0.83, lr=8.79e-05, throughput=5597 tok/s 2025-11-25 08:21:28,140 - INFO - Epoch 1 Step 3170 (Global: 3170): loss=1.8581, ppl=6.41, grad_norm=0.93, lr=8.78e-05, throughput=5620 tok/s 2025-11-25 08:22:53,631 - INFO - Epoch 1 Step 3180 (Global: 3180): loss=1.8042, ppl=6.08, grad_norm=0.83, lr=8.77e-05, throughput=5615 tok/s 2025-11-25 08:24:19,233 - INFO - Epoch 1 Step 3190 (Global: 3190): loss=1.8239, ppl=6.20, grad_norm=0.79, lr=8.76e-05, throughput=5607 tok/s 2025-11-25 08:25:44,730 - INFO - Epoch 1 Step 3200 (Global: 3200): loss=1.6374, ppl=5.14, grad_norm=0.77, lr=8.75e-05, throughput=5614 tok/s 2025-11-25 08:27:10,133 - INFO - Epoch 1 Step 3210 (Global: 3210): loss=1.7026, ppl=5.49, grad_norm=0.79, lr=8.74e-05, throughput=5620 tok/s 2025-11-25 08:28:35,425 - INFO - Epoch 1 Step 3220 (Global: 3220): loss=1.5529, ppl=4.73, grad_norm=0.81, lr=8.73e-05, throughput=5628 tok/s 2025-11-25 08:30:00,750 - INFO - Epoch 1 Step 3230 (Global: 3230): loss=1.8924, ppl=6.64, grad_norm=0.86, lr=8.71e-05, throughput=5626 tok/s 2025-11-25 08:31:25,899 - INFO - Epoch 1 Step 3240 
(Global: 3240): loss=1.3951, ppl=4.04, grad_norm=0.79, lr=8.70e-05, throughput=5637 tok/s 2025-11-25 08:32:51,101 - INFO - Epoch 1 Step 3250 (Global: 3250): loss=1.5087, ppl=4.52, grad_norm=0.79, lr=8.69e-05, throughput=5634 tok/s 2025-11-25 08:34:16,394 - INFO - Epoch 1 Step 3260 (Global: 3260): loss=1.6425, ppl=5.17, grad_norm=0.80, lr=8.68e-05, throughput=5628 tok/s 2025-11-25 08:35:41,611 - INFO - Epoch 1 Step 3270 (Global: 3270): loss=1.5754, ppl=4.83, grad_norm=0.85, lr=8.67e-05, throughput=5633 tok/s 2025-11-25 08:37:06,950 - INFO - Epoch 1 Step 3280 (Global: 3280): loss=1.8647, ppl=6.45, grad_norm=0.79, lr=8.66e-05, throughput=5625 tok/s 2025-11-25 08:38:31,965 - INFO - Epoch 1 Step 3290 (Global: 3290): loss=1.6244, ppl=5.08, grad_norm=0.83, lr=8.65e-05, throughput=5646 tok/s 2025-11-25 08:39:57,335 - INFO - Epoch 1 Step 3300 (Global: 3300): loss=1.8700, ppl=6.49, grad_norm=0.80, lr=8.63e-05, throughput=5623 tok/s 2025-11-25 08:41:22,767 - INFO - Epoch 1 Step 3310 (Global: 3310): loss=1.7535, ppl=5.78, grad_norm=0.79, lr=8.62e-05, throughput=5619 tok/s 2025-11-25 08:42:48,000 - INFO - Epoch 1 Step 3320 (Global: 3320): loss=1.6899, ppl=5.42, grad_norm=0.83, lr=8.61e-05, throughput=5632 tok/s 2025-11-25 08:44:13,179 - INFO - Epoch 1 Step 3330 (Global: 3330): loss=1.7067, ppl=5.51, grad_norm=0.81, lr=8.60e-05, throughput=5635 tok/s 2025-11-25 08:45:38,755 - INFO - Epoch 1 Step 3340 (Global: 3340): loss=1.7654, ppl=5.84, grad_norm=0.79, lr=8.59e-05, throughput=5609 tok/s 2025-11-25 08:47:04,020 - INFO - Epoch 1 Step 3350 (Global: 3350): loss=1.5397, ppl=4.66, grad_norm=0.75, lr=8.58e-05, throughput=5630 tok/s 2025-11-25 08:48:29,016 - INFO - Epoch 1 Step 3360 (Global: 3360): loss=1.8912, ppl=6.63, grad_norm=0.82, lr=8.57e-05, throughput=5647 tok/s 2025-11-25 08:49:54,324 - INFO - Epoch 1 Step 3370 (Global: 3370): loss=1.8350, ppl=6.27, grad_norm=0.83, lr=8.55e-05, throughput=5627 tok/s 2025-11-25 08:51:19,558 - INFO - Epoch 1 Step 3380 (Global: 3380): 
loss=1.5150, ppl=4.55, grad_norm=0.78, lr=8.54e-05, throughput=5632 tok/s 2025-11-25 08:52:44,993 - INFO - Epoch 1 Step 3390 (Global: 3390): loss=1.5484, ppl=4.70, grad_norm=0.77, lr=8.53e-05, throughput=5618 tok/s 2025-11-25 08:54:10,437 - INFO - Epoch 1 Step 3400 (Global: 3400): loss=1.7250, ppl=5.61, grad_norm=0.80, lr=8.52e-05, throughput=5618 tok/s 2025-11-25 08:55:35,781 - INFO - Epoch 1 Step 3410 (Global: 3410): loss=1.7887, ppl=5.98, grad_norm=0.80, lr=8.51e-05, throughput=5624 tok/s 2025-11-25 08:57:01,159 - INFO - Epoch 1 Step 3420 (Global: 3420): loss=1.8730, ppl=6.51, grad_norm=0.82, lr=8.49e-05, throughput=5622 tok/s 2025-11-25 08:58:26,693 - INFO - Epoch 1 Step 3430 (Global: 3430): loss=1.5435, ppl=4.68, grad_norm=0.79, lr=8.48e-05, throughput=5612 tok/s 2025-11-25 08:59:51,900 - INFO - Epoch 1 Step 3440 (Global: 3440): loss=1.7891, ppl=5.98, grad_norm=0.78, lr=8.47e-05, throughput=5633 tok/s 2025-11-25 09:01:17,223 - INFO - Epoch 1 Step 3450 (Global: 3450): loss=1.6968, ppl=5.46, grad_norm=0.79, lr=8.46e-05, throughput=5626 tok/s 2025-11-25 09:02:42,504 - INFO - Epoch 1 Step 3460 (Global: 3460): loss=1.6418, ppl=5.16, grad_norm=0.76, lr=8.45e-05, throughput=5629 tok/s 2025-11-25 09:04:07,741 - INFO - Epoch 1 Step 3470 (Global: 3470): loss=1.7790, ppl=5.92, grad_norm=0.80, lr=8.43e-05, throughput=5631 tok/s 2025-11-25 09:05:33,102 - INFO - Epoch 1 Step 3480 (Global: 3480): loss=1.6562, ppl=5.24, grad_norm=0.87, lr=8.42e-05, throughput=5623 tok/s 2025-11-25 09:06:58,699 - INFO - Epoch 1 Step 3490 (Global: 3490): loss=1.7129, ppl=5.54, grad_norm=0.80, lr=8.41e-05, throughput=5608 tok/s 2025-11-25 09:08:23,898 - INFO - Epoch 1 Step 3500 (Global: 3500): loss=1.5556, ppl=4.74, grad_norm=0.82, lr=8.40e-05, throughput=5634 tok/s 2025-11-25 09:09:49,649 - INFO - Epoch 1 Step 3510 (Global: 3510): loss=1.6575, ppl=5.25, grad_norm=0.88, lr=8.38e-05, throughput=5598 tok/s 2025-11-25 09:11:15,144 - INFO - Epoch 1 Step 3520 (Global: 3520): loss=1.6282, ppl=5.09, 
grad_norm=0.80, lr=8.37e-05, throughput=5614 tok/s 2025-11-25 09:12:40,434 - INFO - Epoch 1 Step 3530 (Global: 3530): loss=1.9275, ppl=6.87, grad_norm=0.84, lr=8.36e-05, throughput=5628 tok/s 2025-11-25 09:14:05,606 - INFO - Epoch 1 Step 3540 (Global: 3540): loss=1.7241, ppl=5.61, grad_norm=0.79, lr=8.35e-05, throughput=5636 tok/s 2025-11-25 09:15:30,904 - INFO - Epoch 1 Step 3550 (Global: 3550): loss=1.8276, ppl=6.22, grad_norm=0.80, lr=8.33e-05, throughput=5627 tok/s 2025-11-25 09:16:56,166 - INFO - Epoch 1 Step 3560 (Global: 3560): loss=1.6215, ppl=5.06, grad_norm=0.76, lr=8.32e-05, throughput=5630 tok/s 2025-11-25 09:18:21,654 - INFO - Epoch 1 Step 3570 (Global: 3570): loss=1.6981, ppl=5.46, grad_norm=0.81, lr=8.31e-05, throughput=5615 tok/s 2025-11-25 09:19:47,182 - INFO - Epoch 1 Step 3580 (Global: 3580): loss=1.5547, ppl=4.73, grad_norm=0.74, lr=8.30e-05, throughput=5612 tok/s 2025-11-25 09:21:12,699 - INFO - Epoch 1 Step 3590 (Global: 3590): loss=1.7488, ppl=5.75, grad_norm=0.92, lr=8.28e-05, throughput=5613 tok/s 2025-11-25 09:22:37,830 - INFO - Epoch 1 Step 3600 (Global: 3600): loss=1.9831, ppl=7.27, grad_norm=0.83, lr=8.27e-05, throughput=5638 tok/s 2025-11-25 09:24:03,365 - INFO - Epoch 1 Step 3610 (Global: 3610): loss=1.8476, ppl=6.34, grad_norm=0.85, lr=8.26e-05, throughput=5612 tok/s 2025-11-25 09:25:28,900 - INFO - Epoch 1 Step 3620 (Global: 3620): loss=1.6584, ppl=5.25, grad_norm=0.77, lr=8.25e-05, throughput=5612 tok/s 2025-11-25 09:26:54,345 - INFO - Epoch 1 Step 3630 (Global: 3630): loss=1.8267, ppl=6.21, grad_norm=0.80, lr=8.23e-05, throughput=5618 tok/s 2025-11-25 09:28:19,653 - INFO - Epoch 1 Step 3640 (Global: 3640): loss=1.7252, ppl=5.61, grad_norm=0.81, lr=8.22e-05, throughput=5627 tok/s 2025-11-25 09:29:45,017 - INFO - Epoch 1 Step 3650 (Global: 3650): loss=1.6242, ppl=5.07, grad_norm=0.75, lr=8.21e-05, throughput=5623 tok/s 2025-11-25 09:31:10,275 - INFO - Epoch 1 Step 3660 (Global: 3660): loss=1.7794, ppl=5.93, grad_norm=0.80, 
lr=8.20e-05, throughput=5630 tok/s 2025-11-25 09:32:35,531 - INFO - Epoch 1 Step 3670 (Global: 3670): loss=1.6125, ppl=5.02, grad_norm=0.77, lr=8.18e-05, throughput=5630 tok/s 2025-11-25 09:34:00,965 - INFO - Epoch 1 Step 3680 (Global: 3680): loss=1.8872, ppl=6.60, grad_norm=0.80, lr=8.17e-05, throughput=5618 tok/s 2025-11-25 09:35:26,332 - INFO - Epoch 1 Step 3690 (Global: 3690): loss=1.7317, ppl=5.65, grad_norm=0.79, lr=8.16e-05, throughput=5623 tok/s 2025-11-25 09:36:51,470 - INFO - Epoch 1 Step 3700 (Global: 3700): loss=1.4693, ppl=4.35, grad_norm=0.89, lr=8.14e-05, throughput=5638 tok/s 2025-11-25 09:38:16,320 - INFO - Epoch 1 Step 3710 (Global: 3710): loss=1.6786, ppl=5.36, grad_norm=0.77, lr=8.13e-05, throughput=5657 tok/s 2025-11-25 09:39:41,458 - INFO - Epoch 1 Step 3720 (Global: 3720): loss=1.4800, ppl=4.39, grad_norm=0.76, lr=8.12e-05, throughput=5638 tok/s 2025-11-25 09:41:06,724 - INFO - Epoch 1 Step 3730 (Global: 3730): loss=1.6805, ppl=5.37, grad_norm=0.91, lr=8.10e-05, throughput=5630 tok/s 2025-11-25 09:42:32,080 - INFO - Epoch 1 Step 3740 (Global: 3740): loss=1.6914, ppl=5.43, grad_norm=0.90, lr=8.09e-05, throughput=5624 tok/s 2025-11-25 09:43:57,117 - INFO - Epoch 1 Step 3750 (Global: 3750): loss=1.7699, ppl=5.87, grad_norm=0.82, lr=8.08e-05, throughput=5645 tok/s 2025-11-25 09:45:22,607 - INFO - Epoch 1 Step 3760 (Global: 3760): loss=1.8306, ppl=6.24, grad_norm=0.81, lr=8.06e-05, throughput=5615 tok/s 2025-11-25 09:46:47,859 - INFO - Epoch 1 Step 3770 (Global: 3770): loss=1.4226, ppl=4.15, grad_norm=0.73, lr=8.05e-05, throughput=5630 tok/s 2025-11-25 09:48:12,742 - INFO - Epoch 1 Step 3780 (Global: 3780): loss=1.8056, ppl=6.08, grad_norm=0.77, lr=8.04e-05, throughput=5655 tok/s 2025-11-25 09:49:37,826 - INFO - Epoch 1 Step 3790 (Global: 3790): loss=1.7147, ppl=5.56, grad_norm=0.84, lr=8.02e-05, throughput=5642 tok/s 2025-11-25 09:51:03,057 - INFO - Epoch 1 Step 3800 (Global: 3800): loss=1.5798, ppl=4.85, grad_norm=0.90, lr=8.01e-05, 
throughput=5632 tok/s 2025-11-25 09:52:28,145 - INFO - Epoch 1 Step 3810 (Global: 3810): loss=1.6436, ppl=5.17, grad_norm=0.80, lr=8.00e-05, throughput=5641 tok/s 2025-11-25 09:53:53,042 - INFO - Epoch 1 Step 3820 (Global: 3820): loss=1.6763, ppl=5.35, grad_norm=0.81, lr=7.98e-05, throughput=5654 tok/s 2025-11-25 09:55:18,356 - INFO - Epoch 1 Step 3830 (Global: 3830): loss=1.4641, ppl=4.32, grad_norm=0.75, lr=7.97e-05, throughput=5626 tok/s 2025-11-25 09:56:43,467 - INFO - Epoch 1 Step 3840 (Global: 3840): loss=1.6474, ppl=5.19, grad_norm=0.83, lr=7.96e-05, throughput=5640 tok/s 2025-11-25 09:58:08,265 - INFO - Epoch 1 Step 3850 (Global: 3850): loss=1.8154, ppl=6.14, grad_norm=0.76, lr=7.94e-05, throughput=5661 tok/s 2025-11-25 09:59:33,402 - INFO - Epoch 1 Step 3860 (Global: 3860): loss=1.6564, ppl=5.24, grad_norm=0.83, lr=7.93e-05, throughput=5638 tok/s 2025-11-25 10:00:58,476 - INFO - Epoch 1 Step 3870 (Global: 3870): loss=1.7443, ppl=5.72, grad_norm=0.81, lr=7.92e-05, throughput=5642 tok/s 2025-11-25 10:02:23,741 - INFO - Epoch 1 Step 3880 (Global: 3880): loss=1.5397, ppl=4.66, grad_norm=0.87, lr=7.90e-05, throughput=5630 tok/s 2025-11-25 10:03:49,053 - INFO - Epoch 1 Step 3890 (Global: 3890): loss=1.6971, ppl=5.46, grad_norm=0.83, lr=7.89e-05, throughput=5626 tok/s 2025-11-25 10:05:14,962 - INFO - Epoch 1 Step 3900 (Global: 3900): loss=1.7732, ppl=5.89, grad_norm=0.77, lr=7.88e-05, throughput=5587 tok/s 2025-11-25 10:06:40,365 - INFO - Epoch 1 Step 3910 (Global: 3910): loss=1.7030, ppl=5.49, grad_norm=0.84, lr=7.86e-05, throughput=5620 tok/s 2025-11-25 10:08:06,529 - INFO - Epoch 1 Step 3920 (Global: 3920): loss=1.7164, ppl=5.56, grad_norm=0.76, lr=7.85e-05, throughput=5571 tok/s 2025-11-25 10:09:31,975 - INFO - Epoch 1 Step 3930 (Global: 3930): loss=1.4968, ppl=4.47, grad_norm=0.78, lr=7.83e-05, throughput=5618 tok/s 2025-11-25 10:10:57,608 - INFO - Epoch 1 Step 3940 (Global: 3940): loss=1.6709, ppl=5.32, grad_norm=0.82, lr=7.82e-05, throughput=5605 tok/s 
2025-11-25 10:12:23,381 - INFO - Epoch 1 Step 3950 (Global: 3950): loss=1.5737, ppl=4.82, grad_norm=0.80, lr=7.81e-05, throughput=5596 tok/s 2025-11-25 10:13:48,815 - INFO - Epoch 1 Step 3960 (Global: 3960): loss=1.6819, ppl=5.38, grad_norm=0.77, lr=7.79e-05, throughput=5618 tok/s 2025-11-25 10:15:14,253 - INFO - Epoch 1 Step 3970 (Global: 3970): loss=1.6219, ppl=5.06, grad_norm=0.80, lr=7.78e-05, throughput=5618 tok/s 2025-11-25 10:16:39,753 - INFO - Epoch 1 Step 3980 (Global: 3980): loss=1.9807, ppl=7.25, grad_norm=0.82, lr=7.77e-05, throughput=5614 tok/s 2025-11-25 10:18:05,215 - INFO - Epoch 1 Step 3990 (Global: 3990): loss=1.6599, ppl=5.26, grad_norm=0.78, lr=7.75e-05, throughput=5617 tok/s 2025-11-25 10:19:30,850 - INFO - Epoch 1 Step 4000 (Global: 4000): loss=1.6043, ppl=4.97, grad_norm=0.75, lr=7.74e-05, throughput=5605 tok/s 2025-11-25 10:19:30,851 - INFO - Running validation at step 4000... 2025-11-25 10:23:35,017 - INFO - Validation loss: 1.7096, perplexity: 5.53 2025-11-25 10:23:59,973 - INFO - Saved checkpoint to outputs/production_text_ctx277_lm_20251125_003839/best_checkpoint.pt 2025-11-25 10:23:59,980 - INFO - New best validation loss: 1.7096, perplexity: 5.53 2025-11-25 10:25:25,740 - INFO - Epoch 1 Step 4010 (Global: 4010): loss=1.7749, ppl=5.90, grad_norm=0.77, lr=7.72e-05, throughput=5598 tok/s 2025-11-25 10:26:51,133 - INFO - Epoch 1 Step 4020 (Global: 4020): loss=1.7382, ppl=5.69, grad_norm=0.75, lr=7.71e-05, throughput=5621 tok/s 2025-11-25 10:28:16,484 - INFO - Epoch 1 Step 4030 (Global: 4030): loss=1.7894, ppl=5.99, grad_norm=0.80, lr=7.70e-05, throughput=5624 tok/s 2025-11-25 10:29:41,644 - INFO - Epoch 1 Step 4040 (Global: 4040): loss=1.7086, ppl=5.52, grad_norm=0.77, lr=7.68e-05, throughput=5637 tok/s 2025-11-25 10:31:06,782 - INFO - Epoch 1 Step 4050 (Global: 4050): loss=1.5950, ppl=4.93, grad_norm=0.75, lr=7.67e-05, throughput=5638 tok/s 2025-11-25 10:32:31,754 - INFO - Epoch 1 Step 4060 (Global: 4060): loss=1.8900, ppl=6.62, 
grad_norm=0.80, lr=7.65e-05, throughput=5649 tok/s
2025-11-25 10:33:56,909 - INFO - Epoch 1 Step 4070 (Global: 4070): loss=1.8688, ppl=6.48, grad_norm=0.83, lr=7.64e-05, throughput=5637 tok/s
2025-11-25 10:35:22,126 - INFO - Epoch 1 Step 4080 (Global: 4080): loss=1.5845, ppl=4.88, grad_norm=0.77, lr=7.62e-05, throughput=5633 tok/s
2025-11-25 10:36:47,695 - INFO - Epoch 1 Step 4090 (Global: 4090): loss=1.8640, ppl=6.45, grad_norm=0.84, lr=7.61e-05, throughput=5610 tok/s
2025-11-25 10:38:12,614 - INFO - Epoch 1 Step 4100 (Global: 4100): loss=1.7667, ppl=5.85, grad_norm=0.79, lr=7.60e-05, throughput=5652 tok/s
2025-11-25 10:39:38,079 - INFO - Epoch 1 Step 4110 (Global: 4110): loss=1.8898, ppl=6.62, grad_norm=0.82, lr=7.58e-05, throughput=5616 tok/s
2025-11-25 10:41:02,963 - INFO - Epoch 1 Step 4120 (Global: 4120): loss=2.0148, ppl=7.50, grad_norm=0.82, lr=7.57e-05, throughput=5655 tok/s
2025-11-25 10:42:28,173 - INFO - Epoch 1 Step 4130 (Global: 4130): loss=1.9070, ppl=6.73, grad_norm=0.79, lr=7.55e-05, throughput=5633 tok/s
2025-11-25 10:43:53,364 - INFO - Epoch 1 Step 4140 (Global: 4140): loss=1.6744, ppl=5.34, grad_norm=0.79, lr=7.54e-05, throughput=5634 tok/s
2025-11-25 10:45:18,582 - INFO - Epoch 1 Step 4150 (Global: 4150): loss=1.7543, ppl=5.78, grad_norm=0.81, lr=7.52e-05, throughput=5633 tok/s
2025-11-25 10:46:43,781 - INFO - Epoch 1 Step 4160 (Global: 4160): loss=1.8022, ppl=6.06, grad_norm=0.79, lr=7.51e-05, throughput=5634 tok/s
2025-11-25 10:48:09,007 - INFO - Epoch 1 Step 4170 (Global: 4170): loss=1.5965, ppl=4.94, grad_norm=0.72, lr=7.49e-05, throughput=5632 tok/s
2025-11-25 10:49:34,184 - INFO - Epoch 1 Step 4180 (Global: 4180): loss=1.8346, ppl=6.26, grad_norm=0.82, lr=7.48e-05, throughput=5635 tok/s
2025-11-25 10:50:59,525 - INFO - Epoch 1 Step 4190 (Global: 4190): loss=1.5920, ppl=4.91, grad_norm=0.83, lr=7.47e-05, throughput=5625 tok/s
2025-11-25 10:52:24,523 - INFO - Epoch 1 Step 4200 (Global: 4200): loss=1.9751, ppl=7.21, grad_norm=0.81, lr=7.45e-05, throughput=5647 tok/s
2025-11-25 10:53:50,084 - INFO - Epoch 1 Step 4210 (Global: 4210): loss=1.8901, ppl=6.62, grad_norm=0.83, lr=7.44e-05, throughput=5610 tok/s
2025-11-25 10:55:15,716 - INFO - Epoch 1 Step 4220 (Global: 4220): loss=1.9228, ppl=6.84, grad_norm=0.84, lr=7.42e-05, throughput=5605 tok/s
2025-11-25 10:56:41,185 - INFO - Epoch 1 Step 4230 (Global: 4230): loss=1.7372, ppl=5.68, grad_norm=0.80, lr=7.41e-05, throughput=5616 tok/s
2025-11-25 10:58:06,161 - INFO - Epoch 1 Step 4240 (Global: 4240): loss=1.7501, ppl=5.76, grad_norm=0.81, lr=7.39e-05, throughput=5649 tok/s
2025-11-25 10:59:31,376 - INFO - Epoch 1 Step 4250 (Global: 4250): loss=1.8710, ppl=6.49, grad_norm=0.81, lr=7.38e-05, throughput=5633 tok/s
2025-11-25 11:00:56,619 - INFO - Epoch 1 Step 4260 (Global: 4260): loss=1.7605, ppl=5.82, grad_norm=0.81, lr=7.36e-05, throughput=5631 tok/s
2025-11-25 11:02:21,801 - INFO - Epoch 1 Step 4270 (Global: 4270): loss=1.6617, ppl=5.27, grad_norm=0.76, lr=7.35e-05, throughput=5635 tok/s
2025-11-25 11:03:47,332 - INFO - Epoch 1 Step 4280 (Global: 4280): loss=1.7846, ppl=5.96, grad_norm=0.80, lr=7.33e-05, throughput=5612 tok/s
2025-11-25 11:05:12,834 - INFO - Epoch 1 Step 4290 (Global: 4290): loss=1.6536, ppl=5.23, grad_norm=0.77, lr=7.32e-05, throughput=5614 tok/s
2025-11-25 11:06:38,462 - INFO - Epoch 1 Step 4300 (Global: 4300): loss=1.8076, ppl=6.10, grad_norm=0.75, lr=7.30e-05, throughput=5606 tok/s
2025-11-25 11:08:03,815 - INFO - Epoch 1 Step 4310 (Global: 4310): loss=1.4343, ppl=4.20, grad_norm=0.74, lr=7.29e-05, throughput=5624 tok/s
2025-11-25 11:09:29,221 - INFO - Epoch 1 Step 4320 (Global: 4320): loss=1.5468, ppl=4.70, grad_norm=0.79, lr=7.27e-05, throughput=5620 tok/s
2025-11-25 11:10:54,451 - INFO - Epoch 1 Step 4330 (Global: 4330): loss=1.6367, ppl=5.14, grad_norm=0.77, lr=7.26e-05, throughput=5632 tok/s
2025-11-25 11:12:19,774 - INFO - Epoch 1 Step 4340 (Global: 4340): loss=1.4442, ppl=4.24, grad_norm=0.71, lr=7.24e-05, throughput=5626 tok/s
2025-11-25 11:13:45,029 - INFO - Epoch 1 Step 4350 (Global: 4350): loss=1.8365, ppl=6.27, grad_norm=0.85, lr=7.23e-05, throughput=5630 tok/s
2025-11-25 11:15:10,389 - INFO - Epoch 1 Step 4360 (Global: 4360): loss=1.5574, ppl=4.75, grad_norm=0.73, lr=7.21e-05, throughput=5623 tok/s
2025-11-25 11:16:35,782 - INFO - Epoch 1 Step 4370 (Global: 4370): loss=1.8073, ppl=6.09, grad_norm=0.80, lr=7.20e-05, throughput=5621 tok/s
2025-11-25 11:18:00,831 - INFO - Epoch 1 Step 4380 (Global: 4380): loss=1.6676, ppl=5.30, grad_norm=0.78, lr=7.18e-05, throughput=5644 tok/s
2025-11-25 11:19:26,205 - INFO - Epoch 1 Step 4390 (Global: 4390): loss=1.7534, ppl=5.77, grad_norm=0.78, lr=7.17e-05, throughput=5622 tok/s
2025-11-25 11:20:51,673 - INFO - Epoch 1 Step 4400 (Global: 4400): loss=1.6369, ppl=5.14, grad_norm=0.78, lr=7.15e-05, throughput=5616 tok/s
2025-11-25 11:22:16,993 - INFO - Epoch 1 Step 4410 (Global: 4410): loss=1.5966, ppl=4.94, grad_norm=0.76, lr=7.14e-05, throughput=5626 tok/s
2025-11-25 11:23:42,331 - INFO - Epoch 1 Step 4420 (Global: 4420): loss=1.8127, ppl=6.13, grad_norm=0.77, lr=7.12e-05, throughput=5625 tok/s
2025-11-25 11:25:07,714 - INFO - Epoch 1 Step 4430 (Global: 4430): loss=1.8463, ppl=6.34, grad_norm=0.81, lr=7.11e-05, throughput=5622 tok/s
2025-11-25 11:26:33,040 - INFO - Epoch 1 Step 4440 (Global: 4440): loss=1.8151, ppl=6.14, grad_norm=0.81, lr=7.09e-05, throughput=5626 tok/s
2025-11-25 11:27:58,231 - INFO - Epoch 1 Step 4450 (Global: 4450): loss=1.6694, ppl=5.31, grad_norm=0.77, lr=7.08e-05, throughput=5634 tok/s
2025-11-25 11:29:23,647 - INFO - Epoch 1 Step 4460 (Global: 4460): loss=1.6534, ppl=5.22, grad_norm=0.80, lr=7.06e-05, throughput=5620 tok/s
2025-11-25 11:30:48,897 - INFO - Epoch 1 Step 4470 (Global: 4470): loss=1.5398, ppl=4.66, grad_norm=0.79, lr=7.05e-05, throughput=5631 tok/s
2025-11-25 11:32:13,821 - INFO - Epoch 1 Step 4480 (Global: 4480): loss=1.8224, ppl=6.19, grad_norm=0.82, lr=7.03e-05, throughput=5652 tok/s
2025-11-25 11:33:38,941 - INFO - Epoch 1 Step 4490 (Global: 4490): loss=1.5493, ppl=4.71, grad_norm=0.86, lr=7.02e-05, throughput=5639 tok/s
2025-11-25 11:35:04,201 - INFO - Epoch 1 Step 4500 (Global: 4500): loss=1.5994, ppl=4.95, grad_norm=0.76, lr=7.00e-05, throughput=5630 tok/s
2025-11-25 11:36:29,327 - INFO - Epoch 1 Step 4510 (Global: 4510): loss=1.5286, ppl=4.61, grad_norm=0.73, lr=6.99e-05, throughput=5639 tok/s
2025-11-25 11:37:54,594 - INFO - Epoch 1 Step 4520 (Global: 4520): loss=1.6971, ppl=5.46, grad_norm=0.81, lr=6.97e-05, throughput=5629 tok/s
2025-11-25 11:39:19,978 - INFO - Epoch 1 Step 4530 (Global: 4530): loss=1.7635, ppl=5.83, grad_norm=0.78, lr=6.96e-05, throughput=5622 tok/s
2025-11-25 11:40:45,030 - INFO - Epoch 1 Step 4540 (Global: 4540): loss=1.7263, ppl=5.62, grad_norm=0.78, lr=6.94e-05, throughput=5644 tok/s
2025-11-25 11:42:10,337 - INFO - Epoch 1 Step 4550 (Global: 4550): loss=1.9233, ppl=6.84, grad_norm=0.79, lr=6.92e-05, throughput=5627 tok/s
2025-11-25 11:43:35,758 - INFO - Epoch 1 Step 4560 (Global: 4560): loss=1.4856, ppl=4.42, grad_norm=0.73, lr=6.91e-05, throughput=5619 tok/s
2025-11-25 11:45:00,966 - INFO - Epoch 1 Step 4570 (Global: 4570): loss=1.9084, ppl=6.74, grad_norm=0.82, lr=6.89e-05, throughput=5633 tok/s
2025-11-25 11:46:26,447 - INFO - Epoch 1 Step 4580 (Global: 4580): loss=1.6554, ppl=5.24, grad_norm=0.78, lr=6.88e-05, throughput=5615 tok/s
2025-11-25 11:47:51,830 - INFO - Epoch 1 Step 4590 (Global: 4590): loss=1.5989, ppl=4.95, grad_norm=0.75, lr=6.86e-05, throughput=5622 tok/s
2025-11-25 11:49:17,642 - INFO - Epoch 1 Step 4600 (Global: 4600): loss=1.7366, ppl=5.68, grad_norm=0.78, lr=6.85e-05, throughput=5594 tok/s
2025-11-25 11:50:43,271 - INFO - Epoch 1 Step 4610 (Global: 4610): loss=1.6768, ppl=5.35, grad_norm=0.95, lr=6.83e-05, throughput=5606 tok/s
2025-11-25 11:52:08,827 - INFO - Epoch 1 Step 4620 (Global: 4620): loss=1.6151, ppl=5.03, grad_norm=0.80, lr=6.82e-05, throughput=5610 tok/s
2025-11-25 11:53:34,473 - INFO - Epoch 1 Step 4630 (Global: 4630): loss=1.8054, ppl=6.08, grad_norm=0.75, lr=6.80e-05, throughput=5604 tok/s
2025-11-25 11:54:59,954 - INFO - Epoch 1 Step 4640 (Global: 4640): loss=1.7932, ppl=6.01, grad_norm=0.82, lr=6.78e-05, throughput=5615 tok/s
2025-11-25 11:56:25,211 - INFO - Epoch 1 Step 4650 (Global: 4650): loss=1.7464, ppl=5.73, grad_norm=0.73, lr=6.77e-05, throughput=5630 tok/s
2025-11-25 11:57:50,726 - INFO - Epoch 1 Step 4660 (Global: 4660): loss=1.7131, ppl=5.55, grad_norm=0.78, lr=6.75e-05, throughput=5613 tok/s
2025-11-25 11:59:16,145 - INFO - Epoch 1 Step 4670 (Global: 4670): loss=1.7183, ppl=5.58, grad_norm=0.79, lr=6.74e-05, throughput=5619 tok/s
2025-11-25 12:00:41,672 - INFO - Epoch 1 Step 4680 (Global: 4680): loss=1.7556, ppl=5.79, grad_norm=0.77, lr=6.72e-05, throughput=5612 tok/s
2025-11-25 12:02:06,987 - INFO - Epoch 1 Step 4690 (Global: 4690): loss=1.7018, ppl=5.48, grad_norm=0.79, lr=6.71e-05, throughput=5626 tok/s
2025-11-25 12:03:32,463 - INFO - Epoch 1 Step 4700 (Global: 4700): loss=1.6230, ppl=5.07, grad_norm=0.73, lr=6.69e-05, throughput=5616 tok/s
2025-11-25 12:04:57,700 - INFO - Epoch 1 Step 4710 (Global: 4710): loss=1.6929, ppl=5.44, grad_norm=0.77, lr=6.67e-05, throughput=5631 tok/s
2025-11-25 12:06:23,024 - INFO - Epoch 1 Step 4720 (Global: 4720): loss=1.6943, ppl=5.44, grad_norm=0.84, lr=6.66e-05, throughput=5626 tok/s
2025-11-25 12:07:48,503 - INFO - Epoch 1 Step 4730 (Global: 4730): loss=1.8857, ppl=6.59, grad_norm=0.79, lr=6.64e-05, throughput=5615 tok/s
2025-11-25 12:09:14,039 - INFO - Epoch 1 Step 4740 (Global: 4740): loss=1.4843, ppl=4.41, grad_norm=0.76, lr=6.63e-05, throughput=5612 tok/s
2025-11-25 12:10:39,024 - INFO - Epoch 1 Step 4750 (Global: 4750): loss=1.5829, ppl=4.87, grad_norm=0.80, lr=6.61e-05, throughput=5648 tok/s
2025-11-25 12:12:04,341 - INFO - Epoch 1 Step 4760 (Global: 4760): loss=1.4016, ppl=4.06, grad_norm=0.72, lr=6.60e-05, throughput=5626 tok/s
2025-11-25 12:13:29,561 - INFO - Epoch 1 Step 4770 (Global: 4770): loss=1.6067, ppl=4.99, grad_norm=0.74, lr=6.58e-05, throughput=5633 tok/s
2025-11-25 12:14:54,954 - INFO - Epoch 1 Step 4780 (Global: 4780): loss=1.9135, ppl=6.78, grad_norm=0.82, lr=6.56e-05, throughput=5621 tok/s
2025-11-25 12:16:19,937 - INFO - Epoch 1 Step 4790 (Global: 4790): loss=1.8310, ppl=6.24, grad_norm=0.82, lr=6.55e-05, throughput=5648 tok/s
2025-11-25 12:17:45,181 - INFO - Epoch 1 Step 4800 (Global: 4800): loss=1.6730, ppl=5.33, grad_norm=0.82, lr=6.53e-05, throughput=5631 tok/s
2025-11-25 12:19:10,546 - INFO - Epoch 1 Step 4810 (Global: 4810): loss=1.5145, ppl=4.55, grad_norm=0.71, lr=6.52e-05, throughput=5623 tok/s
2025-11-25 12:20:35,769 - INFO - Epoch 1 Step 4820 (Global: 4820): loss=1.4452, ppl=4.24, grad_norm=0.73, lr=6.50e-05, throughput=5632 tok/s
2025-11-25 12:22:01,034 - INFO - Epoch 1 Step 4830 (Global: 4830): loss=1.6267, ppl=5.09, grad_norm=0.78, lr=6.48e-05, throughput=5630 tok/s
2025-11-25 12:23:26,359 - INFO - Epoch 1 Step 4840 (Global: 4840): loss=1.7594, ppl=5.81, grad_norm=0.78, lr=6.47e-05, throughput=5626 tok/s
2025-11-25 12:24:51,521 - INFO - Epoch 1 Step 4850 (Global: 4850): loss=1.6502, ppl=5.21, grad_norm=0.75, lr=6.45e-05, throughput=5636 tok/s
2025-11-25 12:26:16,671 - INFO - Epoch 1 Step 4860 (Global: 4860): loss=1.5997, ppl=4.95, grad_norm=0.79, lr=6.44e-05, throughput=5637 tok/s
2025-11-25 12:27:42,026 - INFO - Epoch 1 Step 4870 (Global: 4870): loss=1.6098, ppl=5.00, grad_norm=0.86, lr=6.42e-05, throughput=5624 tok/s
2025-11-25 12:29:06,922 - INFO - Epoch 1 Step 4880 (Global: 4880): loss=1.7825, ppl=5.94, grad_norm=0.79, lr=6.40e-05, throughput=5654 tok/s
2025-11-25 12:30:32,287 - INFO - Epoch 1 Step 4890 (Global: 4890): loss=1.6684, ppl=5.30, grad_norm=0.74, lr=6.39e-05, throughput=5623 tok/s
2025-11-25 12:31:57,235 - INFO - Epoch 1 Step 4900 (Global: 4900): loss=1.7030, ppl=5.49, grad_norm=0.85, lr=6.37e-05, throughput=5651 tok/s
2025-11-25 12:33:22,415 - INFO - Epoch 1 Step 4910 (Global: 4910): loss=1.4541, ppl=4.28, grad_norm=0.73, lr=6.35e-05, throughput=5635 tok/s
2025-11-25 12:34:47,642 - INFO - Epoch 1 Step 4920 (Global: 4920): loss=1.7073, ppl=5.51, grad_norm=0.91, lr=6.34e-05, throughput=5632 tok/s
2025-11-25 12:36:12,534 - INFO - Epoch 1 Step 4930 (Global: 4930): loss=1.6688, ppl=5.31, grad_norm=0.76, lr=6.32e-05, throughput=5654 tok/s
2025-11-25 12:37:37,703 - INFO - Epoch 1 Step 4940 (Global: 4940): loss=1.9038, ppl=6.71, grad_norm=0.84, lr=6.31e-05, throughput=5636 tok/s
2025-11-25 12:39:02,832 - INFO - Epoch 1 Step 4950 (Global: 4950): loss=1.6588, ppl=5.25, grad_norm=0.74, lr=6.29e-05, throughput=5639 tok/s
2025-11-25 12:40:27,995 - INFO - Epoch 1 Step 4960 (Global: 4960): loss=1.5335, ppl=4.63, grad_norm=0.75, lr=6.27e-05, throughput=5636 tok/s
2025-11-25 12:41:53,193 - INFO - Epoch 1 Step 4970 (Global: 4970): loss=1.7825, ppl=5.94, grad_norm=0.80, lr=6.26e-05, throughput=5634 tok/s
2025-11-25 12:43:18,081 - INFO - Epoch 1 Step 4980 (Global: 4980): loss=1.8433, ppl=6.32, grad_norm=0.82, lr=6.24e-05, throughput=5655 tok/s
2025-11-25 12:44:43,473 - INFO - Epoch 1 Step 4990 (Global: 4990): loss=1.5911, ppl=4.91, grad_norm=0.74, lr=6.23e-05, throughput=5621 tok/s
2025-11-25 12:46:08,396 - INFO - Epoch 1 Step 5000 (Global: 5000): loss=1.7960, ppl=6.03, grad_norm=0.76, lr=6.21e-05, throughput=5652 tok/s
2025-11-25 12:47:33,594 - INFO - Epoch 1 Step 5010 (Global: 5010): loss=1.7441, ppl=5.72, grad_norm=0.79, lr=6.19e-05, throughput=5634 tok/s
2025-11-25 12:48:58,776 - INFO - Epoch 1 Step 5020 (Global: 5020): loss=1.6722, ppl=5.32, grad_norm=0.79, lr=6.18e-05, throughput=5635 tok/s
2025-11-25 12:50:23,940 - INFO - Epoch 1 Step 5030 (Global: 5030): loss=1.7103, ppl=5.53, grad_norm=0.79, lr=6.16e-05, throughput=5636 tok/s
2025-11-25 12:51:49,117 - INFO - Epoch 1 Step 5040 (Global: 5040): loss=1.7239, ppl=5.61, grad_norm=0.80, lr=6.14e-05, throughput=5635 tok/s
2025-11-25 12:53:14,389 - INFO - Epoch 1 Step 5050 (Global: 5050): loss=1.6400, ppl=5.16, grad_norm=0.84, lr=6.13e-05, throughput=5629 tok/s
2025-11-25 12:54:39,545 - INFO - Epoch 1 Step 5060 (Global: 5060): loss=1.6997, ppl=5.47, grad_norm=0.80, lr=6.11e-05, throughput=5637 tok/s
2025-11-25 12:56:04,850 - INFO - Epoch 1 Step 5070 (Global: 5070): loss=1.6799, ppl=5.37, grad_norm=0.76, lr=6.10e-05, throughput=5627 tok/s
2025-11-25 12:57:29,855 - INFO - Epoch 1 Step 5080 (Global: 5080): loss=1.5798, ppl=4.85, grad_norm=0.76, lr=6.08e-05, throughput=5647 tok/s
2025-11-25 12:58:55,232 - INFO - Epoch 1 Step 5090 (Global: 5090): loss=1.6476, ppl=5.19, grad_norm=0.74, lr=6.06e-05, throughput=5622 tok/s
2025-11-25 13:00:20,546 - INFO - Epoch 1 Step 5100 (Global: 5100): loss=1.5890, ppl=4.90, grad_norm=0.75, lr=6.05e-05, throughput=5626 tok/s
2025-11-25 13:01:45,907 - INFO - Epoch 1 Step 5110 (Global: 5110): loss=1.7637, ppl=5.83, grad_norm=0.77, lr=6.03e-05, throughput=5623 tok/s
2025-11-25 13:03:11,324 - INFO - Epoch 1 Step 5120 (Global: 5120): loss=1.6830, ppl=5.38, grad_norm=0.79, lr=6.01e-05, throughput=5620 tok/s
2025-11-25 13:04:36,849 - INFO - Epoch 1 Step 5130 (Global: 5130): loss=1.8651, ppl=6.46, grad_norm=0.77, lr=6.00e-05, throughput=5612 tok/s
2025-11-25 13:06:02,170 - INFO - Epoch 1 Step 5140 (Global: 5140): loss=1.7060, ppl=5.51, grad_norm=0.75, lr=5.98e-05, throughput=5626 tok/s
2025-11-25 13:07:27,309 - INFO - Epoch 1 Step 5150 (Global: 5150): loss=1.8870, ppl=6.60, grad_norm=0.76, lr=5.96e-05, throughput=5638 tok/s
2025-11-25 13:08:52,778 - INFO - Epoch 1 Step 5160 (Global: 5160): loss=1.3688, ppl=3.93, grad_norm=0.71, lr=5.95e-05, throughput=5616 tok/s
2025-11-25 13:10:18,016 - INFO - Epoch 1 Step 5170 (Global: 5170): loss=1.5160, ppl=4.55, grad_norm=0.72, lr=5.93e-05, throughput=5631 tok/s
2025-11-25 13:11:43,219 - INFO - Epoch 1 Step 5180 (Global: 5180): loss=1.6982, ppl=5.46, grad_norm=0.77, lr=5.91e-05, throughput=5634 tok/s
2025-11-25 13:13:09,544 - INFO - Epoch 1 Step 5190 (Global: 5190): loss=1.6156, ppl=5.03, grad_norm=0.78, lr=5.90e-05, throughput=5560 tok/s
2025-11-25 13:14:41,018 - INFO - Epoch 1 Step 5200 (Global: 5200): loss=1.8424, ppl=6.31, grad_norm=0.79, lr=5.88e-05, throughput=5247 tok/s
2025-11-25 13:16:18,484 - INFO - Epoch 1 Step 5210 (Global: 5210): loss=1.7458, ppl=5.73, grad_norm=0.77, lr=5.87e-05, throughput=4925 tok/s
2025-11-25 13:17:54,727 - INFO - Epoch 1 Step 5220 (Global: 5220): loss=1.5854, ppl=4.88, grad_norm=0.71, lr=5.85e-05, throughput=4987 tok/s
2025-11-25 13:19:29,265 - INFO - Epoch 1 Step 5230 (Global: 5230): loss=1.7010, ppl=5.48, grad_norm=0.73, lr=5.83e-05, throughput=5077 tok/s
2025-11-25 13:20:56,180 - INFO - Epoch 1 Step 5240 (Global: 5240): loss=1.6023, ppl=4.96, grad_norm=0.72, lr=5.82e-05, throughput=5523 tok/s
2025-11-25 13:22:22,674 - INFO - Epoch 1 Step 5250 (Global: 5250): loss=1.6758, ppl=5.34, grad_norm=0.76, lr=5.80e-05, throughput=5550 tok/s
2025-11-25 13:23:50,730 - INFO - Epoch 1 Step 5260 (Global: 5260): loss=1.6706, ppl=5.32, grad_norm=0.81, lr=5.78e-05, throughput=5451 tok/s
2025-11-25 13:25:18,692 - INFO - Epoch 1 Step 5270 (Global: 5270): loss=1.7422, ppl=5.71, grad_norm=0.82, lr=5.77e-05, throughput=5457 tok/s
2025-11-25 13:26:45,247 - INFO - Epoch 1 Step 5280 (Global: 5280): loss=1.7122, ppl=5.54, grad_norm=0.82, lr=5.75e-05, throughput=5546 tok/s
2025-11-25 13:28:11,263 - INFO - Epoch 1 Step 5290 (Global: 5290): loss=1.5804, ppl=4.86, grad_norm=0.79, lr=5.73e-05, throughput=5580 tok/s
2025-11-25 13:29:36,540 - INFO - Epoch 1 Step 5300 (Global: 5300): loss=1.6390, ppl=5.15, grad_norm=0.81, lr=5.72e-05, throughput=5629 tok/s
2025-11-25 13:31:01,816 - INFO - Epoch 1 Step 5310 (Global: 5310): loss=1.6632, ppl=5.28, grad_norm=0.79, lr=5.70e-05, throughput=5629 tok/s
2025-11-25 13:32:27,265 - INFO - Epoch 1 Step 5320 (Global: 5320): loss=1.6496, ppl=5.20, grad_norm=0.77, lr=5.68e-05, throughput=5617 tok/s
2025-11-25 13:33:54,273 - INFO - Epoch 1 Step 5330 (Global: 5330): loss=1.6286, ppl=5.10, grad_norm=0.74, lr=5.67e-05, throughput=5517 tok/s
2025-11-25 13:35:24,201 - INFO - Epoch 1 Step 5340 (Global: 5340): loss=1.7805, ppl=5.93, grad_norm=0.79, lr=5.65e-05, throughput=5338 tok/s
2025-11-25 13:36:58,219 - INFO - Epoch 1 Step 5350 (Global: 5350): loss=1.7969, ppl=6.03, grad_norm=0.79, lr=5.63e-05, throughput=5105 tok/s
2025-11-25 13:38:31,653 - INFO - Epoch 1 Step 5360 (Global: 5360): loss=1.5603, ppl=4.76, grad_norm=0.76, lr=5.62e-05, throughput=5137 tok/s
2025-11-25 13:40:01,100 - INFO - Epoch 1 Step 5370 (Global: 5370): loss=1.7181, ppl=5.57, grad_norm=0.77, lr=5.60e-05, throughput=5366 tok/s
2025-11-25 13:41:29,367 - INFO - Epoch 1 Step 5380 (Global: 5380): loss=1.6177, ppl=5.04, grad_norm=0.77, lr=5.58e-05, throughput=5438 tok/s
2025-11-25 13:42:55,169 - INFO - Epoch 1 Step 5390 (Global: 5390): loss=1.7248, ppl=5.61, grad_norm=0.77, lr=5.57e-05, throughput=5594 tok/s
2025-11-25 13:44:20,010 - INFO - Epoch 1 Step 5400 (Global: 5400): loss=1.5140, ppl=4.54, grad_norm=0.78, lr=5.55e-05, throughput=5658 tok/s
2025-11-25 13:45:48,865 - INFO - Epoch 1 Step 5410 (Global: 5410): loss=1.4615, ppl=4.31, grad_norm=0.77, lr=5.53e-05, throughput=5402 tok/s
2025-11-25 13:47:18,767 - INFO - Epoch 1 Step 5420 (Global: 5420): loss=1.3620, ppl=3.90, grad_norm=0.73, lr=5.52e-05, throughput=5339 tok/s
2025-11-25 13:48:49,056 - INFO - Epoch 1 Step 5430 (Global: 5430): loss=1.5649, ppl=4.78, grad_norm=0.82, lr=5.50e-05, throughput=5316 tok/s
2025-11-25 13:50:17,554 - INFO - Epoch 1 Step 5440 (Global: 5440): loss=1.4400, ppl=4.22, grad_norm=0.74, lr=5.48e-05, throughput=5424 tok/s
2025-11-25 13:51:43,490 - INFO - Epoch 1 Step 5450 (Global: 5450): loss=1.7350, ppl=5.67, grad_norm=0.79, lr=5.47e-05, throughput=5586 tok/s
2025-11-25 13:53:08,721 - INFO - Epoch 1 Step 5460 (Global: 5460): loss=1.5705, ppl=4.81, grad_norm=0.75, lr=5.45e-05, throughput=5632 tok/s
2025-11-25 13:54:35,360 - INFO - Epoch 1 Step 5470 (Global: 5470): loss=1.9260, ppl=6.86, grad_norm=0.81, lr=5.43e-05, throughput=5540 tok/s
2025-11-25 13:56:05,930 - INFO - Epoch 1 Step 5480 (Global: 5480): loss=1.5126, ppl=4.54, grad_norm=0.73, lr=5.42e-05, throughput=5300 tok/s
2025-11-25 13:57:37,479 - INFO - Epoch 1 Step 5490 (Global: 5490): loss=1.6386, ppl=5.15, grad_norm=0.75, lr=5.40e-05, throughput=5243 tok/s
2025-11-25 13:59:08,755 - INFO - Epoch 1 Step 5500 (Global: 5500): loss=1.6709, ppl=5.32, grad_norm=0.73, lr=5.38e-05, throughput=5259 tok/s
2025-11-25 14:00:39,324 - INFO - Epoch 1 Step 5510 (Global: 5510): loss=1.7227, ppl=5.60, grad_norm=0.75, lr=5.37e-05, throughput=5300 tok/s
2025-11-25 14:02:05,846 - INFO - Epoch 1 Step 5520 (Global: 5520): loss=1.7948, ppl=6.02, grad_norm=0.78, lr=5.35e-05, throughput=5548 tok/s
2025-11-25 14:03:32,149 - INFO - Epoch 1 Step 5530 (Global: 5530): loss=1.7026, ppl=5.49, grad_norm=0.83, lr=5.33e-05, throughput=5562 tok/s
2025-11-25 14:04:58,858 - INFO - Epoch 1 Step 5540 (Global: 5540): loss=1.5682, ppl=4.80, grad_norm=0.73, lr=5.32e-05, throughput=5536 tok/s
2025-11-25 14:06:31,873 - INFO - Epoch 1 Step 5550 (Global: 5550): loss=1.5477, ppl=4.70, grad_norm=0.74, lr=5.30e-05, throughput=5161 tok/s
2025-11-25 14:08:01,795 - INFO - Epoch 1 Step 5560 (Global: 5560): loss=1.6716, ppl=5.32, grad_norm=0.73, lr=5.28e-05, throughput=5338 tok/s
2025-11-25 14:09:26,804 - INFO - Epoch 1 Step 5570 (Global: 5570): loss=1.5890, ppl=4.90, grad_norm=0.79, lr=5.27e-05, throughput=5647 tok/s
2025-11-25 14:10:52,197 - INFO - Epoch 1 Step 5580 (Global: 5580): loss=1.6283, ppl=5.10, grad_norm=0.75, lr=5.25e-05, throughput=5621 tok/s
2025-11-25 14:12:18,740 - INFO - Epoch 1 Step 5590 (Global: 5590): loss=1.6918, ppl=5.43, grad_norm=0.77, lr=5.23e-05, throughput=5546 tok/s
2025-11-25 14:13:45,995 - INFO - Epoch 1 Step 5600 (Global: 5600): loss=1.4961, ppl=4.46, grad_norm=0.73, lr=5.22e-05, throughput=5501 tok/s
2025-11-25 14:15:18,235 - INFO - Epoch 1 Step 5610 (Global: 5610): loss=1.6466, ppl=5.19, grad_norm=0.77, lr=5.20e-05, throughput=5204 tok/s
2025-11-25 14:16:43,161 - INFO - Epoch 1 Step 5620 (Global: 5620): loss=1.4039, ppl=4.07, grad_norm=0.70, lr=5.18e-05, throughput=5652 tok/s
2025-11-25 14:18:08,278 - INFO - Epoch 1 Step 5630 (Global: 5630): loss=1.6020, ppl=4.96, grad_norm=0.73, lr=5.17e-05, throughput=5639 tok/s
2025-11-25 14:19:33,554 - INFO - Epoch 1 Step 5640 (Global: 5640): loss=1.6685, ppl=5.30, grad_norm=0.78, lr=5.15e-05, throughput=5629 tok/s
2025-11-25 14:20:58,855 - INFO - Epoch 1 Step 5650 (Global: 5650): loss=1.7082, ppl=5.52, grad_norm=0.76, lr=5.13e-05, throughput=5627 tok/s
2025-11-25 14:22:24,308 - INFO - Epoch 1 Step 5660 (Global: 5660): loss=1.7114, ppl=5.54, grad_norm=0.79, lr=5.12e-05, throughput=5617 tok/s
2025-11-25 14:23:49,453 - INFO - Epoch 1 Step 5670 (Global: 5670): loss=1.4130, ppl=4.11, grad_norm=0.70, lr=5.10e-05, throughput=5638 tok/s
2025-11-25 14:25:14,457 - INFO - Epoch 1 Step 5680 (Global: 5680): loss=1.7155, ppl=5.56, grad_norm=0.79, lr=5.08e-05, throughput=5647 tok/s
2025-11-25 14:26:39,434 - INFO - Epoch 1 Step 5690 (Global: 5690): loss=1.8053, ppl=6.08, grad_norm=0.80, lr=5.07e-05, throughput=5649 tok/s
2025-11-25 14:28:05,721 - INFO - Epoch 1 Step 5700 (Global: 5700): loss=1.8151, ppl=6.14, grad_norm=0.76, lr=5.05e-05, throughput=5563 tok/s
2025-11-25 14:29:33,652 - INFO - Epoch 1 Step 5710 (Global: 5710): loss=1.6987, ppl=5.47, grad_norm=0.75, lr=5.03e-05, throughput=5459 tok/s
2025-11-25 14:30:58,674 - INFO - Epoch 1 Step 5720 (Global: 5720): loss=1.6380, ppl=5.14, grad_norm=0.75, lr=5.02e-05, throughput=5646 tok/s
2025-11-25 14:32:23,934 - INFO - Epoch 1 Step 5730 (Global: 5730): loss=1.8868, ppl=6.60, grad_norm=0.78, lr=5.00e-05, throughput=5630 tok/s
2025-11-25 14:33:49,156 - INFO - Epoch 1 Step 5740 (Global: 5740): loss=1.4963, ppl=4.47, grad_norm=0.73, lr=4.98e-05, throughput=5632 tok/s
2025-11-25 14:35:17,589 - INFO - Epoch 1 Step 5750 (Global: 5750): loss=1.7653, ppl=5.84, grad_norm=0.79, lr=4.96e-05, throughput=5428 tok/s
2025-11-25 14:36:47,928 - INFO - Epoch 1 Step 5760 (Global: 5760): loss=1.7820, ppl=5.94, grad_norm=0.79, lr=4.95e-05, throughput=5313 tok/s
2025-11-25 14:38:13,338 - INFO - Epoch 1 Step 5770 (Global: 5770): loss=1.7892, ppl=5.98, grad_norm=0.79, lr=4.93e-05, throughput=5620 tok/s
2025-11-25 14:39:38,681 - INFO - Epoch 1 Step 5780 (Global: 5780): loss=1.3664, ppl=3.92, grad_norm=0.73, lr=4.91e-05, throughput=5624 tok/s
2025-11-25 14:41:03,624 - INFO - Epoch 1 Step 5790 (Global: 5790): loss=1.5580, ppl=4.75, grad_norm=0.75, lr=4.90e-05, throughput=5651 tok/s
2025-11-25 14:42:28,880 - INFO - Epoch 1 Step 5800 (Global: 5800): loss=1.3882, ppl=4.01, grad_norm=0.73, lr=4.88e-05, throughput=5630 tok/s
2025-11-25 14:43:54,090 - INFO - Epoch 1 Step 5810 (Global: 5810): loss=1.7735, ppl=5.89, grad_norm=0.82, lr=4.86e-05, throughput=5633 tok/s
2025-11-25 14:45:19,177 - INFO - Epoch 1 Step 5820 (Global: 5820): loss=1.5650, ppl=4.78, grad_norm=0.78, lr=4.85e-05, throughput=5641 tok/s
2025-11-25 14:46:44,410 - INFO - Epoch 1 Step 5830 (Global: 5830): loss=1.9268, ppl=6.87, grad_norm=0.80, lr=4.83e-05, throughput=5632 tok/s
2025-11-25 14:48:09,824 - INFO - Epoch 1 Step 5840 (Global: 5840): loss=1.8268, ppl=6.21, grad_norm=0.77, lr=4.81e-05, throughput=5620 tok/s
2025-11-25 14:49:35,133 - INFO - Epoch 1 Step 5850 (Global: 5850): loss=1.5444, ppl=4.69, grad_norm=0.79, lr=4.80e-05, throughput=5627 tok/s
2025-11-25 14:51:00,472 - INFO - Epoch 1 Step 5860 (Global: 5860): loss=1.5215, ppl=4.58, grad_norm=0.76, lr=4.78e-05, throughput=5625 tok/s
2025-11-25 14:52:25,723 - INFO - Epoch 1 Step 5870 (Global: 5870): loss=1.6962, ppl=5.45, grad_norm=0.78, lr=4.76e-05, throughput=5630 tok/s
2025-11-25 14:53:50,848 - INFO - Epoch 1 Step 5880 (Global: 5880): loss=1.7037, ppl=5.49, grad_norm=0.77, lr=4.75e-05, throughput=5639 tok/s
2025-11-25 14:55:16,000 - INFO - Epoch 1 Step 5890 (Global: 5890): loss=1.9777, ppl=7.23, grad_norm=0.78, lr=4.73e-05, throughput=5637 tok/s
2025-11-25 14:56:40,943 - INFO - Epoch 1 Step 5900 (Global: 5900): loss=1.7339, ppl=5.66, grad_norm=0.78, lr=4.71e-05, throughput=5651 tok/s
2025-11-25 14:58:05,881 - INFO - Epoch 1 Step 5910 (Global: 5910): loss=1.7876, ppl=5.98, grad_norm=0.77, lr=4.70e-05, throughput=5651 tok/s
2025-11-25 14:59:30,791 - INFO - Epoch 1 Step 5920 (Global: 5920): loss=1.8490, ppl=6.35, grad_norm=0.78, lr=4.68e-05, throughput=5653 tok/s
2025-11-25 15:00:55,846 - INFO - Epoch 1 Step 5930 (Global: 5930): loss=1.3624, ppl=3.91, grad_norm=0.68, lr=4.66e-05, throughput=5643 tok/s
2025-11-25 15:02:20,715 - INFO - Epoch 1 Step 5940 (Global: 5940): loss=1.6762, ppl=5.34, grad_norm=0.81, lr=4.65e-05, throughput=5656 tok/s
2025-11-25 15:03:46,001 - INFO - Epoch 1 Step 5950 (Global: 5950): loss=1.6655, ppl=5.29, grad_norm=0.82, lr=4.63e-05, throughput=5628 tok/s
2025-11-25 15:05:11,239 - INFO - Epoch 1 Step 5960 (Global: 5960): loss=1.6237, ppl=5.07, grad_norm=0.75, lr=4.61e-05, throughput=5631 tok/s
2025-11-25 15:06:36,443 - INFO - Epoch 1 Step 5970 (Global: 5970): loss=1.5801, ppl=4.86, grad_norm=0.91, lr=4.60e-05, throughput=5634 tok/s
2025-11-25 15:08:01,791 - INFO - Epoch 1 Step 5980 (Global: 5980): loss=1.7458, ppl=5.73, grad_norm=0.78, lr=4.58e-05, throughput=5624 tok/s
2025-11-25 15:09:27,192 - INFO - Epoch 1 Step 5990 (Global: 5990): loss=1.7158, ppl=5.56, grad_norm=0.82, lr=4.56e-05, throughput=5621 tok/s
2025-11-25 15:10:52,738 - INFO - Epoch 1 Step 6000 (Global: 6000): loss=1.6742, ppl=5.33, grad_norm=0.79, lr=4.55e-05, throughput=5611 tok/s
2025-11-25 15:10:52,738 - INFO - Running validation at step 6000...
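(Editor's note: the ppl column in these records is just the exponentiated cross-entropy loss. A minimal sanity check against two logged records, assuming the logger rounds perplexity to two decimals:)

```python
import math

# Perplexity is exp(cross-entropy loss); verify against logged records.
val_loss = 1.6476                         # validation loss at step 6000
assert round(math.exp(val_loss), 2) == 5.19   # matches logged "perplexity: 5.19"

step_6960_loss = 1.3361                   # training loss at step 6960
assert round(math.exp(step_6960_loss), 2) == 3.80  # matches logged "ppl=3.80"
```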
2025-11-25 15:14:55,726 - INFO - Validation loss: 1.6476, perplexity: 5.19
2025-11-25 15:15:18,653 - INFO - Saved checkpoint to outputs/production_text_ctx277_lm_20251125_003839/best_checkpoint.pt
2025-11-25 15:15:18,658 - INFO - New best validation loss: 1.6476, perplexity: 5.19
2025-11-25 15:16:44,056 - INFO - Epoch 1 Step 6010 (Global: 6010): loss=1.6613, ppl=5.27, grad_norm=0.77, lr=4.53e-05, throughput=5621 tok/s
2025-11-25 15:18:09,522 - INFO - Epoch 1 Step 6020 (Global: 6020): loss=1.4127, ppl=4.11, grad_norm=0.75, lr=4.51e-05, throughput=5616 tok/s
2025-11-25 15:19:34,835 - INFO - Epoch 1 Step 6030 (Global: 6030): loss=1.6897, ppl=5.42, grad_norm=0.76, lr=4.50e-05, throughput=5626 tok/s
2025-11-25 15:21:00,171 - INFO - Epoch 1 Step 6040 (Global: 6040): loss=1.4493, ppl=4.26, grad_norm=0.76, lr=4.48e-05, throughput=5625 tok/s
2025-11-25 15:22:25,447 - INFO - Epoch 1 Step 6050 (Global: 6050): loss=1.6553, ppl=5.23, grad_norm=0.77, lr=4.46e-05, throughput=5629 tok/s
2025-11-25 15:23:50,562 - INFO - Epoch 1 Step 6060 (Global: 6060): loss=1.7346, ppl=5.67, grad_norm=0.77, lr=4.45e-05, throughput=5639 tok/s
2025-11-25 15:25:15,490 - INFO - Epoch 1 Step 6070 (Global: 6070): loss=1.5347, ppl=4.64, grad_norm=0.75, lr=4.43e-05, throughput=5652 tok/s
2025-11-25 15:26:40,625 - INFO - Epoch 1 Step 6080 (Global: 6080): loss=1.7341, ppl=5.66, grad_norm=0.78, lr=4.41e-05, throughput=5638 tok/s
2025-11-25 15:28:05,676 - INFO - Epoch 1 Step 6090 (Global: 6090): loss=1.7720, ppl=5.88, grad_norm=0.76, lr=4.40e-05, throughput=5644 tok/s
2025-11-25 15:29:30,888 - INFO - Epoch 1 Step 6100 (Global: 6100): loss=1.5808, ppl=4.86, grad_norm=0.74, lr=4.38e-05, throughput=5633 tok/s
2025-11-25 15:30:56,548 - INFO - Epoch 1 Step 6110 (Global: 6110): loss=1.6714, ppl=5.32, grad_norm=0.76, lr=4.36e-05, throughput=5604 tok/s
2025-11-25 15:32:24,387 - INFO - Epoch 1 Step 6120 (Global: 6120): loss=1.6959, ppl=5.45, grad_norm=0.73, lr=4.35e-05, throughput=5465 tok/s
2025-11-25 15:33:54,876 - INFO - Epoch 1 Step 6130 (Global: 6130): loss=1.5009, ppl=4.49, grad_norm=0.75, lr=4.33e-05, throughput=5305 tok/s
2025-11-25 15:35:19,974 - INFO - Epoch 1 Step 6140 (Global: 6140): loss=1.5823, ppl=4.87, grad_norm=0.77, lr=4.31e-05, throughput=5641 tok/s
2025-11-25 15:36:46,321 - INFO - Epoch 1 Step 6150 (Global: 6150): loss=1.5025, ppl=4.49, grad_norm=0.77, lr=4.30e-05, throughput=5559 tok/s
2025-11-25 15:38:18,680 - INFO - Epoch 1 Step 6160 (Global: 6160): loss=1.7951, ppl=6.02, grad_norm=0.78, lr=4.28e-05, throughput=5197 tok/s
2025-11-25 15:39:44,242 - INFO - Epoch 1 Step 6170 (Global: 6170): loss=1.7445, ppl=5.72, grad_norm=0.79, lr=4.26e-05, throughput=5610 tok/s
2025-11-25 15:41:09,367 - INFO - Epoch 1 Step 6180 (Global: 6180): loss=1.5446, ppl=4.69, grad_norm=0.69, lr=4.25e-05, throughput=5639 tok/s
2025-11-25 15:42:34,463 - INFO - Epoch 1 Step 6190 (Global: 6190): loss=1.5005, ppl=4.48, grad_norm=0.72, lr=4.23e-05, throughput=5641 tok/s
2025-11-25 15:43:59,826 - INFO - Epoch 1 Step 6200 (Global: 6200): loss=1.5647, ppl=4.78, grad_norm=0.70, lr=4.21e-05, throughput=5623 tok/s
2025-11-25 15:45:24,930 - INFO - Epoch 1 Step 6210 (Global: 6210): loss=1.6570, ppl=5.24, grad_norm=0.81, lr=4.20e-05, throughput=5640 tok/s
2025-11-25 15:46:49,766 - INFO - Epoch 1 Step 6220 (Global: 6220): loss=1.5501, ppl=4.71, grad_norm=0.70, lr=4.18e-05, throughput=5658 tok/s
2025-11-25 15:48:14,782 - INFO - Epoch 1 Step 6230 (Global: 6230): loss=1.7842, ppl=5.95, grad_norm=0.77, lr=4.16e-05, throughput=5646 tok/s
2025-11-25 15:49:39,814 - INFO - Epoch 1 Step 6240 (Global: 6240): loss=1.9481, ppl=7.02, grad_norm=0.79, lr=4.15e-05, throughput=5645 tok/s
2025-11-25 15:51:04,815 - INFO - Epoch 1 Step 6250 (Global: 6250): loss=1.7771, ppl=5.91, grad_norm=0.79, lr=4.13e-05, throughput=5647 tok/s
2025-11-25 15:52:29,766 - INFO - Epoch 1 Step 6260 (Global: 6260): loss=1.7579, ppl=5.80, grad_norm=0.75, lr=4.12e-05, throughput=5650 tok/s
2025-11-25 15:53:54,818 - INFO - Epoch 1 Step 6270 (Global: 6270): loss=1.4921, ppl=4.45, grad_norm=0.71, lr=4.10e-05, throughput=5644 tok/s
2025-11-25 15:55:19,821 - INFO - Epoch 1 Step 6280 (Global: 6280): loss=1.7413, ppl=5.70, grad_norm=0.75, lr=4.08e-05, throughput=5647 tok/s
2025-11-25 15:56:44,842 - INFO - Epoch 1 Step 6290 (Global: 6290): loss=1.6640, ppl=5.28, grad_norm=0.77, lr=4.07e-05, throughput=5646 tok/s
2025-11-25 15:58:10,080 - INFO - Epoch 1 Step 6300 (Global: 6300): loss=1.7255, ppl=5.62, grad_norm=0.80, lr=4.05e-05, throughput=5631 tok/s
2025-11-25 15:59:35,364 - INFO - Epoch 1 Step 6310 (Global: 6310): loss=1.6756, ppl=5.34, grad_norm=0.79, lr=4.03e-05, throughput=5628 tok/s
2025-11-25 16:01:00,364 - INFO - Epoch 1 Step 6320 (Global: 6320): loss=1.6610, ppl=5.26, grad_norm=0.75, lr=4.02e-05, throughput=5647 tok/s
2025-11-25 16:02:26,194 - INFO - Epoch 1 Step 6330 (Global: 6330): loss=1.7202, ppl=5.59, grad_norm=0.78, lr=4.00e-05, throughput=5592 tok/s
2025-11-25 16:03:55,718 - INFO - Epoch 1 Step 6340 (Global: 6340): loss=1.6272, ppl=5.09, grad_norm=0.73, lr=3.98e-05, throughput=5362 tok/s
2025-11-25 16:05:26,026 - INFO - Epoch 1 Step 6350 (Global: 6350): loss=1.5147, ppl=4.55, grad_norm=0.71, lr=3.97e-05, throughput=5315 tok/s
2025-11-25 16:06:51,531 - INFO - Epoch 1 Step 6360 (Global: 6360): loss=1.5976, ppl=4.94, grad_norm=0.81, lr=3.95e-05, throughput=5614 tok/s
2025-11-25 16:08:17,199 - INFO - Epoch 1 Step 6370 (Global: 6370): loss=1.7594, ppl=5.81, grad_norm=0.80, lr=3.93e-05, throughput=5603 tok/s
2025-11-25 16:09:42,901 - INFO - Epoch 1 Step 6380 (Global: 6380): loss=1.3936, ppl=4.03, grad_norm=0.68, lr=3.92e-05, throughput=5601 tok/s
2025-11-25 16:11:08,897 - INFO - Epoch 1 Step 6390 (Global: 6390): loss=1.7441, ppl=5.72, grad_norm=0.76, lr=3.90e-05, throughput=5582 tok/s
2025-11-25 16:12:33,876 - INFO - Epoch 1 Step 6400 (Global: 6400): loss=1.5182, ppl=4.56, grad_norm=0.75, lr=3.89e-05, throughput=5649 tok/s
2025-11-25 16:13:59,089 - INFO - Epoch 1 Step 6410 (Global: 6410): loss=1.3852, ppl=4.00, grad_norm=0.68, lr=3.87e-05, throughput=5633 tok/s
2025-11-25 16:15:24,337 - INFO - Epoch 1 Step 6420 (Global: 6420): loss=1.5526, ppl=4.72, grad_norm=0.75, lr=3.85e-05, throughput=5631 tok/s
2025-11-25 16:16:49,288 - INFO - Epoch 1 Step 6430 (Global: 6430): loss=1.4766, ppl=4.38, grad_norm=0.74, lr=3.84e-05, throughput=5650 tok/s
2025-11-25 16:18:14,570 - INFO - Epoch 1 Step 6440 (Global: 6440): loss=1.5988, ppl=4.95, grad_norm=0.77, lr=3.82e-05, throughput=5628 tok/s
2025-11-25 16:19:39,827 - INFO - Epoch 1 Step 6450 (Global: 6450): loss=1.5791, ppl=4.85, grad_norm=0.73, lr=3.80e-05, throughput=5630 tok/s
2025-11-25 16:21:05,221 - INFO - Epoch 1 Step 6460 (Global: 6460): loss=1.5925, ppl=4.92, grad_norm=0.77, lr=3.79e-05, throughput=5621 tok/s
2025-11-25 16:22:30,481 - INFO - Epoch 1 Step 6470 (Global: 6470): loss=1.7093, ppl=5.53, grad_norm=0.75, lr=3.77e-05, throughput=5630 tok/s
2025-11-25 16:23:55,633 - INFO - Epoch 1 Step 6480 (Global: 6480): loss=1.7019, ppl=5.48, grad_norm=0.76, lr=3.76e-05, throughput=5637 tok/s
2025-11-25 16:25:21,123 - INFO - Epoch 1 Step 6490 (Global: 6490): loss=1.4934, ppl=4.45, grad_norm=0.77, lr=3.74e-05, throughput=5615 tok/s
2025-11-25 16:26:46,508 - INFO - Epoch 1 Step 6500 (Global: 6500): loss=1.7999, ppl=6.05, grad_norm=1.30, lr=3.72e-05, throughput=5622 tok/s
2025-11-25 16:28:11,737 - INFO - Epoch 1 Step 6510 (Global: 6510): loss=1.8464, ppl=6.34, grad_norm=0.78, lr=3.71e-05, throughput=5632 tok/s
2025-11-25 16:29:36,610 - INFO - Epoch 1 Step 6520 (Global: 6520): loss=1.6469, ppl=5.19, grad_norm=0.76, lr=3.69e-05, throughput=5656 tok/s
2025-11-25 16:31:01,743 - INFO - Epoch 1 Step 6530 (Global: 6530): loss=1.5240, ppl=4.59, grad_norm=0.75, lr=3.67e-05, throughput=5638 tok/s
2025-11-25 16:32:26,996 - INFO - Epoch 1 Step 6540 (Global: 6540): loss=1.5371, ppl=4.65, grad_norm=0.73, lr=3.66e-05, throughput=5630 tok/s
2025-11-25 16:33:52,175 - INFO - Epoch 1 Step 6550 (Global: 6550): loss=1.6283, ppl=5.09, grad_norm=0.72, lr=3.64e-05, throughput=5635 tok/s
2025-11-25 16:35:17,205 - INFO - Epoch 1 Step 6560 (Global: 6560): loss=1.6718, ppl=5.32, grad_norm=0.74, lr=3.63e-05, throughput=5645 tok/s
2025-11-25 16:36:42,322 - INFO - Epoch 1 Step 6570 (Global: 6570): loss=1.5587, ppl=4.75, grad_norm=0.74, lr=3.61e-05, throughput=5639 tok/s
2025-11-25 16:38:07,398 - INFO - Epoch 1 Step 6580 (Global: 6580): loss=1.6471, ppl=5.19, grad_norm=0.79, lr=3.59e-05, throughput=5642 tok/s
2025-11-25 16:39:32,454 - INFO - Epoch 1 Step 6590 (Global: 6590): loss=1.4744, ppl=4.37, grad_norm=0.75, lr=3.58e-05, throughput=5643 tok/s
2025-11-25 16:40:57,817 - INFO - Epoch 1 Step 6600 (Global: 6600): loss=1.6932, ppl=5.44, grad_norm=0.79, lr=3.56e-05, throughput=5623 tok/s
2025-11-25 16:42:23,134 - INFO - Epoch 1 Step 6610 (Global: 6610): loss=1.6523, ppl=5.22, grad_norm=0.76, lr=3.55e-05, throughput=5626 tok/s
2025-11-25 16:43:48,427 - INFO - Epoch 1 Step 6620 (Global: 6620): loss=1.6432, ppl=5.17, grad_norm=0.75, lr=3.53e-05, throughput=5628 tok/s
2025-11-25 16:45:13,826 - INFO - Epoch 1 Step 6630 (Global: 6630): loss=1.5038, ppl=4.50, grad_norm=0.76, lr=3.51e-05, throughput=5621 tok/s
2025-11-25 16:46:39,034 - INFO - Epoch 1 Step 6640 (Global: 6640): loss=1.5216, ppl=4.58, grad_norm=0.74, lr=3.50e-05, throughput=5633 tok/s
2025-11-25 16:48:04,011 - INFO - Epoch 1 Step 6650 (Global: 6650): loss=1.6149, ppl=5.03, grad_norm=0.76, lr=3.48e-05, throughput=5649 tok/s
2025-11-25 16:49:29,173 - INFO - Epoch 1 Step 6660 (Global: 6660): loss=1.5354, ppl=4.64, grad_norm=0.79, lr=3.47e-05, throughput=5636 tok/s
2025-11-25 16:50:54,737 - INFO - Epoch 1 Step 6670 (Global: 6670): loss=1.4674, ppl=4.34, grad_norm=0.72, lr=3.45e-05, throughput=5610 tok/s
2025-11-25 16:52:20,208 - INFO - Epoch 1 Step 6680 (Global: 6680): loss=1.7865, ppl=5.97, grad_norm=0.79, lr=3.43e-05, throughput=5616 tok/s
2025-11-25 16:53:45,795 - INFO - Epoch 1 Step 6690 (Global: 6690): loss=1.6796, ppl=5.36, grad_norm=0.79, lr=3.42e-05, throughput=5608 tok/s
2025-11-25 16:55:11,414 - INFO - Epoch 1 Step 6700 (Global: 6700): loss=1.7570, ppl=5.80, grad_norm=0.77, lr=3.40e-05, throughput=5606 tok/s
2025-11-25 16:56:36,940 - INFO - Epoch 1 Step 6710 (Global: 6710): loss=1.6825, ppl=5.38, grad_norm=0.80, lr=3.39e-05, throughput=5612 tok/s
2025-11-25 16:58:02,921 - INFO - Epoch 1 Step 6720 (Global: 6720): loss=1.6433, ppl=5.17, grad_norm=0.77, lr=3.37e-05, throughput=5583 tok/s
2025-11-25 16:59:28,443 - INFO - Epoch 1 Step 6730 (Global: 6730): loss=1.6572, ppl=5.24, grad_norm=0.77, lr=3.35e-05, throughput=5613 tok/s
2025-11-25 17:00:54,223 - INFO - Epoch 1 Step 6740 (Global: 6740): loss=1.6641, ppl=5.28, grad_norm=0.73, lr=3.34e-05, throughput=5596 tok/s
2025-11-25 17:02:20,092 - INFO - Epoch 1 Step 6750 (Global: 6750): loss=1.5408, ppl=4.67, grad_norm=0.74, lr=3.32e-05, throughput=5590 tok/s
2025-11-25 17:03:45,963 - INFO - Epoch 1 Step 6760 (Global: 6760): loss=1.7917, ppl=6.00, grad_norm=0.80, lr=3.31e-05, throughput=5590 tok/s
2025-11-25 17:05:12,113 - INFO - Epoch 1 Step 6770 (Global: 6770): loss=1.5919, ppl=4.91, grad_norm=0.76, lr=3.29e-05, throughput=5572 tok/s
2025-11-25 17:06:37,704 - INFO - Epoch 1 Step 6780 (Global: 6780): loss=1.6270, ppl=5.09, grad_norm=0.77, lr=3.28e-05, throughput=5608 tok/s
2025-11-25 17:08:02,970 - INFO - Epoch 1 Step 6790 (Global: 6790): loss=1.7656, ppl=5.85, grad_norm=0.77, lr=3.26e-05, throughput=5629 tok/s
2025-11-25 17:09:28,636 - INFO - Epoch 1 Step 6800 (Global: 6800): loss=1.6961, ppl=5.45, grad_norm=0.75, lr=3.24e-05, throughput=5603 tok/s
2025-11-25 17:10:53,861 - INFO - Epoch 1 Step 6810 (Global: 6810): loss=1.5152, ppl=4.55, grad_norm=0.77, lr=3.23e-05, throughput=5632 tok/s
2025-11-25 17:12:19,177 - INFO - Epoch 1 Step 6820 (Global: 6820): loss=1.6440, ppl=5.18, grad_norm=0.79, lr=3.21e-05, throughput=5626 tok/s
2025-11-25 17:13:44,755 - INFO - Epoch 1 Step 6830 (Global: 6830): loss=1.6330, ppl=5.12, grad_norm=0.74, lr=3.20e-05, throughput=5609 tok/s
2025-11-25 17:15:10,320 - INFO - Epoch 1 Step 6840 (Global: 6840): loss=1.6345, ppl=5.13, grad_norm=0.75, lr=3.18e-05, throughput=5610 tok/s
2025-11-25 17:16:35,784 - INFO - Epoch 1 Step 6850 (Global: 6850): loss=1.5363, ppl=4.65, grad_norm=0.73, lr=3.17e-05, throughput=5616 tok/s
2025-11-25 17:18:01,427 - INFO - Epoch 1 Step 6860 (Global: 6860): loss=1.4773, ppl=4.38, grad_norm=0.72, lr=3.15e-05, throughput=5605 tok/s
2025-11-25 17:19:27,250 - INFO - Epoch 1 Step 6870 (Global: 6870): loss=1.7434, ppl=5.72, grad_norm=0.77, lr=3.13e-05, throughput=5593 tok/s
2025-11-25 17:20:52,613 - INFO - Epoch 1 Step 6880 (Global: 6880): loss=1.4292, ppl=4.18, grad_norm=0.71, lr=3.12e-05, throughput=5623 tok/s
2025-11-25 17:22:18,187 - INFO - Epoch 1 Step 6890 (Global: 6890): loss=1.7007, ppl=5.48, grad_norm=0.77, lr=3.10e-05, throughput=5609 tok/s
2025-11-25 17:23:44,203 - INFO - Epoch 1 Step 6900 (Global: 6900): loss=1.3914, ppl=4.02, grad_norm=0.74, lr=3.09e-05, throughput=5580 tok/s
2025-11-25 17:25:14,376 - INFO - Epoch 1 Step 6910 (Global: 6910): loss=1.6778, ppl=5.35, grad_norm=0.77, lr=3.07e-05, throughput=5323 tok/s
2025-11-25 17:26:45,704 - INFO - Epoch 1 Step 6920 (Global: 6920): loss=1.6518, ppl=5.22, grad_norm=0.78, lr=3.06e-05, throughput=5256 tok/s
2025-11-25 17:28:11,477 - INFO - Epoch 1 Step 6930 (Global: 6930): loss=1.4984, ppl=4.47, grad_norm=0.77, lr=3.04e-05, throughput=5596 tok/s
2025-11-25 17:29:37,004 - INFO - Epoch 1 Step 6940 (Global: 6940): loss=1.6317, ppl=5.11, grad_norm=0.75, lr=3.03e-05, throughput=5612 tok/s
2025-11-25 17:31:02,227 - INFO - Epoch 1 Step 6950 (Global: 6950): loss=1.5967, ppl=4.94, grad_norm=0.73, lr=3.01e-05, throughput=5632 tok/s
2025-11-25 17:32:27,746 - INFO - Epoch 1 Step 6960 (Global: 6960): loss=1.3361, ppl=3.80, grad_norm=0.71, lr=3.00e-05, throughput=5613 tok/s
2025-11-25 17:33:53,462 - INFO - Epoch 1 Step 6970 (Global: 6970): loss=1.6209, ppl=5.06, grad_norm=0.80, lr=2.98e-05, throughput=5600 tok/s
2025-11-25 17:35:19,100 - INFO - Epoch 1 Step 6980 (Global: 6980): loss=1.6808, ppl=5.37, grad_norm=0.77, lr=2.96e-05, throughput=5605 tok/s 2025-11-25 17:36:44,550 - INFO - Epoch 1 Step 6990 (Global: 6990): loss=1.5521, ppl=4.72, grad_norm=0.76, lr=2.95e-05, throughput=5617 tok/s 2025-11-25 17:38:10,511 - INFO - Epoch 1 Step 7000 (Global: 7000): loss=1.6652, ppl=5.29, grad_norm=0.84, lr=2.93e-05, throughput=5584 tok/s 2025-11-25 17:39:36,070 - INFO - Epoch 1 Step 7010 (Global: 7010): loss=1.8633, ppl=6.44, grad_norm=0.78, lr=2.92e-05, throughput=5610 tok/s 2025-11-25 17:41:01,434 - INFO - Epoch 1 Step 7020 (Global: 7020): loss=1.4496, ppl=4.26, grad_norm=0.77, lr=2.90e-05, throughput=5623 tok/s 2025-11-25 17:42:26,908 - INFO - Epoch 1 Step 7030 (Global: 7030): loss=1.7373, ppl=5.68, grad_norm=0.78, lr=2.89e-05, throughput=5616 tok/s 2025-11-25 17:43:52,357 - INFO - Epoch 1 Step 7040 (Global: 7040): loss=1.8558, ppl=6.40, grad_norm=0.82, lr=2.87e-05, throughput=5617 tok/s 2025-11-25 17:45:17,668 - INFO - Epoch 1 Step 7050 (Global: 7050): loss=1.7319, ppl=5.65, grad_norm=0.77, lr=2.86e-05, throughput=5627 tok/s 2025-11-25 17:46:43,058 - INFO - Epoch 1 Step 7060 (Global: 7060): loss=1.4503, ppl=4.26, grad_norm=0.72, lr=2.84e-05, throughput=5621 tok/s 2025-11-25 17:48:08,353 - INFO - Epoch 1 Step 7070 (Global: 7070): loss=1.6265, ppl=5.09, grad_norm=0.73, lr=2.83e-05, throughput=5628 tok/s 2025-11-25 17:49:33,788 - INFO - Epoch 1 Step 7080 (Global: 7080): loss=1.7828, ppl=5.95, grad_norm=0.77, lr=2.81e-05, throughput=5618 tok/s 2025-11-25 17:50:59,608 - INFO - Epoch 1 Step 7090 (Global: 7090): loss=1.7303, ppl=5.64, grad_norm=0.75, lr=2.80e-05, throughput=5593 tok/s 2025-11-25 17:52:25,293 - INFO - Epoch 1 Step 7100 (Global: 7100): loss=1.4036, ppl=4.07, grad_norm=0.72, lr=2.78e-05, throughput=5602 tok/s 2025-11-25 17:53:50,963 - INFO - Epoch 1 Step 7110 (Global: 7110): loss=1.5269, ppl=4.60, grad_norm=0.73, lr=2.77e-05, throughput=5603 tok/s 2025-11-25 17:55:16,471 - 
INFO - Epoch 1 Step 7120 (Global: 7120): loss=1.6756, ppl=5.34, grad_norm=0.79, lr=2.75e-05, throughput=5614 tok/s 2025-11-25 17:56:42,353 - INFO - Epoch 1 Step 7130 (Global: 7130): loss=1.7875, ppl=5.97, grad_norm=0.78, lr=2.74e-05, throughput=5589 tok/s 2025-11-25 17:58:07,551 - INFO - Epoch 1 Step 7140 (Global: 7140): loss=1.6250, ppl=5.08, grad_norm=0.77, lr=2.72e-05, throughput=5634 tok/s 2025-11-25 17:59:32,810 - INFO - Epoch 1 Step 7150 (Global: 7150): loss=1.7393, ppl=5.69, grad_norm=0.83, lr=2.71e-05, throughput=5630 tok/s 2025-11-25 18:00:58,532 - INFO - Epoch 1 Step 7160 (Global: 7160): loss=1.7309, ppl=5.65, grad_norm=0.75, lr=2.69e-05, throughput=5600 tok/s 2025-11-25 18:02:24,206 - INFO - Epoch 1 Step 7170 (Global: 7170): loss=1.5952, ppl=4.93, grad_norm=0.76, lr=2.68e-05, throughput=5603 tok/s 2025-11-25 18:03:49,977 - INFO - Epoch 1 Step 7180 (Global: 7180): loss=1.6981, ppl=5.46, grad_norm=0.75, lr=2.66e-05, throughput=5596 tok/s 2025-11-25 18:05:15,669 - INFO - Epoch 1 Step 7190 (Global: 7190): loss=1.6684, ppl=5.30, grad_norm=0.75, lr=2.65e-05, throughput=5602 tok/s 2025-11-25 18:06:41,218 - INFO - Epoch 1 Step 7200 (Global: 7200): loss=1.6422, ppl=5.17, grad_norm=0.75, lr=2.63e-05, throughput=5611 tok/s 2025-11-25 18:08:06,562 - INFO - Epoch 1 Step 7210 (Global: 7210): loss=1.9125, ppl=6.77, grad_norm=0.79, lr=2.62e-05, throughput=5624 tok/s 2025-11-25 18:09:32,051 - INFO - Epoch 1 Step 7220 (Global: 7220): loss=1.5910, ppl=4.91, grad_norm=0.73, lr=2.60e-05, throughput=5615 tok/s 2025-11-25 18:10:57,711 - INFO - Epoch 1 Step 7230 (Global: 7230): loss=1.6457, ppl=5.18, grad_norm=0.71, lr=2.59e-05, throughput=5604 tok/s 2025-11-25 18:12:23,159 - INFO - Epoch 1 Step 7240 (Global: 7240): loss=1.7826, ppl=5.95, grad_norm=0.79, lr=2.58e-05, throughput=5618 tok/s 2025-11-25 18:13:48,606 - INFO - Epoch 1 Step 7250 (Global: 7250): loss=1.7117, ppl=5.54, grad_norm=0.78, lr=2.56e-05, throughput=5618 tok/s 2025-11-25 18:15:13,533 - INFO - Epoch 1 Step 7260 
(Global: 7260): loss=1.4703, ppl=4.35, grad_norm=0.72, lr=2.55e-05, throughput=5652 tok/s 2025-11-25 18:16:39,120 - INFO - Epoch 1 Step 7270 (Global: 7270): loss=1.5442, ppl=4.68, grad_norm=0.73, lr=2.53e-05, throughput=5608 tok/s 2025-11-25 18:18:04,811 - INFO - Epoch 1 Step 7280 (Global: 7280): loss=1.6755, ppl=5.34, grad_norm=0.77, lr=2.52e-05, throughput=5602 tok/s 2025-11-25 18:19:30,520 - INFO - Epoch 1 Step 7290 (Global: 7290): loss=1.6397, ppl=5.15, grad_norm=0.75, lr=2.50e-05, throughput=5600 tok/s 2025-11-25 18:20:56,461 - INFO - Epoch 1 Step 7300 (Global: 7300): loss=1.6155, ppl=5.03, grad_norm=0.74, lr=2.49e-05, throughput=5585 tok/s 2025-11-25 18:22:21,543 - INFO - Epoch 1 Step 7310 (Global: 7310): loss=1.6467, ppl=5.19, grad_norm=0.74, lr=2.47e-05, throughput=5642 tok/s 2025-11-25 18:23:47,180 - INFO - Epoch 1 Step 7320 (Global: 7320): loss=1.8035, ppl=6.07, grad_norm=0.76, lr=2.46e-05, throughput=5605 tok/s 2025-11-25 18:25:12,421 - INFO - Epoch 1 Step 7330 (Global: 7330): loss=1.3353, ppl=3.80, grad_norm=0.71, lr=2.44e-05, throughput=5631 tok/s 2025-11-25 18:26:37,634 - INFO - Epoch 1 Step 7340 (Global: 7340): loss=1.5405, ppl=4.67, grad_norm=0.72, lr=2.43e-05, throughput=5633 tok/s 2025-11-25 18:28:03,156 - INFO - Epoch 1 Step 7350 (Global: 7350): loss=1.4401, ppl=4.22, grad_norm=0.74, lr=2.42e-05, throughput=5613 tok/s 2025-11-25 18:29:28,522 - INFO - Epoch 1 Step 7360 (Global: 7360): loss=1.7858, ppl=5.96, grad_norm=0.76, lr=2.40e-05, throughput=5623 tok/s 2025-11-25 18:30:53,945 - INFO - Epoch 1 Step 7370 (Global: 7370): loss=1.6323, ppl=5.12, grad_norm=0.83, lr=2.39e-05, throughput=5619 tok/s 2025-11-25 18:32:19,564 - INFO - Epoch 1 Step 7380 (Global: 7380): loss=1.6105, ppl=5.01, grad_norm=0.73, lr=2.37e-05, throughput=5606 tok/s 2025-11-25 18:33:45,143 - INFO - Epoch 1 Step 7390 (Global: 7390): loss=1.7824, ppl=5.94, grad_norm=0.76, lr=2.36e-05, throughput=5609 tok/s 2025-11-25 18:35:10,459 - INFO - Epoch 1 Step 7400 (Global: 7400): 
loss=1.6572, ppl=5.24, grad_norm=0.76, lr=2.34e-05, throughput=5626 tok/s 2025-11-25 18:36:35,821 - INFO - Epoch 1 Step 7410 (Global: 7410): loss=1.7780, ppl=5.92, grad_norm=0.81, lr=2.33e-05, throughput=5623 tok/s 2025-11-25 18:38:00,997 - INFO - Epoch 1 Step 7420 (Global: 7420): loss=1.8113, ppl=6.12, grad_norm=0.78, lr=2.32e-05, throughput=5635 tok/s 2025-11-25 18:39:26,508 - INFO - Epoch 1 Step 7430 (Global: 7430): loss=1.6624, ppl=5.27, grad_norm=0.72, lr=2.30e-05, throughput=5613 tok/s 2025-11-25 18:40:52,034 - INFO - Epoch 1 Step 7440 (Global: 7440): loss=1.5949, ppl=4.93, grad_norm=0.73, lr=2.29e-05, throughput=5612 tok/s 2025-11-25 18:42:17,423 - INFO - Epoch 1 Step 7450 (Global: 7450): loss=1.7434, ppl=5.72, grad_norm=0.74, lr=2.27e-05, throughput=5621 tok/s 2025-11-25 18:43:42,845 - INFO - Epoch 1 Step 7460 (Global: 7460): loss=1.7165, ppl=5.56, grad_norm=0.74, lr=2.26e-05, throughput=5619 tok/s 2025-11-25 18:45:07,844 - INFO - Epoch 1 Step 7470 (Global: 7470): loss=1.4833, ppl=4.41, grad_norm=0.71, lr=2.25e-05, throughput=5647 tok/s 2025-11-25 18:46:33,378 - INFO - Epoch 1 Step 7480 (Global: 7480): loss=1.6171, ppl=5.04, grad_norm=0.75, lr=2.23e-05, throughput=5612 tok/s 2025-11-25 18:47:58,791 - INFO - Epoch 1 Step 7490 (Global: 7490): loss=1.6340, ppl=5.12, grad_norm=0.73, lr=2.22e-05, throughput=5620 tok/s 2025-11-25 18:49:24,071 - INFO - Epoch 1 Step 7500 (Global: 7500): loss=1.6602, ppl=5.26, grad_norm=0.77, lr=2.20e-05, throughput=5629 tok/s 2025-11-25 18:50:48,993 - INFO - Epoch 1 Step 7510 (Global: 7510): loss=1.4687, ppl=4.34, grad_norm=0.75, lr=2.19e-05, throughput=5652 tok/s 2025-11-25 18:52:14,484 - INFO - Epoch 1 Step 7520 (Global: 7520): loss=1.5647, ppl=4.78, grad_norm=0.78, lr=2.18e-05, throughput=5615 tok/s 2025-11-25 18:53:39,947 - INFO - Epoch 1 Step 7530 (Global: 7530): loss=1.4838, ppl=4.41, grad_norm=0.78, lr=2.16e-05, throughput=5617 tok/s 2025-11-25 18:55:05,074 - INFO - Epoch 1 Step 7540 (Global: 7540): loss=1.6243, ppl=5.07, 
grad_norm=0.78, lr=2.15e-05, throughput=5639 tok/s 2025-11-25 18:56:30,163 - INFO - Epoch 1 Step 7550 (Global: 7550): loss=1.6721, ppl=5.32, grad_norm=0.76, lr=2.14e-05, throughput=5641 tok/s 2025-11-25 18:57:55,331 - INFO - Epoch 1 Step 7560 (Global: 7560): loss=1.5804, ppl=4.86, grad_norm=0.73, lr=2.12e-05, throughput=5636 tok/s 2025-11-25 18:59:20,629 - INFO - Epoch 1 Step 7570 (Global: 7570): loss=1.6828, ppl=5.38, grad_norm=0.80, lr=2.11e-05, throughput=5627 tok/s 2025-11-25 19:00:45,893 - INFO - Epoch 1 Step 7580 (Global: 7580): loss=1.3782, ppl=3.97, grad_norm=0.71, lr=2.09e-05, throughput=5630 tok/s 2025-11-25 19:02:11,009 - INFO - Epoch 1 Step 7590 (Global: 7590): loss=1.6536, ppl=5.23, grad_norm=0.88, lr=2.08e-05, throughput=5639 tok/s 2025-11-25 19:03:36,354 - INFO - Epoch 1 Step 7600 (Global: 7600): loss=1.7160, ppl=5.56, grad_norm=0.76, lr=2.07e-05, throughput=5624 tok/s 2025-11-25 19:05:01,326 - INFO - Epoch 1 Step 7610 (Global: 7610): loss=1.6782, ppl=5.36, grad_norm=0.75, lr=2.05e-05, throughput=5649 tok/s 2025-11-25 19:06:26,564 - INFO - Epoch 1 Step 7620 (Global: 7620): loss=1.4860, ppl=4.42, grad_norm=0.81, lr=2.04e-05, throughput=5631 tok/s 2025-11-25 19:07:51,976 - INFO - Epoch 1 Step 7630 (Global: 7630): loss=1.6588, ppl=5.25, grad_norm=0.74, lr=2.03e-05, throughput=5620 tok/s 2025-11-25 19:09:17,324 - INFO - Epoch 1 Step 7640 (Global: 7640): loss=1.7365, ppl=5.68, grad_norm=0.75, lr=2.01e-05, throughput=5624 tok/s 2025-11-25 19:10:42,564 - INFO - Epoch 1 Step 7650 (Global: 7650): loss=1.5926, ppl=4.92, grad_norm=0.74, lr=2.00e-05, throughput=5631 tok/s 2025-11-25 19:12:08,119 - INFO - Epoch 1 Step 7660 (Global: 7660): loss=1.5941, ppl=4.92, grad_norm=0.73, lr=1.99e-05, throughput=5610 tok/s 2025-11-25 19:13:33,747 - INFO - Epoch 1 Step 7670 (Global: 7670): loss=1.5969, ppl=4.94, grad_norm=0.81, lr=1.97e-05, throughput=5606 tok/s 2025-11-25 19:14:58,605 - INFO - Epoch 1 Step 7680 (Global: 7680): loss=1.4464, ppl=4.25, grad_norm=0.73, 
lr=1.96e-05, throughput=5657 tok/s 2025-11-25 19:16:23,741 - INFO - Epoch 1 Step 7690 (Global: 7690): loss=1.4202, ppl=4.14, grad_norm=0.68, lr=1.95e-05, throughput=5638 tok/s 2025-11-25 19:17:48,977 - INFO - Epoch 1 Step 7700 (Global: 7700): loss=1.3846, ppl=3.99, grad_norm=0.71, lr=1.93e-05, throughput=5631 tok/s 2025-11-25 19:19:14,359 - INFO - Epoch 1 Step 7710 (Global: 7710): loss=1.5065, ppl=4.51, grad_norm=0.70, lr=1.92e-05, throughput=5622 tok/s 2025-11-25 19:20:40,037 - INFO - Epoch 1 Step 7720 (Global: 7720): loss=1.7349, ppl=5.67, grad_norm=0.76, lr=1.91e-05, throughput=5602 tok/s 2025-11-25 19:22:05,985 - INFO - Epoch 1 Step 7730 (Global: 7730): loss=1.8404, ppl=6.30, grad_norm=0.76, lr=1.89e-05, throughput=5585 tok/s 2025-11-25 19:23:31,779 - INFO - Epoch 1 Step 7740 (Global: 7740): loss=1.5784, ppl=4.85, grad_norm=0.75, lr=1.88e-05, throughput=5595 tok/s 2025-11-25 19:24:57,300 - INFO - Epoch 1 Step 7750 (Global: 7750): loss=1.5219, ppl=4.58, grad_norm=0.71, lr=1.87e-05, throughput=5613 tok/s 2025-11-25 19:26:22,752 - INFO - Epoch 1 Step 7760 (Global: 7760): loss=1.6955, ppl=5.45, grad_norm=0.76, lr=1.85e-05, throughput=5617 tok/s 2025-11-25 19:27:48,181 - INFO - Epoch 1 Step 7770 (Global: 7770): loss=1.4370, ppl=4.21, grad_norm=0.70, lr=1.84e-05, throughput=5619 tok/s 2025-11-25 19:29:13,681 - INFO - Epoch 1 Step 7780 (Global: 7780): loss=1.5192, ppl=4.57, grad_norm=0.71, lr=1.83e-05, throughput=5614 tok/s 2025-11-25 19:30:39,183 - INFO - Epoch 1 Step 7790 (Global: 7790): loss=1.4970, ppl=4.47, grad_norm=0.70, lr=1.82e-05, throughput=5614 tok/s 2025-11-25 19:32:04,555 - INFO - Epoch 1 Step 7800 (Global: 7800): loss=1.5440, ppl=4.68, grad_norm=0.72, lr=1.80e-05, throughput=5623 tok/s 2025-11-25 19:33:29,388 - INFO - Epoch 1 Step 7810 (Global: 7810): loss=1.7298, ppl=5.64, grad_norm=0.79, lr=1.79e-05, throughput=5658 tok/s 2025-11-25 19:34:54,461 - INFO - Epoch 1 Step 7820 (Global: 7820): loss=1.5661, ppl=4.79, grad_norm=0.73, lr=1.78e-05, 
throughput=5642 tok/s 2025-11-25 19:36:19,558 - INFO - Epoch 1 Step 7830 (Global: 7830): loss=1.7766, ppl=5.91, grad_norm=0.78, lr=1.76e-05, throughput=5641 tok/s 2025-11-25 19:37:45,359 - INFO - Epoch 1 Step 7840 (Global: 7840): loss=1.5877, ppl=4.89, grad_norm=0.79, lr=1.75e-05, throughput=5594 tok/s 2025-11-25 19:39:11,237 - INFO - Epoch 1 Step 7850 (Global: 7850): loss=1.5725, ppl=4.82, grad_norm=0.76, lr=1.74e-05, throughput=5589 tok/s 2025-11-25 19:40:36,565 - INFO - Epoch 1 Step 7860 (Global: 7860): loss=1.9461, ppl=7.00, grad_norm=0.77, lr=1.73e-05, throughput=5625 tok/s 2025-11-25 19:42:02,029 - INFO - Epoch 1 Step 7870 (Global: 7870): loss=1.4133, ppl=4.11, grad_norm=0.73, lr=1.71e-05, throughput=5616 tok/s 2025-11-25 19:43:27,127 - INFO - Epoch 1 Step 7880 (Global: 7880): loss=1.5666, ppl=4.79, grad_norm=0.71, lr=1.70e-05, throughput=5641 tok/s 2025-11-25 19:44:52,677 - INFO - Epoch 1 Step 7890 (Global: 7890): loss=1.6568, ppl=5.24, grad_norm=0.78, lr=1.69e-05, throughput=5611 tok/s 2025-11-25 19:46:18,212 - INFO - Epoch 1 Step 7900 (Global: 7900): loss=1.6138, ppl=5.02, grad_norm=0.82, lr=1.68e-05, throughput=5612 tok/s 2025-11-25 19:47:43,347 - INFO - Epoch 1 Step 7910 (Global: 7910): loss=1.7010, ppl=5.48, grad_norm=0.78, lr=1.66e-05, throughput=5638 tok/s 2025-11-25 19:49:08,655 - INFO - Epoch 1 Step 7920 (Global: 7920): loss=1.8387, ppl=6.29, grad_norm=0.77, lr=1.65e-05, throughput=5627 tok/s 2025-11-25 19:50:34,011 - INFO - Epoch 1 Step 7930 (Global: 7930): loss=1.8382, ppl=6.29, grad_norm=0.79, lr=1.64e-05, throughput=5624 tok/s 2025-11-25 19:51:59,271 - INFO - Epoch 1 Step 7940 (Global: 7940): loss=1.8696, ppl=6.49, grad_norm=0.79, lr=1.63e-05, throughput=5630 tok/s 2025-11-25 19:53:24,794 - INFO - Epoch 1 Step 7950 (Global: 7950): loss=1.6068, ppl=4.99, grad_norm=0.81, lr=1.61e-05, throughput=5613 tok/s 2025-11-25 19:54:53,390 - INFO - Epoch 1 Step 7960 (Global: 7960): loss=1.6021, ppl=4.96, grad_norm=0.77, lr=1.60e-05, throughput=5418 tok/s 
2025-11-25 19:56:24,399 - INFO - Epoch 1 Step 7970 (Global: 7970): loss=1.4954, ppl=4.46, grad_norm=0.72, lr=1.59e-05, throughput=5274 tok/s 2025-11-25 19:57:50,136 - INFO - Epoch 1 Step 7980 (Global: 7980): loss=1.5540, ppl=4.73, grad_norm=0.76, lr=1.58e-05, throughput=5599 tok/s 2025-11-25 19:59:16,380 - INFO - Epoch 1 Step 7990 (Global: 7990): loss=1.6255, ppl=5.08, grad_norm=0.74, lr=1.56e-05, throughput=5566 tok/s 2025-11-25 20:00:43,277 - INFO - Epoch 1 Step 8000 (Global: 8000): loss=1.4816, ppl=4.40, grad_norm=0.71, lr=1.55e-05, throughput=5524 tok/s 2025-11-25 20:00:43,278 - INFO - Running validation at step 8000... 2025-11-25 20:04:55,229 - INFO - Validation loss: 1.6169, perplexity: 5.04 2025-11-25 20:05:20,857 - INFO - Saved checkpoint to outputs/production_text_ctx277_lm_20251125_003839/best_checkpoint.pt 2025-11-25 20:05:20,867 - INFO - New best validation loss: 1.6169, perplexity: 5.04 2025-11-25 20:06:47,311 - INFO - Epoch 1 Step 8010 (Global: 8010): loss=1.4533, ppl=4.28, grad_norm=0.74, lr=1.54e-05, throughput=5554 tok/s 2025-11-25 20:08:13,684 - INFO - Epoch 1 Step 8020 (Global: 8020): loss=1.6154, ppl=5.03, grad_norm=0.73, lr=1.53e-05, throughput=5557 tok/s 2025-11-25 20:09:40,745 - INFO - Epoch 1 Step 8030 (Global: 8030): loss=1.6293, ppl=5.10, grad_norm=0.81, lr=1.52e-05, throughput=5513 tok/s 2025-11-25 20:11:07,908 - INFO - Epoch 1 Step 8040 (Global: 8040): loss=1.3673, ppl=3.92, grad_norm=0.71, lr=1.50e-05, throughput=5507 tok/s 2025-11-25 20:12:34,584 - INFO - Epoch 1 Step 8050 (Global: 8050): loss=1.5964, ppl=4.94, grad_norm=0.79, lr=1.49e-05, throughput=5538 tok/s 2025-11-25 20:14:00,023 - INFO - Epoch 1 Step 8060 (Global: 8060): loss=1.6308, ppl=5.11, grad_norm=0.75, lr=1.48e-05, throughput=5618 tok/s 2025-11-25 20:15:25,564 - INFO - Epoch 1 Step 8070 (Global: 8070): loss=1.5364, ppl=4.65, grad_norm=0.72, lr=1.47e-05, throughput=5611 tok/s 2025-11-25 20:16:55,318 - INFO - Epoch 1 Step 8080 (Global: 8080): loss=1.4510, ppl=4.27, 
grad_norm=0.77, lr=1.46e-05, throughput=5348 tok/s 2025-11-25 20:18:22,802 - INFO - Epoch 1 Step 8090 (Global: 8090): loss=1.7397, ppl=5.70, grad_norm=0.74, lr=1.44e-05, throughput=5487 tok/s 2025-11-25 20:19:55,399 - INFO - Epoch 1 Step 8100 (Global: 8100): loss=1.4536, ppl=4.28, grad_norm=0.70, lr=1.43e-05, throughput=5184 tok/s 2025-11-25 20:21:25,567 - INFO - Epoch 1 Step 8110 (Global: 8110): loss=1.4514, ppl=4.27, grad_norm=0.68, lr=1.42e-05, throughput=5323 tok/s 2025-11-25 20:22:51,056 - INFO - Epoch 1 Step 8120 (Global: 8120): loss=1.5569, ppl=4.74, grad_norm=0.75, lr=1.41e-05, throughput=5615 tok/s 2025-11-25 20:24:20,623 - INFO - Epoch 1 Step 8130 (Global: 8130): loss=1.5055, ppl=4.51, grad_norm=0.73, lr=1.40e-05, throughput=5359 tok/s 2025-11-25 20:25:51,380 - INFO - Epoch 1 Step 8140 (Global: 8140): loss=1.7517, ppl=5.76, grad_norm=0.80, lr=1.39e-05, throughput=5289 tok/s 2025-11-25 20:27:16,960 - INFO - Epoch 1 Step 8150 (Global: 8150): loss=1.4664, ppl=4.33, grad_norm=0.74, lr=1.37e-05, throughput=5609 tok/s 2025-11-25 20:28:42,061 - INFO - Epoch 1 Step 8160 (Global: 8160): loss=1.6288, ppl=5.10, grad_norm=0.73, lr=1.36e-05, throughput=5640 tok/s 2025-11-25 20:30:10,813 - INFO - Epoch 1 Step 8170 (Global: 8170): loss=1.7347, ppl=5.67, grad_norm=0.75, lr=1.35e-05, throughput=5408 tok/s 2025-11-25 20:31:39,768 - INFO - Epoch 1 Step 8180 (Global: 8180): loss=1.6820, ppl=5.38, grad_norm=0.75, lr=1.34e-05, throughput=5396 tok/s 2025-11-25 20:33:06,651 - INFO - Epoch 1 Step 8190 (Global: 8190): loss=1.5042, ppl=4.50, grad_norm=0.73, lr=1.33e-05, throughput=5525 tok/s 2025-11-25 20:34:38,142 - INFO - Epoch 1 Step 8200 (Global: 8200): loss=1.5898, ppl=4.90, grad_norm=0.71, lr=1.32e-05, throughput=5247 tok/s 2025-11-25 20:36:08,330 - INFO - Epoch 1 Step 8210 (Global: 8210): loss=1.6467, ppl=5.19, grad_norm=0.73, lr=1.31e-05, throughput=5322 tok/s 2025-11-25 20:37:37,378 - INFO - Epoch 1 Step 8220 (Global: 8220): loss=1.5990, ppl=4.95, grad_norm=0.73, 
lr=1.29e-05, throughput=5390 tok/s 2025-11-25 20:39:07,611 - INFO - Epoch 1 Step 8230 (Global: 8230): loss=1.4486, ppl=4.26, grad_norm=0.72, lr=1.28e-05, throughput=5320 tok/s 2025-11-25 20:40:37,514 - INFO - Epoch 1 Step 8240 (Global: 8240): loss=1.6146, ppl=5.03, grad_norm=0.75, lr=1.27e-05, throughput=5339 tok/s 2025-11-25 20:42:05,466 - INFO - Epoch 1 Step 8250 (Global: 8250): loss=1.4925, ppl=4.45, grad_norm=0.70, lr=1.26e-05, throughput=5458 tok/s 2025-11-25 20:43:34,388 - INFO - Epoch 1 Step 8260 (Global: 8260): loss=1.8820, ppl=6.57, grad_norm=0.79, lr=1.25e-05, throughput=5398 tok/s 2025-11-25 20:45:02,983 - INFO - Epoch 1 Step 8270 (Global: 8270): loss=1.7283, ppl=5.63, grad_norm=0.77, lr=1.24e-05, throughput=5418 tok/s 2025-11-25 20:46:30,832 - INFO - Epoch 1 Step 8280 (Global: 8280): loss=1.7492, ppl=5.75, grad_norm=0.75, lr=1.23e-05, throughput=5464 tok/s 2025-11-25 20:47:58,420 - INFO - Epoch 1 Step 8290 (Global: 8290): loss=1.7486, ppl=5.75, grad_norm=0.78, lr=1.22e-05, throughput=5480 tok/s 2025-11-25 20:49:26,126 - INFO - Epoch 1 Step 8300 (Global: 8300): loss=1.6381, ppl=5.15, grad_norm=0.78, lr=1.21e-05, throughput=5473 tok/s 2025-11-25 20:50:54,926 - INFO - Epoch 1 Step 8310 (Global: 8310): loss=1.7119, ppl=5.54, grad_norm=0.75, lr=1.20e-05, throughput=5405 tok/s 2025-11-25 20:52:25,249 - INFO - Epoch 1 Step 8320 (Global: 8320): loss=1.5078, ppl=4.52, grad_norm=0.72, lr=1.18e-05, throughput=5315 tok/s 2025-11-25 20:53:55,698 - INFO - Epoch 1 Step 8330 (Global: 8330): loss=2.0424, ppl=7.71, grad_norm=0.81, lr=1.17e-05, throughput=5307 tok/s 2025-11-25 20:55:23,321 - INFO - Epoch 1 Step 8340 (Global: 8340): loss=1.4758, ppl=4.37, grad_norm=0.75, lr=1.16e-05, throughput=5478 tok/s 2025-11-25 20:56:52,009 - INFO - Epoch 1 Step 8350 (Global: 8350): loss=1.7300, ppl=5.64, grad_norm=0.78, lr=1.15e-05, throughput=5412 tok/s 2025-11-25 20:58:20,630 - INFO - Epoch 1 Step 8360 (Global: 8360): loss=1.7672, ppl=5.85, grad_norm=0.76, lr=1.14e-05, 
throughput=5416 tok/s 2025-11-25 20:59:51,847 - INFO - Epoch 1 Step 8370 (Global: 8370): loss=1.4865, ppl=4.42, grad_norm=0.71, lr=1.13e-05, throughput=5262 tok/s 2025-11-25 21:01:23,676 - INFO - Epoch 1 Step 8380 (Global: 8380): loss=1.6979, ppl=5.46, grad_norm=0.79, lr=1.12e-05, throughput=5227 tok/s 2025-11-25 21:02:53,298 - INFO - Epoch 1 Step 8390 (Global: 8390): loss=1.4842, ppl=4.41, grad_norm=0.71, lr=1.11e-05, throughput=5356 tok/s 2025-11-25 21:04:22,250 - INFO - Epoch 1 Step 8400 (Global: 8400): loss=1.8416, ppl=6.31, grad_norm=0.78, lr=1.10e-05, throughput=5396 tok/s 2025-11-25 21:05:51,532 - INFO - Epoch 1 Step 8410 (Global: 8410): loss=1.4921, ppl=4.45, grad_norm=0.74, lr=1.09e-05, throughput=5376 tok/s 2025-11-25 21:07:19,600 - INFO - Epoch 1 Step 8420 (Global: 8420): loss=1.7458, ppl=5.73, grad_norm=0.76, lr=1.08e-05, throughput=5450 tok/s 2025-11-25 21:07:32,878 - WARNING - socket.send() raised exception. 
2025-11-26 20:50:29,960 - INFO - Starting training with args: Namespace(regime='text', data_path='data/training/splits_510k/train_arrow', output_dir='outputs/production_text_ctx277_lm_20251125_003839', objective='lm', val_data_path='data/training/splits_510k/val_arrow', max_samples=None, vision_mode='small', text_context_tokens=277, hybrid_text_tokens=0, vision_prompt=None, train_encoder=False, encoder_lr=1e-05, compression_window_size=9, compression_stride=9, subsample_strategy='regular', subsample_count=None, projection_dim=None, train_projection=False, compression_target=None, conv_kernel=5, timestamp='20251125_003839', batch_size=12, gradient_accumulation_steps=4, learning_rate=0.0001, weight_decay=0.01, num_epochs=1, warmup_ratio=0.1, max_grad_norm=1.0, log_steps=10, save_steps=0, eval_steps=2000, initial_validation=False, validation_only=False, no_checkpoints=False, num_qualitative_samples=0, max_generation_tokens=200, use_wandb=True, wandb_project='vision-compression-2', wandb_run_name='production_text_ctx277_lm_20251125_003839', resume_from_checkpoint='outputs/production_text_ctx277_lm_20251125_003839/best_checkpoint.pt', resume='outputs/production_text_ctx277_lm_20251125_003839/best_checkpoint.pt', init_from_checkpoint=None, allow_objective_switch=False, aux_loss_weight=0.5, num_workers=16, prefetch_factor=4, seed=42, eval_seed=42, debug_log_sample_ids=False, device='cuda', compile=False, compile_mode='default', use_optimized_model=True, use_encoder_checkpointing=True, use_decoder_checkpointing=True, use_8bit_optimizer=True) 2025-11-26 20:50:29,961 - INFO - Resuming training from checkpoint: outputs/production_text_ctx277_lm_20251125_003839/best_checkpoint.pt 2025-11-26 20:50:29,961 - INFO - Continuing outputs in directory: outputs/production_text_ctx277_lm_20251125_003839 2025-11-26 20:50:29,961 - INFO - Setting random seed: 42 2025-11-26 20:50:30,306 - INFO - Peeking checkpoint metadata from 
outputs/production_text_ctx277_lm_20251125_003839/best_checkpoint.pt 2025-11-26 20:50:35,104 - INFO - Checkpoint metadata: epoch=0, batch_idx=31999, global_step=8000 2025-11-26 20:50:35,104 - INFO - W&B run ID: y619ou6b 2025-11-26 20:50:35,150 - INFO - Checkpoint has WandB run ID: y619ou6b 2025-11-26 20:50:35,150 - INFO - Creating fresh WandB run (not resuming to avoid stale data) 2025-11-26 20:50:36,473 - INFO - Initialized W&B run: vision-compression-2/production_text_ctx277_lm_20251125_003839 (ID: xjb0fgkh) 2025-11-26 20:50:36,473 - INFO - Loading model and tokenizer... 2025-11-26 20:50:44,880 - INFO - Enabling decoder gradient checkpointing... 2025-11-26 20:50:44,886 - INFO - ✓ Decoder checkpointing enabled for 12 transformer layers 2025-11-26 20:50:44,887 - INFO - Expected: ~30-50% activation memory reduction, ~15-20% compute overhead 2025-11-26 20:50:44,912 - INFO - Created Text Baseline trainer 2025-11-26 20:50:44,912 - INFO - Training objective: lm 2025-11-26 20:50:44,938 - INFO - Logged parameter counts to W&B: total=2,934,734,080, trainable=2,934,734,080, encoder=0, decoder=2,934,734,080 2025-11-26 20:50:44,938 - INFO - Loading training data from data/training/splits_510k/train_arrow 2025-11-26 20:50:44,938 - INFO - Detected Arrow format: data/training/splits_510k/train_arrow 2025-11-26 20:50:44,938 - INFO - Loading Arrow dataset from data/training/splits_510k/train_arrow (memory-mapped) 2025-11-26 20:50:44,982 - INFO - Loaded 500,000 samples from data/training/splits_510k/train_arrow (memory-mapped) 2025-11-26 20:50:44,982 - INFO - Text baseline context tokens per sample: 277 2025-11-26 20:50:44,982 - INFO - Mid-epoch resume: skipping first 384000 samples at sampler level (batch 32000) 2025-11-26 20:50:45,061 - INFO - Loading validation data from data/training/splits_510k/val_arrow 2025-11-26 20:50:45,062 - INFO - Detected Arrow format: data/training/splits_510k/val_arrow 2025-11-26 20:50:45,062 - INFO - Loading Arrow dataset from 
data/training/splits_510k/val_arrow (memory-mapped) 2025-11-26 20:50:45,068 - INFO - Loaded 10,000 samples from data/training/splits_510k/val_arrow (memory-mapped) 2025-11-26 20:50:45,068 - INFO - Validation text context tokens per sample: 277 2025-11-26 20:50:47,138 - INFO - Created 8-bit AdamW optimizer (bitsandbytes): Learning rate: 0.0001 Memory savings: ~75% optimizer state (16.8GB for 2.8B params) Expected overhead: ~2-5% 2025-11-26 20:50:47,138 - INFO - Created scheduler with warmup_steps=1041, total_steps=10417 2025-11-26 20:50:47,146 - INFO - Logged optimizer config to W&B: type=adamw_8bit, memory=5.47GB 2025-11-26 20:50:47,146 - INFO - Loading checkpoint state (model/optimizer/scheduler) from outputs/production_text_ctx277_lm_20251125_003839/best_checkpoint.pt 2025-11-26 20:50:54,123 - INFO - ✓ Successfully loaded optimizer state from checkpoint 2025-11-26 20:50:54,123 - INFO - ✓ Successfully loaded scheduler state from checkpoint 2025-11-26 20:50:54,124 - WARNING - Failed to restore RNG states: RNG state must be a torch.ByteTensor. Continuing with current RNG state. 2025-11-26 20:50:54,151 - INFO - Restored training state: epoch=0, batch_idx=31999, global_step=8000, best_val_loss=1.6169 2025-11-26 20:50:54,152 - INFO - Resuming mid-epoch: will skip first 32000 batches of epoch 0 2025-11-26 20:50:54,152 - INFO - Starting training loop... 2025-11-26 20:50:54,153 - INFO - ====================================================================== 2025-11-26 20:50:54,153 - INFO - Epoch 1/1 2025-11-26 20:50:54,153 - INFO - ====================================================================== 2025-11-26 20:50:59,382 - WARNING - `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`. 
2025-11-26 20:51:00,631 - INFO - Effective context tokens (per-sample): 278 | Compression ratio: 3.60x
2025-11-26 20:51:00,631 - INFO - Target tokens per sample: 1000
2025-11-26 20:52:25,972 - INFO - Epoch 1 Step 10 (Global: 8010): loss=1.4532, ppl=4.28, grad_norm=0.74, lr=1.54e-05, throughput=5228 tok/s
2025-11-26 20:53:51,307 - INFO - Epoch 1 Step 20 (Global: 8020): loss=1.6158, ppl=5.03, grad_norm=0.73, lr=1.53e-05, throughput=5625 tok/s
2025-11-26 20:55:16,479 - INFO - Epoch 1 Step 30 (Global: 8030): loss=1.6289, ppl=5.10, grad_norm=0.81, lr=1.52e-05, throughput=5636 tok/s
2025-11-26 20:56:41,437 - INFO - Epoch 1 Step 40 (Global: 8040): loss=1.3674, ppl=3.93, grad_norm=0.71, lr=1.50e-05, throughput=5650 tok/s
2025-11-26 20:58:06,641 - INFO - Epoch 1 Step 50 (Global: 8050): loss=1.5966, ppl=4.94, grad_norm=0.79, lr=1.49e-05, throughput=5634 tok/s
2025-11-26 20:59:32,329 - INFO - Epoch 1 Step 60 (Global: 8060): loss=1.6306, ppl=5.11, grad_norm=0.75, lr=1.48e-05, throughput=5602 tok/s
2025-11-26 21:00:57,380 - INFO - Epoch 1 Step 70 (Global: 8070): loss=1.5364, ppl=4.65, grad_norm=0.72, lr=1.47e-05, throughput=5644 tok/s
2025-11-26 21:02:22,518 - INFO - Epoch 1 Step 80 (Global: 8080): loss=1.4508, ppl=4.27, grad_norm=0.77, lr=1.46e-05, throughput=5638 tok/s
2025-11-26 21:03:47,861 - INFO - Epoch 1 Step 90 (Global: 8090): loss=1.7397, ppl=5.70, grad_norm=0.74, lr=1.44e-05, throughput=5624 tok/s
2025-11-26 21:05:12,852 - INFO - Epoch 1 Step 100 (Global: 8100): loss=1.4535, ppl=4.28, grad_norm=0.70, lr=1.43e-05, throughput=5648 tok/s
2025-11-26 21:06:38,090 - INFO - Epoch 1 Step 110 (Global: 8110): loss=1.4510, ppl=4.27, grad_norm=0.68, lr=1.42e-05, throughput=5631 tok/s
2025-11-26 21:08:02,810 - INFO - Epoch 1 Step 120 (Global: 8120): loss=1.5567, ppl=4.74, grad_norm=0.75, lr=1.41e-05, throughput=5666 tok/s
2025-11-26 21:09:27,858 - INFO - Epoch 1 Step 130 (Global: 8130): loss=1.5054, ppl=4.51, grad_norm=0.73, lr=1.40e-05, throughput=5644 tok/s
2025-11-26 21:10:52,906 - INFO - Epoch 1 Step 140 (Global: 8140): loss=1.7518, ppl=5.77, grad_norm=0.80, lr=1.39e-05, throughput=5644 tok/s
2025-11-26 21:12:17,627 - INFO - Epoch 1 Step 150 (Global: 8150): loss=1.4663, ppl=4.33, grad_norm=0.74, lr=1.37e-05, throughput=5666 tok/s
2025-11-26 21:13:42,966 - INFO - Epoch 1 Step 160 (Global: 8160): loss=1.6288, ppl=5.10, grad_norm=0.73, lr=1.36e-05, throughput=5625 tok/s
2025-11-26 21:15:08,711 - INFO - Epoch 1 Step 170 (Global: 8170): loss=1.7347, ppl=5.67, grad_norm=0.75, lr=1.35e-05, throughput=5598 tok/s
2025-11-26 21:16:33,739 - INFO - Epoch 1 Step 180 (Global: 8180): loss=1.6823, ppl=5.38, grad_norm=0.75, lr=1.34e-05, throughput=5645 tok/s
2025-11-26 21:17:58,642 - INFO - Epoch 1 Step 190 (Global: 8190): loss=1.5038, ppl=4.50, grad_norm=0.73, lr=1.33e-05, throughput=5654 tok/s
2025-11-26 21:19:24,034 - INFO - Epoch 1 Step 200 (Global: 8200): loss=1.5895, ppl=4.90, grad_norm=0.71, lr=1.32e-05, throughput=5621 tok/s
2025-11-26 21:20:49,192 - INFO - Epoch 1 Step 210 (Global: 8210): loss=1.6468, ppl=5.19, grad_norm=0.73, lr=1.31e-05, throughput=5637 tok/s
2025-11-26 21:22:14,084 - INFO - Epoch 1 Step 220 (Global: 8220): loss=1.5994, ppl=4.95, grad_norm=0.73, lr=1.29e-05, throughput=5654 tok/s
2025-11-26 21:23:39,034 - INFO - Epoch 1 Step 230 (Global: 8230): loss=1.4486, ppl=4.26, grad_norm=0.72, lr=1.28e-05, throughput=5650 tok/s
2025-11-26 21:25:03,896 - INFO - Epoch 1 Step 240 (Global: 8240): loss=1.6148, ppl=5.03, grad_norm=0.75, lr=1.27e-05, throughput=5656 tok/s
2025-11-26 21:26:29,154 - INFO - Epoch 1 Step 250 (Global: 8250): loss=1.4924, ppl=4.45, grad_norm=0.70, lr=1.26e-05, throughput=5630 tok/s
2025-11-26 21:27:54,299 - INFO - Epoch 1 Step 260 (Global: 8260): loss=1.8818, ppl=6.57, grad_norm=0.79, lr=1.25e-05, throughput=5638 tok/s
2025-11-26 21:29:19,074 - INFO - Epoch 1 Step 270 (Global: 8270): loss=1.7283, ppl=5.63, grad_norm=0.77, lr=1.24e-05, throughput=5662 tok/s
2025-11-26 21:30:43,916 - INFO - Epoch 1 Step 280 (Global: 8280): loss=1.7495, ppl=5.75, grad_norm=0.75, lr=1.23e-05, throughput=5658 tok/s
2025-11-26 21:32:08,765 - INFO - Epoch 1 Step 290 (Global: 8290): loss=1.7489, ppl=5.75, grad_norm=0.78, lr=1.22e-05, throughput=5657 tok/s
2025-11-26 21:33:33,701 - INFO - Epoch 1 Step 300 (Global: 8300): loss=1.6379, ppl=5.14, grad_norm=0.78, lr=1.21e-05, throughput=5651 tok/s
2025-11-26 21:34:58,542 - INFO - Epoch 1 Step 310 (Global: 8310): loss=1.7117, ppl=5.54, grad_norm=0.75, lr=1.20e-05, throughput=5658 tok/s
2025-11-26 21:36:23,558 - INFO - Epoch 1 Step 320 (Global: 8320): loss=1.5073, ppl=4.51, grad_norm=0.72, lr=1.18e-05, throughput=5646 tok/s
2025-11-26 21:37:48,302 - INFO - Epoch 1 Step 330 (Global: 8330): loss=2.0423, ppl=7.71, grad_norm=0.81, lr=1.17e-05, throughput=5664 tok/s
2025-11-26 21:39:13,566 - INFO - Epoch 1 Step 340 (Global: 8340): loss=1.4762, ppl=4.38, grad_norm=0.75, lr=1.16e-05, throughput=5630 tok/s
2025-11-26 21:40:38,593 - INFO - Epoch 1 Step 350 (Global: 8350): loss=1.7298, ppl=5.64, grad_norm=0.78, lr=1.15e-05, throughput=5645 tok/s
2025-11-26 21:42:03,866 - INFO - Epoch 1 Step 360 (Global: 8360): loss=1.7675, ppl=5.86, grad_norm=0.76, lr=1.14e-05, throughput=5629 tok/s
2025-11-26 21:43:29,186 - INFO - Epoch 1 Step 370 (Global: 8370): loss=1.4863, ppl=4.42, grad_norm=0.71, lr=1.13e-05, throughput=5626 tok/s
2025-11-26 21:44:54,431 - INFO - Epoch 1 Step 380 (Global: 8380): loss=1.6981, ppl=5.46, grad_norm=0.79, lr=1.12e-05, throughput=5631 tok/s
2025-11-26 21:46:19,501 - INFO - Epoch 1 Step 390 (Global: 8390): loss=1.4841, ppl=4.41, grad_norm=0.71, lr=1.11e-05, throughput=5642 tok/s
2025-11-26 21:47:44,732 - INFO - Epoch 1 Step 400 (Global: 8400): loss=1.8415, ppl=6.31, grad_norm=0.78, lr=1.10e-05, throughput=5632 tok/s
2025-11-26 21:49:09,656 - INFO - Epoch 1 Step 410 (Global: 8410): loss=1.4925, ppl=4.45, grad_norm=0.74, lr=1.09e-05, throughput=5652 tok/s
2025-11-26 21:50:34,853 - INFO - Epoch 1 Step 420 (Global: 8420): loss=1.7460, ppl=5.73, grad_norm=0.76, lr=1.08e-05, throughput=5634 tok/s
2025-11-26 21:51:59,463 - INFO - Epoch 1 Step 430 (Global: 8430): loss=1.3862, ppl=4.00, grad_norm=0.71, lr=1.07e-05, throughput=5673 tok/s
2025-11-26 21:53:24,504 - INFO - Epoch 1 Step 440 (Global: 8440): loss=1.5567, ppl=4.74, grad_norm=0.75, lr=1.06e-05, throughput=5644 tok/s
2025-11-26 21:54:49,510 - INFO - Epoch 1 Step 450 (Global: 8450): loss=1.5543, ppl=4.73, grad_norm=0.72, lr=1.05e-05, throughput=5647 tok/s
2025-11-26 21:56:15,556 - INFO - Epoch 1 Step 460 (Global: 8460): loss=1.6978, ppl=5.46, grad_norm=0.85, lr=1.04e-05, throughput=5579 tok/s
2025-11-26 21:57:40,769 - INFO - Epoch 1 Step 470 (Global: 8470): loss=1.3808, ppl=3.98, grad_norm=0.72, lr=1.03e-05, throughput=5633 tok/s
2025-11-26 21:59:06,232 - INFO - Epoch 1 Step 480 (Global: 8480): loss=1.7441, ppl=5.72, grad_norm=0.75, lr=1.02e-05, throughput=5617 tok/s
2025-11-26 22:00:31,579 - INFO - Epoch 1 Step 490 (Global: 8490): loss=1.6410, ppl=5.16, grad_norm=0.73, lr=1.01e-05, throughput=5624 tok/s
2025-11-26 22:01:57,143 - INFO - Epoch 1 Step 500 (Global: 8500): loss=1.6033, ppl=4.97, grad_norm=0.75, lr=9.96e-06, throughput=5610 tok/s
2025-11-26 22:03:23,038 - INFO - Epoch 1 Step 510 (Global: 8510): loss=1.5717, ppl=4.81, grad_norm=0.76, lr=9.86e-06, throughput=5588 tok/s
2025-11-26 22:04:49,558 - INFO - Epoch 1 Step 520 (Global: 8520): loss=1.8225, ppl=6.19, grad_norm=0.82, lr=9.76e-06, throughput=5548 tok/s
2025-11-26 22:06:15,088 - INFO - Epoch 1 Step 530 (Global: 8530): loss=1.6096, ppl=5.00, grad_norm=0.77, lr=9.67e-06, throughput=5612 tok/s
2025-11-26 22:07:40,229 - INFO - Epoch 1 Step 540 (Global: 8540): loss=1.5842, ppl=4.88, grad_norm=0.73, lr=9.57e-06, throughput=5638 tok/s
2025-11-26 22:09:07,277 - INFO - Epoch 1 Step 550 (Global: 8550): loss=1.2977, ppl=3.66, grad_norm=0.68, lr=9.47e-06, throughput=5514 tok/s
2025-11-26 22:10:32,333 - INFO - Epoch 1 Step 560 (Global: 8560): loss=1.8438, ppl=6.32, grad_norm=0.77, lr=9.37e-06, throughput=5643 tok/s
2025-11-26 22:11:57,831 - INFO - Epoch 1 Step 570 (Global: 8570): loss=1.5718, ppl=4.82, grad_norm=0.75, lr=9.27e-06, throughput=5614 tok/s
2025-11-26 22:13:23,684 - INFO - Epoch 1 Step 580 (Global: 8580): loss=1.4600, ppl=4.31, grad_norm=0.73, lr=9.18e-06, throughput=5591 tok/s
2025-11-26 22:14:49,361 - INFO - Epoch 1 Step 590 (Global: 8590): loss=1.3806, ppl=3.98, grad_norm=0.68, lr=9.08e-06, throughput=5602 tok/s
2025-11-26 22:16:14,673 - INFO - Epoch 1 Step 600 (Global: 8600): loss=1.5684, ppl=4.80, grad_norm=0.74, lr=8.98e-06, throughput=5626 tok/s
2025-11-26 22:17:40,332 - INFO - Epoch 1 Step 610 (Global: 8610): loss=1.7098, ppl=5.53, grad_norm=0.77, lr=8.89e-06, throughput=5604 tok/s
2025-11-26 22:19:05,667 - INFO - Epoch 1 Step 620 (Global: 8620): loss=1.6319, ppl=5.11, grad_norm=0.80, lr=8.79e-06, throughput=5625 tok/s
2025-11-26 22:20:31,132 - INFO - Epoch 1 Step 630 (Global: 8630): loss=1.8371, ppl=6.28, grad_norm=0.77, lr=8.70e-06, throughput=5616 tok/s
2025-11-26 22:21:56,472 - INFO - Epoch 1 Step 640 (Global: 8640): loss=1.6559, ppl=5.24, grad_norm=0.74, lr=8.60e-06, throughput=5625 tok/s
2025-11-26 22:23:21,934 - INFO - Epoch 1 Step 650 (Global: 8650): loss=1.5944, ppl=4.93, grad_norm=0.71, lr=8.51e-06, throughput=5617 tok/s
2025-11-26 22:24:47,089 - INFO - Epoch 1 Step 660 (Global: 8660): loss=1.5233, ppl=4.59, grad_norm=0.73, lr=8.42e-06, throughput=5637 tok/s
2025-11-26 22:26:12,723 - INFO - Epoch 1 Step 670 (Global: 8670): loss=1.6880, ppl=5.41, grad_norm=0.77, lr=8.32e-06, throughput=5605 tok/s
2025-11-26 22:27:38,087 - INFO - Epoch 1 Step 680 (Global: 8680): loss=1.6277, ppl=5.09, grad_norm=0.76, lr=8.23e-06, throughput=5623 tok/s
2025-11-26 22:29:03,321 - INFO - Epoch 1 Step 690 (Global: 8690): loss=1.4604, ppl=4.31, grad_norm=0.75, lr=8.14e-06, throughput=5632 tok/s
2025-11-26 22:30:28,448 - INFO - Epoch 1 Step 700 (Global: 8700): loss=1.5887, ppl=4.90, grad_norm=0.76, lr=8.05e-06, throughput=5639 tok/s
2025-11-26 22:31:53,203 - INFO - Epoch 1 Step 710 (Global: 8710): loss=1.5411, ppl=4.67, grad_norm=0.73, lr=7.96e-06, throughput=5663 tok/s
2025-11-26 22:33:18,130 - INFO - Epoch 1 Step 720 (Global: 8720): loss=1.6024, ppl=4.97, grad_norm=0.73, lr=7.87e-06, throughput=5652 tok/s
2025-11-26 22:34:43,332 - INFO - Epoch 1 Step 730 (Global: 8730): loss=1.3351, ppl=3.80, grad_norm=0.67, lr=7.78e-06, throughput=5634 tok/s
2025-11-26 22:36:08,356 - INFO - Epoch 1 Step 740 (Global: 8740): loss=1.7768, ppl=5.91, grad_norm=0.75, lr=7.69e-06, throughput=5646 tok/s
2025-11-26 22:37:33,464 - INFO - Epoch 1 Step 750 (Global: 8750): loss=1.6043, ppl=4.97, grad_norm=0.75, lr=7.60e-06, throughput=5640 tok/s
2025-11-26 22:38:58,975 - INFO - Epoch 1 Step 760 (Global: 8760): loss=1.5632, ppl=4.77, grad_norm=0.74, lr=7.51e-06, throughput=5613 tok/s
2025-11-26 22:40:24,702 - INFO - Epoch 1 Step 770 (Global: 8770): loss=1.7290, ppl=5.63, grad_norm=0.74, lr=7.42e-06, throughput=5599 tok/s
2025-11-26 22:41:49,921 - INFO - Epoch 1 Step 780 (Global: 8780): loss=1.5936, ppl=4.92, grad_norm=0.73, lr=7.33e-06, throughput=5633 tok/s
2025-11-26 22:43:15,676 - INFO - Epoch 1 Step 790 (Global: 8790): loss=1.6572, ppl=5.24, grad_norm=0.77, lr=7.25e-06, throughput=5597 tok/s
2025-11-26 22:44:40,934 - INFO - Epoch 1 Step 800 (Global: 8800): loss=1.4783, ppl=4.39, grad_norm=0.73, lr=7.16e-06, throughput=5630 tok/s
2025-11-26 22:46:06,012 - INFO - Epoch 1 Step 810 (Global: 8810): loss=1.6520, ppl=5.22, grad_norm=0.73, lr=7.07e-06, throughput=5642 tok/s
2025-11-26 22:47:31,259 - INFO - Epoch 1 Step 820 (Global: 8820): loss=1.5957, ppl=4.93, grad_norm=0.75, lr=6.99e-06, throughput=5631 tok/s
2025-11-26 22:48:56,159 - INFO - Epoch 1 Step 830 (Global: 8830): loss=1.7374, ppl=5.68, grad_norm=0.76, lr=6.90e-06, throughput=5654 tok/s
2025-11-26 22:50:21,281 - INFO - Epoch 1 Step 840 (Global: 8840): loss=1.3788, ppl=3.97, grad_norm=0.70, lr=6.82e-06, throughput=5639 tok/s
2025-11-26 22:51:46,379 - INFO - Epoch 1 Step 850 (Global: 8850): loss=1.8737, ppl=6.51, grad_norm=0.79, lr=6.74e-06, throughput=5641 tok/s
2025-11-26 22:53:11,351 - INFO - Epoch 1 Step 860 (Global: 8860): loss=1.7718, ppl=5.88, grad_norm=0.78, lr=6.65e-06, throughput=5649 tok/s
2025-11-26 22:54:36,996 - INFO - Epoch 1 Step 870 (Global: 8870): loss=1.7481, ppl=5.74, grad_norm=0.76, lr=6.57e-06, throughput=5605 tok/s
2025-11-26 22:56:02,118 - INFO - Epoch 1 Step 880 (Global: 8880): loss=1.6870, ppl=5.40, grad_norm=0.73, lr=6.49e-06, throughput=5639 tok/s
2025-11-26 22:57:27,190 - INFO - Epoch 1 Step 890 (Global: 8890): loss=1.4810, ppl=4.40, grad_norm=0.74, lr=6.40e-06, throughput=5642 tok/s
2025-11-26 22:58:52,909 - INFO - Epoch 1 Step 900 (Global: 8900): loss=1.6118, ppl=5.01, grad_norm=0.75, lr=6.32e-06, throughput=5600 tok/s
2025-11-26 23:00:18,402 - INFO - Epoch 1 Step 910 (Global: 8910): loss=1.6649, ppl=5.29, grad_norm=0.74, lr=6.24e-06, throughput=5615 tok/s
2025-11-26 23:01:43,895 - INFO - Epoch 1 Step 920 (Global: 8920): loss=1.4724, ppl=4.36, grad_norm=0.73, lr=6.16e-06, throughput=5615 tok/s
2025-11-26 23:03:09,051 - INFO - Epoch 1 Step 930 (Global: 8930): loss=1.4034, ppl=4.07, grad_norm=0.72, lr=6.08e-06, throughput=5637 tok/s
2025-11-26 23:04:34,237 - INFO - Epoch 1 Step 940 (Global: 8940): loss=1.7816, ppl=5.94, grad_norm=0.74, lr=6.00e-06, throughput=5635 tok/s
2025-11-26 23:05:59,590 - INFO - Epoch 1 Step 950 (Global: 8950): loss=1.4301, ppl=4.18, grad_norm=0.69, lr=5.92e-06, throughput=5624 tok/s
2025-11-26 23:07:24,862 - INFO - Epoch 1 Step 960 (Global: 8960): loss=1.4595, ppl=4.30, grad_norm=0.72, lr=5.84e-06, throughput=5629 tok/s
2025-11-26 23:08:50,386 - INFO - Epoch 1 Step 970 (Global: 8970): loss=1.7122, ppl=5.54, grad_norm=0.76, lr=5.76e-06, throughput=5613 tok/s
2025-11-26 23:10:15,816 - INFO - Epoch 1 Step 980 (Global: 8980): loss=1.8096, ppl=6.11, grad_norm=0.78, lr=5.68e-06, throughput=5619 tok/s
2025-11-26 23:11:41,297 - INFO - Epoch 1 Step 990 (Global: 8990): loss=1.6254, ppl=5.08, grad_norm=0.74, lr=5.61e-06, throughput=5615 tok/s
2025-11-26 23:13:06,330 - INFO - Epoch 1 Step 1000 (Global: 9000): loss=1.8182, ppl=6.16, grad_norm=0.75, lr=5.53e-06, throughput=5645 tok/s
2025-11-26 23:14:31,565 - INFO - Epoch 1 Step 1010 (Global: 9010): loss=1.4094, ppl=4.09, grad_norm=0.72, lr=5.45e-06, throughput=5632 tok/s
2025-11-26 23:15:57,058 - INFO - Epoch 1 Step 1020 (Global: 9020): loss=1.6192, ppl=5.05, grad_norm=0.74, lr=5.38e-06, throughput=5615 tok/s
2025-11-26 23:17:22,556 - INFO - Epoch 1 Step 1030 (Global: 9030): loss=1.6475, ppl=5.19, grad_norm=0.73, lr=5.30e-06, throughput=5614 tok/s
2025-11-26 23:18:47,553 - INFO - Epoch 1 Step 1040 (Global: 9040): loss=1.7873, ppl=5.97, grad_norm=0.78, lr=5.23e-06, throughput=5647 tok/s
2025-11-26 23:20:12,861 - INFO - Epoch 1 Step 1050 (Global: 9050): loss=1.6791, ppl=5.36, grad_norm=0.74, lr=5.15e-06, throughput=5627 tok/s
2025-11-26 23:21:38,205 - INFO - Epoch 1 Step 1060 (Global: 9060): loss=1.7448, ppl=5.72, grad_norm=0.77, lr=5.08e-06, throughput=5624 tok/s
2025-11-26 23:23:03,354 - INFO - Epoch 1 Step 1070 (Global: 9070): loss=1.7018, ppl=5.48, grad_norm=0.76, lr=5.01e-06, throughput=5637 tok/s
2025-11-26 23:24:28,752 - INFO - Epoch 1 Step 1080 (Global: 9080): loss=1.6771, ppl=5.35, grad_norm=0.75, lr=4.93e-06, throughput=5621 tok/s
2025-11-26 23:25:54,470 - INFO - Epoch 1 Step 1090 (Global: 9090): loss=1.7590, ppl=5.81, grad_norm=0.77, lr=4.86e-06, throughput=5600 tok/s
2025-11-26 23:27:19,772 - INFO - Epoch 1 Step 1100 (Global: 9100): loss=1.6589, ppl=5.25, grad_norm=0.73, lr=4.79e-06, throughput=5627 tok/s
2025-11-26 23:28:44,930 - INFO - Epoch 1 Step 1110 (Global: 9110): loss=1.5364, ppl=4.65, grad_norm=0.70, lr=4.72e-06, throughput=5637 tok/s
2025-11-26 23:30:10,149 - INFO - Epoch 1 Step 1120 (Global: 9120): loss=1.4065, ppl=4.08, grad_norm=0.71, lr=4.65e-06, throughput=5633 tok/s
2025-11-26 23:31:35,384 - INFO - Epoch 1 Step 1130 (Global: 9130): loss=1.4872, ppl=4.42, grad_norm=0.77, lr=4.58e-06, throughput=5632 tok/s
2025-11-26 23:33:00,642 - INFO - Epoch 1 Step 1140 (Global: 9140): loss=1.7200, ppl=5.58, grad_norm=0.75, lr=4.51e-06, throughput=5630 tok/s
2025-11-26 23:34:25,961 - INFO - Epoch 1 Step 1150 (Global: 9150): loss=1.6824, ppl=5.38, grad_norm=0.77, lr=4.44e-06, throughput=5626 tok/s
2025-11-26 23:35:51,197 - INFO - Epoch 1 Step 1160 (Global: 9160): loss=1.4691, ppl=4.35, grad_norm=0.72, lr=4.37e-06, throughput=5631 tok/s
2025-11-26 23:37:16,546 - INFO - Epoch 1 Step 1170 (Global: 9170): loss=1.2970, ppl=3.66, grad_norm=0.68, lr=4.30e-06, throughput=5624 tok/s
2025-11-26 23:38:41,606 - INFO - Epoch 1 Step 1180 (Global: 9180): loss=1.6496, ppl=5.20, grad_norm=0.79, lr=4.23e-06, throughput=5643 tok/s
2025-11-26 23:40:06,779 - INFO - Epoch 1 Step 1190 (Global: 9190): loss=1.5311, ppl=4.62, grad_norm=0.73, lr=4.17e-06, throughput=5636 tok/s
2025-11-26 23:41:31,896 - INFO - Epoch 1 Step 1200 (Global: 9200): loss=1.5790, ppl=4.85, grad_norm=0.75, lr=4.10e-06, throughput=5639 tok/s
2025-11-26 23:42:57,071 - INFO - Epoch 1 Step 1210 (Global: 9210): loss=1.6305, ppl=5.11, grad_norm=0.76, lr=4.03e-06, throughput=5636 tok/s
2025-11-26 23:44:22,364 - INFO - Epoch 1 Step 1220 (Global: 9220): loss=1.5104, ppl=4.53, grad_norm=0.74, lr=3.97e-06, throughput=5628 tok/s
2025-11-26 23:45:47,516 - INFO - Epoch 1 Step 1230 (Global: 9230): loss=1.6247, ppl=5.08, grad_norm=0.74, lr=3.90e-06, throughput=5637 tok/s
2025-11-26 23:47:12,600 - INFO - Epoch 1 Step 1240 (Global: 9240): loss=1.5780, ppl=4.85, grad_norm=0.71, lr=3.84e-06, throughput=5642 tok/s
2025-11-26 23:48:37,728 - INFO - Epoch 1 Step 1250 (Global: 9250): loss=1.7499, ppl=5.75, grad_norm=0.75, lr=3.77e-06, throughput=5639 tok/s
2025-11-26 23:50:02,853 - INFO - Epoch 1 Step 1260 (Global: 9260): loss=1.6167, ppl=5.04, grad_norm=0.75, lr=3.71e-06, throughput=5639 tok/s
2025-11-26 23:51:27,928 - INFO - Epoch 1 Step 1270 (Global: 9270): loss=1.7025, ppl=5.49, grad_norm=0.75, lr=3.65e-06, throughput=5642 tok/s
2025-11-26 23:52:53,060 - INFO - Epoch 1 Step 1280 (Global: 9280): loss=1.6551, ppl=5.23, grad_norm=0.76, lr=3.58e-06, throughput=5638 tok/s
2025-11-26 23:54:18,223 - INFO - Epoch 1 Step 1290 (Global: 9290): loss=1.6359, ppl=5.13, grad_norm=0.77, lr=3.52e-06, throughput=5636 tok/s
2025-11-26 23:55:43,373 - INFO - Epoch 1 Step 1300 (Global: 9300): loss=1.5904, ppl=4.91, grad_norm=0.70, lr=3.46e-06, throughput=5637 tok/s
2025-11-26 23:57:08,224 - INFO - Epoch 1 Step 1310 (Global: 9310): loss=1.5292, ppl=4.61, grad_norm=0.73, lr=3.40e-06, throughput=5657 tok/s
2025-11-26 23:58:33,294 - INFO - Epoch 1 Step 1320 (Global: 9320): loss=1.8017, ppl=6.06, grad_norm=0.77, lr=3.34e-06, throughput=5642 tok/s
2025-11-26 23:59:58,315 - INFO - Epoch 1 Step 1330 (Global: 9330): loss=1.4376, ppl=4.21, grad_norm=0.66, lr=3.28e-06, throughput=5646 tok/s
2025-11-27 00:01:23,632 - INFO - Epoch 1 Step 1340 (Global: 9340): loss=1.6803, ppl=5.37, grad_norm=0.75, lr=3.22e-06, throughput=5626 tok/s
2025-11-27 00:02:48,711 - INFO - Epoch 1 Step 1350 (Global: 9350): loss=1.5311, ppl=4.62, grad_norm=0.73, lr=3.16e-06, throughput=5642 tok/s
2025-11-27 00:04:13,803 - INFO - Epoch 1 Step 1360 (Global: 9360): loss=1.5838, ppl=4.87, grad_norm=0.74, lr=3.10e-06, throughput=5641 tok/s
2025-11-27 00:05:38,926 - INFO - Epoch 1 Step 1370 (Global: 9370): loss=1.7932, ppl=6.01, grad_norm=0.78, lr=3.05e-06, throughput=5639 tok/s
2025-11-27 00:07:04,202 - INFO - Epoch 1 Step 1380 (Global: 9380): loss=1.6703, ppl=5.31, grad_norm=0.73, lr=2.99e-06, throughput=5629 tok/s
2025-11-27 00:08:29,058 - INFO - Epoch 1 Step 1390 (Global: 9390): loss=1.6097, ppl=5.00, grad_norm=0.75, lr=2.93e-06, throughput=5657 tok/s
2025-11-27 00:09:54,227 - INFO - Epoch 1 Step 1400 (Global: 9400): loss=1.7657, ppl=5.85, grad_norm=0.77, lr=2.88e-06, throughput=5636 tok/s
2025-11-27 00:11:19,189 - INFO - Epoch 1 Step 1410 (Global: 9410): loss=1.6739, ppl=5.33, grad_norm=0.74, lr=2.82e-06, throughput=5650 tok/s
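The `lr` values in the step logs are consistent with a linear-warmup plus cosine-decay schedule using the logged `warmup_steps=1041` and `total_steps=10417` and the configured peak learning rate of 1e-4. A sketch of that shape (an assumption about the scheduler, not confirmed by the log itself):

```python
import math

# Sketch of linear warmup followed by cosine decay to zero. The peak lr,
# warmup_steps, and total_steps come from the log; the cosine shape is assumed.
def lr_at(step: int, peak: float = 1e-4, warmup: int = 1041, total: int = 10417) -> float:
    if step < warmup:
        return peak * step / warmup  # linear ramp up to peak
    progress = (step - warmup) / (total - warmup)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

print(f"{lr_at(8010):.2e}")  # ~1.54e-05, matching global step 8010 in the log
```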
2025-11-27 00:12:44,358 - INFO - Epoch 1 Step 1420 (Global: 9420): loss=1.4409, ppl=4.22, grad_norm=0.70, lr=2.76e-06, throughput=5636 tok/s
2025-11-27 00:14:09,756 - INFO - Epoch 1 Step 1430 (Global: 9430): loss=1.5605, ppl=4.76, grad_norm=0.71, lr=2.71e-06, throughput=5621 tok/s
2025-11-27 00:15:35,330 - INFO - Epoch 1 Step 1440 (Global: 9440): loss=1.6209, ppl=5.06, grad_norm=0.74, lr=2.66e-06, throughput=5609 tok/s
2025-11-27 00:17:00,665 - INFO - Epoch 1 Step 1450 (Global: 9450): loss=1.7848, ppl=5.96, grad_norm=0.80, lr=2.60e-06, throughput=5625 tok/s
2025-11-27 00:18:25,986 - INFO - Epoch 1 Step 1460 (Global: 9460): loss=1.6062, ppl=4.98, grad_norm=0.70, lr=2.55e-06, throughput=5626 tok/s
2025-11-27 00:19:51,690 - INFO - Epoch 1 Step 1470 (Global: 9470): loss=1.4450, ppl=4.24, grad_norm=0.71, lr=2.50e-06, throughput=5601 tok/s
2025-11-27 00:21:17,068 - INFO - Epoch 1 Step 1480 (Global: 9480): loss=1.7105, ppl=5.53, grad_norm=0.79, lr=2.44e-06, throughput=5622 tok/s
2025-11-27 00:22:42,768 - INFO - Epoch 1 Step 1490 (Global: 9490): loss=1.6780, ppl=5.35, grad_norm=0.74, lr=2.39e-06, throughput=5601 tok/s
2025-11-27 00:24:08,352 - INFO - Epoch 1 Step 1500 (Global: 9500): loss=1.7352, ppl=5.67, grad_norm=0.78, lr=2.34e-06, throughput=5609 tok/s
2025-11-27 00:25:33,529 - INFO - Epoch 1 Step 1510 (Global: 9510): loss=1.6719, ppl=5.32, grad_norm=0.73, lr=2.29e-06, throughput=5635 tok/s
2025-11-27 00:26:58,555 - INFO - Epoch 1 Step 1520 (Global: 9520): loss=1.6231, ppl=5.07, grad_norm=0.76, lr=2.24e-06, throughput=5645 tok/s
2025-11-27 00:28:23,936 - INFO - Epoch 1 Step 1530 (Global: 9530): loss=1.6721, ppl=5.32, grad_norm=0.74, lr=2.19e-06, throughput=5622 tok/s
2025-11-27 00:29:49,095 - INFO - Epoch 1 Step 1540 (Global: 9540): loss=1.7454, ppl=5.73, grad_norm=0.74, lr=2.14e-06, throughput=5637 tok/s
2025-11-27 00:31:14,307 - INFO - Epoch 1 Step 1550 (Global: 9550): loss=1.4295, ppl=4.18, grad_norm=0.73, lr=2.10e-06, throughput=5633 tok/s
2025-11-27 00:32:39,590 - INFO - Epoch 1 Step 1560 (Global: 9560): loss=1.5719, ppl=4.82, grad_norm=0.72, lr=2.05e-06, throughput=5628 tok/s
2025-11-27 00:34:04,841 - INFO - Epoch 1 Step 1570 (Global: 9570): loss=1.6448, ppl=5.18, grad_norm=0.75, lr=2.00e-06, throughput=5630 tok/s
2025-11-27 00:35:30,310 - INFO - Epoch 1 Step 1580 (Global: 9580): loss=1.8504, ppl=6.36, grad_norm=0.80, lr=1.95e-06, throughput=5616 tok/s
2025-11-27 00:36:55,414 - INFO - Epoch 1 Step 1590 (Global: 9590): loss=1.6490, ppl=5.20, grad_norm=0.73, lr=1.91e-06, throughput=5640 tok/s
2025-11-27 00:38:20,592 - INFO - Epoch 1 Step 1600 (Global: 9600): loss=1.6654, ppl=5.29, grad_norm=0.76, lr=1.86e-06, throughput=5635 tok/s
2025-11-27 00:39:45,520 - INFO - Epoch 1 Step 1610 (Global: 9610): loss=1.3434, ppl=3.83, grad_norm=0.67, lr=1.82e-06, throughput=5652 tok/s
2025-11-27 00:41:10,913 - INFO - Epoch 1 Step 1620 (Global: 9620): loss=1.8059, ppl=6.09, grad_norm=0.76, lr=1.77e-06, throughput=5621 tok/s
2025-11-27 00:42:36,205 - INFO - Epoch 1 Step 1630 (Global: 9630): loss=1.6062, ppl=4.98, grad_norm=0.73, lr=1.73e-06, throughput=5628 tok/s
2025-11-27 00:44:01,436 - INFO - Epoch 1 Step 1640 (Global: 9640): loss=1.6089, ppl=5.00, grad_norm=0.74, lr=1.68e-06, throughput=5632 tok/s
2025-11-27 00:45:26,584 - INFO - Epoch 1 Step 1650 (Global: 9650): loss=1.4486, ppl=4.26, grad_norm=0.73, lr=1.64e-06, throughput=5637 tok/s
2025-11-27 00:46:51,676 - INFO - Epoch 1 Step 1660 (Global: 9660): loss=1.5340, ppl=4.64, grad_norm=0.72, lr=1.60e-06, throughput=5641 tok/s
2025-11-27 00:48:16,836 - INFO - Epoch 1 Step 1670 (Global: 9670): loss=1.5839, ppl=4.87, grad_norm=0.73, lr=1.56e-06, throughput=5636 tok/s
2025-11-27 00:49:41,998 - INFO - Epoch 1 Step 1680 (Global: 9680): loss=1.5081, ppl=4.52, grad_norm=0.71, lr=1.52e-06, throughput=5636 tok/s
2025-11-27 00:51:07,307 - INFO - Epoch 1 Step 1690 (Global: 9690): loss=1.7420, ppl=5.71, grad_norm=0.76, lr=1.48e-06, throughput=5627 tok/s
2025-11-27 00:52:32,524 - INFO - Epoch 1 Step 1700 (Global: 9700): loss=1.6373, ppl=5.14, grad_norm=0.71, lr=1.44e-06, throughput=5633 tok/s
2025-11-27 00:53:57,955 - INFO - Epoch 1 Step 1710 (Global: 9710): loss=1.6144, ppl=5.02, grad_norm=0.74, lr=1.40e-06, throughput=5619 tok/s
2025-11-27 00:55:23,332 - INFO - Epoch 1 Step 1720 (Global: 9720): loss=1.5657, ppl=4.79, grad_norm=0.77, lr=1.36e-06, throughput=5622 tok/s
2025-11-27 00:56:48,690 - INFO - Epoch 1 Step 1730 (Global: 9730): loss=1.6926, ppl=5.43, grad_norm=0.77, lr=1.32e-06, throughput=5623 tok/s
2025-11-27 00:58:14,299 - INFO - Epoch 1 Step 1740 (Global: 9740): loss=1.6682, ppl=5.30, grad_norm=0.76, lr=1.28e-06, throughput=5607 tok/s
2025-11-27 00:59:39,398 - INFO - Epoch 1 Step 1750 (Global: 9750): loss=1.4327, ppl=4.19, grad_norm=0.71, lr=1.24e-06, throughput=5641 tok/s
2025-11-27 01:01:04,580 - INFO - Epoch 1 Step 1760 (Global: 9760): loss=1.5217, ppl=4.58, grad_norm=0.76, lr=1.21e-06, throughput=5635 tok/s
2025-11-27 01:02:29,526 - INFO - Epoch 1 Step 1770 (Global: 9770): loss=1.3844, ppl=3.99, grad_norm=0.75, lr=1.17e-06, throughput=5651 tok/s
2025-11-27 01:03:54,726 - INFO - Epoch 1 Step 1780 (Global: 9780): loss=1.5141, ppl=4.55, grad_norm=0.73, lr=1.13e-06, throughput=5634 tok/s
2025-11-27 01:05:19,883 - INFO - Epoch 1 Step 1790 (Global: 9790): loss=1.7337, ppl=5.66, grad_norm=0.75, lr=1.10e-06, throughput=5637 tok/s
2025-11-27 01:06:45,113 - INFO - Epoch 1 Step 1800 (Global: 9800): loss=1.4745, ppl=4.37, grad_norm=0.70, lr=1.06e-06, throughput=5632 tok/s
2025-11-27 01:08:10,505 - INFO - Epoch 1 Step 1810 (Global: 9810): loss=1.4786, ppl=4.39, grad_norm=0.73, lr=1.03e-06, throughput=5621 tok/s
2025-11-27 01:09:35,481 - INFO - Epoch 1 Step 1820 (Global: 9820): loss=1.4868, ppl=4.42, grad_norm=0.73, lr=9.97e-07, throughput=5649 tok/s
2025-11-27 01:11:00,571 - INFO - Epoch 1 Step 1830 (Global: 9830): loss=1.2109, ppl=3.36, grad_norm=0.67, lr=9.64e-07, throughput=5641 tok/s
2025-11-27 01:12:25,870 - INFO - Epoch 1 Step 1840 (Global: 9840): loss=1.5829, ppl=4.87, grad_norm=0.73, lr=9.32e-07, throughput=5627 tok/s
2025-11-27 01:13:51,183 - INFO - Epoch 1 Step 1850 (Global: 9850): loss=1.7179, ppl=5.57, grad_norm=0.73, lr=9.00e-07, throughput=5626 tok/s
2025-11-27 01:15:16,627 - INFO - Epoch 1 Step 1860 (Global: 9860): loss=1.5547, ppl=4.73, grad_norm=0.76, lr=8.68e-07, throughput=5618 tok/s
2025-11-27 01:16:42,228 - INFO - Epoch 1 Step 1870 (Global: 9870): loss=1.6220, ppl=5.06, grad_norm=0.75, lr=8.37e-07, throughput=5608 tok/s
2025-11-27 01:18:07,451 - INFO - Epoch 1 Step 1880 (Global: 9880): loss=1.6538, ppl=5.23, grad_norm=0.77, lr=8.07e-07, throughput=5632 tok/s
2025-11-27 01:19:32,707 - INFO - Epoch 1 Step 1890 (Global: 9890): loss=1.5520, ppl=4.72, grad_norm=0.71, lr=7.77e-07, throughput=5630 tok/s
2025-11-27 01:20:57,720 - INFO - Epoch 1 Step 1900 (Global: 9900): loss=1.5933, ppl=4.92, grad_norm=0.75, lr=7.48e-07, throughput=5646 tok/s
2025-11-27 01:22:23,135 - INFO - Epoch 1 Step 1910 (Global: 9910): loss=1.5251, ppl=4.60, grad_norm=0.73, lr=7.20e-07, throughput=5620 tok/s
2025-11-27 01:23:48,503 - INFO - Epoch 1 Step 1920 (Global: 9920): loss=1.7180, ppl=5.57, grad_norm=0.76, lr=6.92e-07, throughput=5623 tok/s
2025-11-27 01:25:13,817 - INFO - Epoch 1 Step 1930 (Global: 9930): loss=1.7286, ppl=5.63, grad_norm=0.75, lr=6.64e-07, throughput=5626 tok/s
2025-11-27 01:26:39,319 - INFO - Epoch 1 Step 1940 (Global: 9940): loss=1.6301, ppl=5.10, grad_norm=0.74, lr=6.37e-07, throughput=5614 tok/s
2025-11-27 01:28:04,434 - INFO - Epoch 1 Step 1950 (Global: 9950): loss=1.5596, ppl=4.76, grad_norm=0.69, lr=6.11e-07, throughput=5640 tok/s
2025-11-27 01:29:29,369 - INFO - Epoch 1 Step 1960 (Global: 9960): loss=1.7693, ppl=5.87, grad_norm=0.77, lr=5.85e-07, throughput=5651 tok/s
2025-11-27 01:30:54,572 - INFO - Epoch 1 Step 1970 (Global: 9970): loss=1.5269, ppl=4.60, grad_norm=0.73, lr=5.60e-07, throughput=5634 tok/s
2025-11-27 01:32:19,621 - INFO - Epoch 1 Step 1980 (Global: 9980): loss=1.6529, ppl=5.22, grad_norm=0.75, lr=5.35e-07, throughput=5644 tok/s
2025-11-27 01:33:44,808 - INFO - Epoch 1 Step 1990 (Global: 9990): loss=1.6452, ppl=5.18, grad_norm=0.75, lr=5.11e-07, throughput=5635 tok/s
2025-11-27 01:35:09,811 - INFO - Epoch 1 Step 2000 (Global: 10000): loss=1.7482, ppl=5.74, grad_norm=1.46, lr=4.87e-07, throughput=5647 tok/s
2025-11-27 01:35:09,812 - INFO - Running validation at step 10000...
2025-11-27 01:39:10,745 - INFO - Validation loss: 1.6137, perplexity: 5.02
2025-11-27 01:39:36,248 - INFO - Saved checkpoint to outputs/production_text_ctx277_lm_20251125_003839/best_checkpoint.pt
2025-11-27 01:39:36,255 - INFO - New best validation loss: 1.6137, perplexity: 5.02
2025-11-27 01:41:01,337 - INFO - Epoch 1 Step 2010 (Global: 10010): loss=1.5134, ppl=4.54, grad_norm=0.73, lr=4.64e-07, throughput=5642 tok/s
2025-11-27 01:42:26,435 - INFO - Epoch 1 Step 2020 (Global: 10020): loss=1.6800, ppl=5.37, grad_norm=0.75, lr=4.42e-07, throughput=5641 tok/s
2025-11-27 01:43:51,941 - INFO - Epoch 1 Step 2030 (Global: 10030): loss=1.6378, ppl=5.14, grad_norm=0.73, lr=4.20e-07, throughput=5614 tok/s
2025-11-27 01:45:16,842 - INFO - Epoch 1 Step 2040 (Global: 10040): loss=1.4687, ppl=4.34, grad_norm=0.77, lr=3.98e-07, throughput=5654 tok/s
2025-11-27 01:46:42,008 - INFO - Epoch 1 Step 2050 (Global: 10050): loss=1.5434, ppl=4.68, grad_norm=0.76, lr=3.78e-07, throughput=5636 tok/s
2025-11-27 01:48:07,344 - INFO - Epoch 1 Step 2060 (Global: 10060): loss=1.3892, ppl=4.01, grad_norm=0.73, lr=3.57e-07, throughput=5625 tok/s
2025-11-27 01:49:32,533 - INFO - Epoch 1 Step 2070 (Global: 10070): loss=1.7548, ppl=5.78, grad_norm=0.74, lr=3.38e-07, throughput=5635 tok/s
2025-11-27 01:50:57,793 - INFO - Epoch 1 Step 2080 (Global: 10080): loss=1.6015, ppl=4.96, grad_norm=0.77, lr=3.18e-07, throughput=5630 tok/s
2025-11-27 01:52:22,775 - INFO - Epoch 1 Step 2090 (Global: 10090): loss=1.5762, ppl=4.84, grad_norm=0.73, lr=3.00e-07, throughput=5648 tok/s
2025-11-27 01:53:47,816 - INFO - Epoch 1 Step 2100 (Global: 10100): loss=1.7489, ppl=5.75, grad_norm=0.74, lr=2.82e-07, throughput=5644 tok/s
2025-11-27 01:55:12,777 - INFO - Epoch 1 Step 2110 (Global: 10110): loss=1.6881, ppl=5.41, grad_norm=0.76, lr=2.64e-07, throughput=5650 tok/s
2025-11-27 01:56:37,740 - INFO - Epoch 1 Step 2120 (Global: 10120): loss=1.3981, ppl=4.05, grad_norm=0.73, lr=2.47e-07, throughput=5650 tok/s
2025-11-27 01:58:03,015 - INFO - Epoch 1 Step 2130 (Global: 10130): loss=1.4423, ppl=4.23, grad_norm=0.75, lr=2.31e-07, throughput=5629 tok/s
2025-11-27 01:59:28,052 - INFO - Epoch 1 Step 2140 (Global: 10140): loss=1.5886, ppl=4.90, grad_norm=0.74, lr=2.15e-07, throughput=5645 tok/s
2025-11-27 02:00:53,241 - INFO - Epoch 1 Step 2150 (Global: 10150): loss=1.4212, ppl=4.14, grad_norm=0.71, lr=2.00e-07, throughput=5635 tok/s
2025-11-27 02:02:18,225 - INFO - Epoch 1 Step 2160 (Global: 10160): loss=1.5972, ppl=4.94, grad_norm=0.76, lr=1.85e-07, throughput=5648 tok/s
2025-11-27 02:03:43,280 - INFO - Epoch 1 Step 2170 (Global: 10170): loss=1.3200, ppl=3.74, grad_norm=0.71, lr=1.71e-07, throughput=5643 tok/s
2025-11-27 02:05:08,480 - INFO - Epoch 1 Step 2180 (Global: 10180): loss=1.6009, ppl=4.96, grad_norm=0.79, lr=1.58e-07, throughput=5634 tok/s
2025-11-27 02:06:33,733 - INFO - Epoch 1 Step 2190 (Global: 10190): loss=1.7538, ppl=5.78, grad_norm=0.78, lr=1.45e-07, throughput=5630 tok/s
2025-11-27 02:07:58,621 - INFO - Epoch 1 Step 2200 (Global: 10200): loss=1.4281, ppl=4.17, grad_norm=0.70, lr=1.32e-07, throughput=5655 tok/s
2025-11-27 02:09:23,751 - INFO - Epoch 1 Step 2210 (Global: 10210): loss=1.5379, ppl=4.65, grad_norm=0.73, lr=1.20e-07, throughput=5639 tok/s
2025-11-27 02:10:49,685 - INFO - Epoch 1 Step 2220 (Global: 10220): loss=1.6097, ppl=5.00, grad_norm=0.73, lr=1.09e-07, throughput=5586 tok/s
2025-11-27 02:12:15,364 - INFO - Epoch 1 Step 2230 (Global: 10230): loss=1.5491, ppl=4.71, grad_norm=0.71, lr=9.81e-08, throughput=5602 tok/s
2025-11-27 02:13:40,955 - INFO - Epoch 1 Step 2240 (Global: 10240): loss=1.7321, ppl=5.65, grad_norm=0.77, lr=8.79e-08, throughput=5608 tok/s
2025-11-27 02:15:06,121 - INFO - Epoch 1 Step 2250 (Global: 10250): loss=1.5506, ppl=4.71, grad_norm=0.73, lr=7.83e-08, throughput=5636 tok/s
2025-11-27 02:16:31,487 - INFO - Epoch 1 Step 2260 (Global: 10260): loss=1.3623, ppl=3.91, grad_norm=0.72, lr=6.92e-08, throughput=5623 tok/s
2025-11-27 02:17:57,238 - INFO - Epoch 1 Step 2270 (Global: 10270): loss=1.3898, ppl=4.01, grad_norm=0.69, lr=6.06e-08, throughput=5598 tok/s
2025-11-27 02:19:22,622 - INFO - Epoch 1 Step 2280 (Global: 10280): loss=1.6836, ppl=5.39, grad_norm=0.82, lr=5.27e-08, throughput=5622 tok/s
2025-11-27 02:20:48,828 - INFO - Epoch 1 Step 2290 (Global: 10290): loss=1.5457, ppl=4.69, grad_norm=0.72, lr=4.53e-08, throughput=5568 tok/s
2025-11-27 02:22:14,184 - INFO - Epoch 1 Step 2300 (Global: 10300): loss=1.6513, ppl=5.21, grad_norm=0.72, lr=3.84e-08, throughput=5624 tok/s
2025-11-27 02:23:39,270 - INFO - Epoch 1 Step 2310 (Global: 10310): loss=1.8626, ppl=6.44, grad_norm=0.86, lr=3.21e-08, throughput=5641 tok/s
2025-11-27 02:25:04,896 - INFO - Epoch 1 Step 2320 (Global: 10320): loss=1.3393, ppl=3.82, grad_norm=0.71, lr=2.64e-08, throughput=5606 tok/s
2025-11-27 02:26:30,285 - INFO - Epoch 1 Step 2330 (Global: 10330): loss=1.7108, ppl=5.53, grad_norm=0.73, lr=2.12e-08, throughput=5621 tok/s
2025-11-27 02:27:55,496 - INFO - Epoch 1 Step 2340 (Global: 10340): loss=1.5594, ppl=4.76, grad_norm=0.75, lr=1.66e-08, throughput=5633 tok/s
2025-11-27 02:29:20,675 - INFO - Epoch 1 Step 2350 (Global: 10350): loss=1.6121, ppl=5.01, grad_norm=0.71, lr=1.26e-08, throughput=5635 tok/s
2025-11-27 02:30:46,044 - INFO - Epoch 1 Step 2360 (Global: 10360): loss=1.4885, ppl=4.43, grad_norm=0.73, lr=9.12e-09, throughput=5623 tok/s
2025-11-27 02:32:11,283 - INFO - Epoch 1 Step 2370 (Global: 10370): loss=1.6003, ppl=4.95, grad_norm=0.72, lr=6.20e-09, throughput=5631 tok/s
2025-11-27 02:33:36,338 - INFO - Epoch 1 Step 2380 (Global: 10380): loss=1.7616, ppl=5.82, grad_norm=0.77, lr=3.84e-09, throughput=5643 tok/s
2025-11-27 02:35:01,245 - INFO - Epoch 1 Step 2390 (Global: 10390): loss=1.6383, ppl=5.15, grad_norm=0.78, lr=2.05e-09, throughput=5653 tok/s
2025-11-27 02:36:25,968 - INFO - Epoch 1 Step 2400 (Global: 10400): loss=1.4296, ppl=4.18, grad_norm=0.70, lr=8.11e-10, throughput=5666 tok/s
2025-11-27 02:37:51,218 - INFO - Epoch 1 Step 2410 (Global: 10410): loss=1.7089, ppl=5.52, grad_norm=0.73, lr=1.38e-10, throughput=5631 tok/s
2025-11-27 02:38:47,988 - INFO - Flushing 3 remainder batches from gradient accumulation
2025-11-27 02:38:47,989 - INFO - Rescaling gradients by 1.33x (compensating for 3/4 batches)
2025-11-27 02:38:48,256 - INFO - Remainder batch: loss=1.4945, ppl=4.46, grad_norm=0.93
2025-11-27 02:38:48,263 - INFO - Epoch 1 training: loss=1.6015, ppl=4.96, grad_norm=0.75, throughput=5557 tok/s (20874.1s total)
2025-11-27 02:38:48,270 - INFO - Running final validation...
2025-11-27 02:42:48,303 - INFO - Validation loss: 1.6137, perplexity: 5.02
2025-11-27 02:43:13,671 - INFO - Saved checkpoint to outputs/production_text_ctx277_lm_20251125_003839/best_checkpoint.pt
2025-11-27 02:43:13,677 - INFO - New best validation loss: 1.6137, perplexity: 5.02
2025-11-27 02:43:13,677 - INFO - Training complete!
2025-11-27 02:43:13,678 - INFO - Final checkpoint is best, created symlink to save space (~2GB saved)
2025-11-27 02:43:13,678 - INFO - Best validation loss: 1.6137, perplexity: 5.02
2025-11-27 02:43:13,678 - INFO - Checkpoints saved to outputs/production_text_ctx277_lm_20251125_003839
2025-11-27 02:43:14,445 - INFO - W&B run finished
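Two quantities from the end-of-run log lines above can be reproduced with a minimal sketch (hypothetical function names, not the trainer's actual code): the 1.33x gradient rescale for a partial accumulation window, and perplexity as the exponential of the mean cross-entropy loss:

```python
import math

def remainder_rescale(accum_steps: int, remainder: int) -> float:
    # With loss averaged over accum_steps micro-batches but only `remainder`
    # actually flushed, gradients are scaled back up by accum_steps / remainder.
    return accum_steps / remainder

def perplexity(loss: float) -> float:
    # Perplexity is exp of the mean per-token cross-entropy loss.
    return math.exp(loss)

print(round(remainder_rescale(4, 3), 2))  # 1.33, as logged ("3/4 batches")
print(round(perplexity(1.6137), 2))       # 5.02, matching the validation log
```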