0: W1124 00:08:17.924000 3081902 torch/distributed/run.py:792]
0: W1124 00:08:17.924000 3081902 torch/distributed/run.py:792] *****************************************
0: W1124 00:08:17.924000 3081902 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
0: W1124 00:08:17.924000 3081902 torch/distributed/run.py:792] *****************************************
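torchrun pins OMP_NUM_THREADS to 1 per worker unless the variable is already set, which is what this warning reports. A minimal sketch of one common tuning approach, splitting the node's cores across local ranks and exporting the result before launch (the cores-per-rank heuristic is an assumption for illustration, not a torchrun or Axolotl recommendation):

    # Sketch: choose an OMP_NUM_THREADS value before launching torchrun.
    # The cores-per-local-rank split is an assumed heuristic, not a default.
    import os

    def omp_threads(local_world_size: int) -> int:
        cores = os.cpu_count() or 1
        return max(1, cores // local_world_size)

    # e.g. with 4 local ranks per node; set this before spawning torchrun
    os.environ.setdefault("OMP_NUM_THREADS", str(omp_threads(4)))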
0: [2025-11-24 00:08:36,323] [INFO] [axolotl.utils.schemas.validation.check_eval_packing:119] [PID:3081979] [RANK:0] explicitly setting `eval_sample_packing` to match `sample_packing`
0: [2025-11-24 00:08:36,323] [INFO] [axolotl.utils.schemas.validation.hint_sample_packing_padding:218] [PID:3081979] [RANK:0] Setting `pad_to_sequence_len: true` to prevent memory leaks when sample_packing
0: [2025-11-24 00:08:40,005] [WARNING] [axolotl.utils.config.normalize_config:139] [PID:3081979] [RANK:0] Invalid value for save_steps (1.6666666666666667) from saves_per_epoch and/or num_epochs. Saving at training end only.
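The save_steps warning is arithmetic fallout from this run's saves_per_epoch: 1 combined with num_epochs: 0.6 (both visible in the config dump below). A minimal sketch of the derivation, assuming Axolotl turns saves_per_epoch into a fractional save interval; the formula is inferred from the logged value, not copied from axolotl.utils.config.normalize_config:

    # Sketch of the save_steps arithmetic behind the warning above;
    # the formula is an inference that reproduces the logged value.
    saves_per_epoch = 1
    num_epochs = 0.6

    save_steps = 1.0 / (saves_per_epoch * num_epochs)
    print(save_steps)  # 1.6666666666666667 -- matches the logged value

    # HF Trainer accepts save_steps as an integer step count or a ratio
    # in (0, 1); 1.666... is neither, so the run falls back to saving
    # at training end only, as the warning states.
    is_valid = (0 < save_steps < 1) or float(save_steps).is_integer()
    print(is_valid)  # False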
0: [2025-11-24 00:08:40,025] [INFO] [axolotl.cli.config.load_cfg:245] [PID:3081979] [RANK:0] config:
{
  "activation_offloading": false,
  "auto_resume_from_checkpoints": true,
  "axolotl_config_path": "/lustre/fswork/projects/rech/dgo/udv55np/train/tmp/1763939290349239182.yaml",
  "base_model": "/lustre/fswork/projects/rech/qwv/udv55np/Gemma/base/gemma-3-12b",
  "base_model_config": "/lustre/fswork/projects/rech/qwv/udv55np/Gemma/base/gemma-3-12b",
  "batch_size": 16,
  "bf16": true,
  "capabilities": {
    "bf16": true,
    "compute_capability": "sm_90",
    "fp8": false,
    "n_gpu": 16,
    "n_node": 1
  },
  "chat_template": "gemma3",
  "context_parallel_size": 1,
  "dataloader_num_workers": 16,
  "dataloader_pin_memory": true,
  "dataloader_prefetch_factor": 256,
  "dataset_prepared_path": "/lustre/fswork/projects/rech/dgo/udv55np/dataset_gemma/Nemotron-Super-49B-v1_5/split_0",
  "dataset_processes": 192,
  "datasets": [
    {
      "chat_template": "tokenizer_default",
      "data_files": [
        "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0007.jsonl",
        "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0009.jsonl",
        "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0005.jsonl",
        "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0006.jsonl",
        "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0014.jsonl",
        "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0010.jsonl",
        "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0012.jsonl",
        "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0008.jsonl",
        "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0001.jsonl",
        "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0002.jsonl",
        "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0013.jsonl",
        "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0015.jsonl",
        "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0004.jsonl",
        "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0011.jsonl",
        "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0000.jsonl",
        "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking/0003.jsonl"
      ],
      "ds_type": "json",
      "field_messages": "conversations",
      "message_property_mappings": {
        "content": "content",
        "role": "role"
      },
      "path": "/lustre/fswork/projects/rech/qwv/udv55np/dataset/ift/Nemotron-Super-49B-v1_5/no_thinking",
      "trust_remote_code": false,
      "type": "chat_template"
    }
  ],
  "ddp": true,
  "deepspeed": {
    "bf16": {
      "enabled": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false,
    "zero_optimization": {
      "contiguous_gradients": true,
      "overlap_comm": true,
      "reduce_bucket_size": "auto",
      "stage": 3,
      "stage3_gather_16bit_weights_on_model_save": true,
      "stage3_param_persistence_threshold": "auto",
      "stage3_prefetch_bucket_size": "auto",
      "sub_group_size": 0
    }
  },
  "device": "cuda:0",
  "device_map": {
    "": 0
  },
  "dion_rank_fraction": 1.0,
  "dion_rank_multiple_of": 1,
  "env_capabilities": {
    "torch_version": "2.6.0"
  },
  "eot_tokens": [
    "<end_of_turn>"
  ],
  "eval_batch_size": 1,
  "eval_causal_lm_metrics": [
    "sacrebleu",
    "comet",
    "ter",
    "chrf"
  ],
  "eval_max_new_tokens": 128,
  "eval_sample_packing": true,
  "eval_table_size": 0,
  "evals_per_epoch": 0,
  "flash_attention": true,
  "fp16": false,
  "gradient_accumulation_steps": 1,
  "gradient_checkpointing": true,
  "gradient_checkpointing_kwargs": {
    "use_reentrant": true
  },
  "is_multimodal": true,
  "learning_rate": 2e-06,
  "lisa_layers_attribute": "model.layers",
  "load_best_model_at_end": false,
  "load_in_4bit": false,
  "load_in_8bit": false,
  "local_rank": 0,
  "logging_steps": 10,
  "lora_dropout": 0.0,
  "loraplus_lr_embedding": 1e-06,
  "lr_scheduler": "warmup_stable_decay",
  "lr_scheduler_kwargs": {
    "min_lr_ratio": 0.1,
    "num_decay_steps": 200
  },
  "max_prompt_len": 512,
  "mean_resizing_embeddings": false,
  "micro_batch_size": 1,
  "model_config_type": "gemma3",
  "num_epochs": 0.6,
  "optimizer": "adamw_torch_fused",
  "output_dir": "/lustre/fswork/projects/rech/dgo/udv55np/ift/Nemotron-Super-49B-v1_5/gemma-3-12b/0",
  "pad_to_sequence_len": true,
  "pretrain_multipack_attn": true,
  "pretrain_multipack_buffer_size": 10000,
  "processor_config": "/lustre/fswork/projects/rech/qwv/udv55np/Gemma/base/gemma-3-12b",
  "profiler_steps_start": 0,
  "qlora_sharded_model_loading": false,
  "ray_num_workers": 1,
  "resources_per_worker": {
    "GPU": 1
  },
  "sample_packing": true,
  "sample_packing_bin_size": 200,
  "sample_packing_group_size": 100000,
  "save_only_model": true,
  "save_safetensors": true,
  "save_total_limit": 20,
  "saves_per_epoch": 1,
  "sequence_len": 16384,
  "shuffle_before_merging_datasets": false,
  "shuffle_merged_datasets": true,
  "skip_prepare_dataset": false,
  "strict": false,
  "tensor_parallel_size": 1,
  "tf32": false,
  "tiled_mlp_use_original_mlp": true,
  "tokenizer_config": "/lustre/fswork/projects/rech/qwv/udv55np/Gemma/base/gemma-3-27b",
  "torch_dtype": "torch.bfloat16",
  "train_on_inputs": false,
  "trl": {
    "log_completions": false,
    "mask_truncated_completions": false,
    "ref_model_mixup_alpha": 0.9,
    "ref_model_sync_steps": 64,
    "scale_rewards": true,
    "sync_ref_model": false,
    "use_vllm": false,
    "vllm_server_host": "0.0.0.0",
    "vllm_server_port": 8000
  },
  "use_ray": false,
  "use_tensorboard": true,
  "val_set_size": 0.0,
  "vllm": {
    "device": "auto",
    "dtype": "auto",
    "gpu_memory_utilization": 0.9,
    "host": "0.0.0.0",
    "port": 8000
  },
  "warmup_steps": 100,
  "weight_decay": 0.0,
  "world_size": 16
}
0: [2025-11-24 00:08:40,026] [INFO] [axolotl.cli.checks.check_user_token:35] [PID:3081979] [RANK:0] Skipping HuggingFace token verification because HF_HUB_OFFLINE is set to True. Only local files will be used.
0: [2025-11-24 00:08:41,217] [INFO] [axolotl.utils.data.shared.load_preprocessed_dataset:472] [PID:3081979] [RANK:0] Loading prepared dataset from disk at /lustre/fswork/projects/rech/dgo/udv55np/dataset_gemma/Nemotron-Super-49B-v1_5/split_0/06698e902d3dba325ca34849b1dea5ea...
0: [2025-11-24 00:09:14,927] [INFO] [axolotl.utils.samplers.multipack.calc_min_len:436] [PID:3081979] [RANK:0] gather_len_batches: [18976, 18976, 18976, 18975, 18977, 18976, 18975, 18976, 18976, 18975, 18976, 18976, 18976, 18976, 18976, 18976]
0: [2025-11-24 00:09:14,950] [INFO] [axolotl.utils.trainer.calc_sample_packing_eff_est:495] [PID:3081979] [RANK:0] sample_packing_eff_est across ranks: [0.9989354014396667, 0.9988301396369934, 0.9989880323410034, 0.9988827705383301, 0.9988827705383301, 0.9988827705383301, 0.9989354014396667, 0.9989354014396667, 0.9989354014396667, 0.9988827705383301, 0.9989354014396667, 0.9988827705383301, 0.9988827705383301, 0.9988827705383301, 0.9988827705383301, 0.9989354014396667]
0: [2025-11-24 00:09:14,959] [INFO] [axolotl.utils.data.sft._prepare_standard_dataset:127] [PID:3081979] [RANK:0] Maximum number of steps set at 711
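The 711-step cap is consistent with the gathered packed-batch counts: taking the minimum across ranks (18975), splitting it over the 16 ranks, and scaling by num_epochs: 0.6 reproduces the logged value. A sketch, assuming this is roughly how the estimate composes; the exact logic in axolotl.utils.trainer is not quoted here:

    import math

    # Reconstruction of "Maximum number of steps set at 711" from the
    # logged numbers; the composition below is inferred, not Axolotl's code.
    min_packed_batches = 18975        # min of gather_len_batches across ranks
    world_size = 16
    num_epochs = 0.6
    gradient_accumulation_steps = 1

    steps_per_epoch = min_packed_batches // world_size   # 1185
    max_steps = math.floor(steps_per_epoch * num_epochs / gradient_accumulation_steps)
    print(max_steps)  # 711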
0: Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
0: [2025-11-24 00:09:22,718] [INFO] [axolotl.monkeypatch.transformers.trainer_loss_calc.patch_evaluation_loop:110] [PID:3081979] [RANK:0] Patched Trainer.evaluation_loop with nanmean loss calculation
0: [2025-11-24 00:09:22,719] [INFO] [axolotl.monkeypatch.transformers.trainer_loss_calc.patch_maybe_log_save_evaluate:164] [PID:3081979] [RANK:0] Patched Trainer._maybe_log_save_evaluate with nanmean loss calculation
3: Loading checkpoint shards:   0%|          | 0/5 [00:00