INFO 10-21 08:52:02 [__init__.py:225] Automatically detected platform cuda.
[2025-10-21 08:52:06] INFO __main__.py:429: Passed `--trust_remote_code`, setting environment variable `HF_DATASETS_TRUST_REMOTE_CODE=true`
[2025-10-21 08:52:06] INFO __main__.py:446: Selected Tasks: ['gsm8k']
[2025-10-21 08:52:06] INFO evaluator.py:202: Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
[2025-10-21 08:52:06] INFO evaluator.py:240: Initializing vllm model, with arguments: {'pretrained': '/mnt/nvme2/eldar/for_nvidia/calib1024_damp0.07_obsmse_symTrue', 'tensor_parallel_size': 1, 'trust_remote_code': True}
INFO 10-21 08:52:06 [utils.py:243] non-default args: {'trust_remote_code': True, 'seed': 1234, 'disable_log_stats': True, 'model': '/mnt/nvme2/eldar/for_nvidia/calib1024_damp0.07_obsmse_symTrue'}
INFO 10-21 08:52:06 [model.py:663] Resolved architecture: NemotronHForCausalLM
INFO 10-21 08:52:06 [model.py:1751] Using max model len 131072
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
INFO 10-21 08:52:07 [scheduler.py:225] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 10-21 08:52:07 [config.py:324] Disabling cascade attention since it is not supported for hybrid models.
INFO 10-21 08:52:07 [config.py:440] Setting attention block size to 672 tokens to ensure that attention page size is >= mamba page size.
INFO 10-21 08:52:07 [config.py:464] Padding mamba page size by 2.13% to ensure that mamba page size and attention page size are exactly equal.
(EngineCore_DP0 pid=1751394) INFO 10-21 08:52:08 [core.py:730] Waiting for init message from front-end.
(EngineCore_DP0 pid=1751394) INFO 10-21 08:52:08 [core.py:97] Initializing a V1 LLM engine (v0.11.1rc2.dev191+g80e945298) with config: model='/mnt/nvme2/eldar/for_nvidia/calib1024_damp0.07_obsmse_symTrue', speculative_config=None, tokenizer='/mnt/nvme2/eldar/for_nvidia/calib1024_damp0.07_obsmse_symTrue', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=1234, served_model_name=/mnt/nvme2/eldar/for_nvidia/calib1024_damp0.07_obsmse_symTrue, enable_prefix_caching=False, chunked_prefill_enabled=True, pooler_config=None, compilation_config={'level': None, 'mode': 3, 'debug_dump_path': None, 'cache_dir': '', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention', 'vllm::sparse_attn_indexer'], 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': , 'use_cudagraph': True, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [512, 504, 496, 488, 480, 472, 464, 456, 448, 440, 432, 424, 416, 408, 400, 392, 384, 376, 368, 360, 352, 344, 336, 328, 320, 312, 304, 296, 288, 280, 272, 264, 256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], 'cudagraph_copy_inputs': False, 'full_cuda_graph': True, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_capture_size': 512, 'local_cache_dir': None}
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=1751394) INFO 10-21 08:52:10 [parallel_state.py:1325] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=1751394) INFO 10-21 08:52:12 [gpu_model_runner.py:2860] Starting to load model /mnt/nvme2/eldar/for_nvidia/calib1024_damp0.07_obsmse_symTrue...
(EngineCore_DP0 pid=1751394) INFO 10-21 08:52:12 [compressed_tensors_wNa16.py:108] Using MacheteLinearKernel for CompressedTensorsWNA16
(EngineCore_DP0 pid=1751394) INFO 10-21 08:52:12 [compressed_tensors_wNa16.py:108] Using MarlinLinearKernel for CompressedTensorsWNA16
(EngineCore_DP0 pid=1751394) INFO 10-21 08:52:12 [cuda.py:403] Using Flash Attention backend on V1 engine.
(EngineCore_DP0 pid=1751394) Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00
', '<|im_end|>'], 'do_sample': False, 'temperature': 0.0}
[2025-10-21 08:52:45] WARNING evaluator.py:324: Overwriting default num_fewshot of gsm8k from 5 to 5
[2025-10-21 08:52:45] INFO task.py:434: Building contexts for gsm8k on rank 0...
  0%| | 0/1319 [00:00
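For reference, below is a minimal sketch of how an equivalent run could be launched through lm-evaluation-harness's Python API instead of the CLI. The model path, `tensor_parallel_size`, `trust_remote_code`, the gsm8k task, and `num_fewshot=5` are taken from the log above; everything else (and the exact command-line flags originally used, which the log does not show) is an assumption.

```python
# Hedged sketch: reproduce the evaluation shown in the log with lm-evaluation-harness.
# model_args, tasks, and num_fewshot mirror the log above; other settings are assumed.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",  # vLLM backend, as in "Initializing vllm model" above
    model_args=(
        "pretrained=/mnt/nvme2/eldar/for_nvidia/calib1024_damp0.07_obsmse_symTrue,"
        "tensor_parallel_size=1,"
        "trust_remote_code=True"
    ),
    tasks=["gsm8k"],
    num_fewshot=5,
)

# Per-task metrics (e.g. exact_match for gsm8k) are reported under results["results"].
print(results["results"]["gsm8k"])
```

The corresponding CLI call was presumably along the lines of `lm_eval --model vllm --model_args pretrained=<path>,tensor_parallel_size=1,trust_remote_code=True --tasks gsm8k --num_fewshot 5 --trust_remote_code`, which would match the `--trust_remote_code` and "Selected Tasks: ['gsm8k']" messages at the top of the log.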