ValueError when serving Qwen3-Next-80B-A3B-Instruct-FP8
Hi, thank you for the great work.
I usually use the Qwen series for my local development, but I got a ValueError when using this model.
Here is my setup:
python: 3.12
env: pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly (Installed at 2025/09/23 8:47AM GMT+9)
GPU: 2*A100
I run vLLM using:
export HF_HOME=/data/huggingface_cache
export OMP_NUM_THREADS=8
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
--api-key token-abc123 \
--tensor-parallel-size 2 \
--served-model-name chat_model \
--port 5580 \
--enable-auto-tool-choice --tool-call-parser hermes
And I got:
(Worker_TP0 pid=234067) ERROR 09-23 09:51:10 [multiproc_executor.py:585] ValueError: Detected some but not all shards of model.layers.0.linear_attn.in_proj are quantized. All shards of fused layers to have the same precision.
A100 does not support FP8.
Hi,
Thank you for the response.
My understanding is that partial support is available through software implementations like the FP8 Marlin kernel, as mentioned in a GitHub issue.
Also, I am successfully serving other FP8 models (such as Qwen3-30B-A3B) on this exact A100 setup without encountering this error.
The ValueError: Detected some but not all shards... are quantized error seems to be specific to the Qwen3-Next-80B-A3B-Instruct-FP8 model.
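For reference, a quick sanity check of what the hardware reports (native FP8 tensor cores require compute capability 8.9 or higher; A100 reports 8.0, which is why vLLM falls back to the weight-only FP8 Marlin path):

python -c "import torch; print(torch.cuda.get_device_capability(0))"
# prints (8, 0) on A100; capability (8, 9) or (9, 0) is needed for native FP8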
I would appreciate it if you could point out anything I'm misunderstanding about this.
Thanks.
Same error on H100 ...
For the error ValueError: Detected some but not all shards of model.layers.0.linear_attn.in_proj are quantized. All shards of fused layers to have the same precision.
This is probably caused by a vLLM version that is too old. You can first try running pip list to see which vLLM version is installed. If the version string does not contain something like dev or +g, try something like:
pip install "vllm!=0.10.2" --pre --extra-index-url https://wheels.vllm.ai/nightly
For Ampere GPUs, this comment https://github.com/vllm-project/vllm/pull/25079#issuecomment-3305384401 may also help.
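As a concrete version of that check (a nightly build carries a .dev suffix and a +g<commit> tag in its version string, e.g. 0.11.0rc2.dev38+g5e25b1223, while a bare 0.10.2 means the stable release was installed):

pip list | grep -i vllm
# or, equivalently:
python -c "import vllm; print(vllm.__version__)"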
I used vLLM 0.10.2. I think 0.10.2 is more recent than 0.10.2rc3, right?
pip install "vllm!=0.10.2" --pre --extra-index-url https://wheels.vllm.ai/nightly
I installed vLLM 0.10.2rc3, which may have resolved the in_proj error, but a new error occurred:
AttributeError: '_OpNamespace' '_moe_C' object has no attribute 'topk_softmax'
I updated vLLM to the following version:
vllm 0.11.0rc2.dev38+g5e25b1223.cu129
After this, Qwen3-Next-80B works fine without MTP.
But I got an error with the MTP option enabled:
export HF_HOME=/data/yw_nam/huggingface_cache
export OMP_NUM_THREADS=8
# Even if I reduce max-model-len to 32768, I still get the same error.
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
--api-key token-abc123 \
--tensor-parallel-size 2 \
--served-model-name chat_model \
--port 5580 \
--enable-auto-tool-choice --tool-call-parser hermes \
--max-model-len 262144 \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
Here is the full log:
INFO 09-24 09:47:06 [__init__.py:216] Automatically detected platform cuda.
(APIServer pid=4121880) INFO 09-24 09:47:14 [api_server.py:1822] vLLM API server version 0.11.0rc2.dev38+g5e25b1223
(APIServer pid=4121880) INFO 09-24 09:47:14 [utils.py:328] non-default args: {'model_tag': 'Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', 'port': 5580, 'api_key': ['token-abc123'], 'enable_auto_tool_choice': True, 'tool_call_parser': 'hermes', 'model': 'Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', 'max_model_len': 262144, 'served_model_name': ['chat_model'], 'tensor_parallel_size': 2, 'speculative_config': {'method': 'qwen3_next_mtp', 'num_speculative_tokens': 2}}
(APIServer pid=4121880) INFO 09-24 09:47:16 [model.py:550] Resolved architecture: Qwen3NextForCausalLM
(APIServer pid=4121880) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=4121880) INFO 09-24 09:47:16 [model.py:1577] Using max model len 262144
(APIServer pid=4121880) INFO 09-24 09:47:19 [model.py:550] Resolved architecture: Qwen3NextMTP
(APIServer pid=4121880) INFO 09-24 09:47:19 [model.py:1577] Using max model len 262144
(APIServer pid=4121880) WARNING 09-24 09:47:19 [speculative.py:332] All Qwen3Next MTP models only have one layer. Might need some code changes to support multiple layers.
(APIServer pid=4121880) INFO 09-24 09:47:19 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=4121880) INFO 09-24 09:47:19 [config.py:310] Hybrid or mamba-based model detected: disabling prefix caching since it is not yet supported.
(APIServer pid=4121880) INFO 09-24 09:47:19 [config.py:321] Hybrid or mamba-based model detected: setting cudagraph mode to FULL_AND_PIECEWISE in order to optimize performance.
(APIServer pid=4121880) INFO 09-24 09:47:20 [config.py:390] Setting attention block size to 560 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=4121880) INFO 09-24 09:47:20 [config.py:411] Padding mamba page size by 1.45% to ensure that mamba page size and attention page size are exactly equal.
INFO 09-24 09:47:28 [__init__.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=4122318) INFO 09-24 09:47:31 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=4122318) INFO 09-24 09:47:31 [core.py:77] Initializing a V1 LLM engine (v0.11.0rc2.dev38+g5e25b1223) with config: model='Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', speculative_config=SpeculativeConfig(method='qwen3_next_mtp', model='Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', num_spec_tokens=2), tokenizer='Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=chat_model, enable_prefix_caching=False, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["+quant_fp8"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,1],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_DP0 pid=4122318) INFO 09-24 09:47:31 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 16777216, 10, 'psm_0a9171a2'), local_subscribe_addr='ipc:///tmp/9bfdd3f2-c620-4525-9218-14e8140eaac3', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-24 09:47:38 [__init__.py:216] Automatically detected platform cuda.
INFO 09-24 09:47:38 [__init__.py:216] Automatically detected platform cuda.
INFO 09-24 09:47:43 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_1d7c81ad'), local_subscribe_addr='ipc:///tmp/d645b72b-7845-4d0a-bfbf-db664633caa9', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-24 09:47:43 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_fb0f5a8c'), local_subscribe_addr='ipc:///tmp/48d6c452-2e65-474b-837e-27ab9d663f10', remote_subscribe_addr=None, remote_addr_ipv6=False)
[W924 09:47:44.145627302 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W924 09:47:44.146252962 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
INFO 09-24 09:47:45 [__init__.py:1382] Found nccl from library libnccl.so.2
INFO 09-24 09:47:45 [__init__.py:1382] Found nccl from library libnccl.so.2
INFO 09-24 09:47:45 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 09-24 09:47:45 [pynccl.py:103] vLLM is using nccl==2.27.3
WARNING 09-24 09:47:45 [symm_mem.py:58] SymmMemCommunicator: Device capability 8.0 not supported, communicator is not available.
WARNING 09-24 09:47:45 [symm_mem.py:58] SymmMemCommunicator: Device capability 8.0 not supported, communicator is not available.
INFO 09-24 09:47:45 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
INFO 09-24 09:47:45 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
INFO 09-24 09:47:45 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_f535de3a'), local_subscribe_addr='ipc:///tmp/eedb85be-b1f0-420f-8739-3e6ebb16e975', remote_subscribe_addr=None, remote_addr_ipv6=False)
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
INFO 09-24 09:47:45 [parallel_state.py:1201] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
INFO 09-24 09:47:45 [parallel_state.py:1201] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 09-24 09:47:45 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 09-24 09:47:46 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(Worker_TP1 pid=4122452) INFO 09-24 09:47:46 [gpu_model_runner.py:2539] Starting to load model Qwen/Qwen3-Next-80B-A3B-Instruct-FP8...
(Worker_TP0 pid=4122451) INFO 09-24 09:47:46 [gpu_model_runner.py:2539] Starting to load model Qwen/Qwen3-Next-80B-A3B-Instruct-FP8...
(Worker_TP1 pid=4122452) INFO 09-24 09:47:46 [gpu_model_runner.py:2571] Loading model from scratch...
(Worker_TP1 pid=4122452) `torch_dtype` is deprecated! Use `dtype` instead!
(Worker_TP1 pid=4122452) WARNING 09-24 09:47:46 [fp8.py:465] Failed to import DeepGemm kernels.
(Worker_TP1 pid=4122452) WARNING 09-24 09:47:46 [fp8.py:488] CutlassBlockScaledGroupedGemm not supported on the current platform.
(Worker_TP0 pid=4122451) INFO 09-24 09:47:46 [gpu_model_runner.py:2571] Loading model from scratch...
(Worker_TP0 pid=4122451) `torch_dtype` is deprecated! Use `dtype` instead!
(Worker_TP0 pid=4122451) WARNING 09-24 09:47:46 [fp8.py:465] Failed to import DeepGemm kernels.
(Worker_TP0 pid=4122451) WARNING 09-24 09:47:46 [fp8.py:488] CutlassBlockScaledGroupedGemm not supported on the current platform.
(Worker_TP0 pid=4122451) INFO 09-24 09:47:46 [cuda.py:347] Using Flash Attention backend on V1 engine.
(Worker_TP1 pid=4122452) INFO 09-24 09:47:46 [cuda.py:347] Using Flash Attention backend on V1 engine.
(Worker_TP1 pid=4122452) INFO 09-24 09:47:47 [weight_utils.py:392] Using model weights format ['*.safetensors']
(Worker_TP0 pid=4122451) INFO 09-24 09:47:47 [weight_utils.py:392] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/8 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 12% Completed | 1/8 [00:05<00:39, 5.71s/it]
Loading safetensors checkpoint shards: 25% Completed | 2/8 [00:12<00:37, 6.25s/it]
Loading safetensors checkpoint shards: 38% Completed | 3/8 [00:16<00:25, 5.09s/it]
Loading safetensors checkpoint shards: 50% Completed | 4/8 [00:21<00:21, 5.28s/it]
Loading safetensors checkpoint shards: 62% Completed | 5/8 [00:28<00:17, 5.68s/it]
Loading safetensors checkpoint shards: 75% Completed | 6/8 [00:34<00:11, 5.97s/it]
Loading safetensors checkpoint shards: 88% Completed | 7/8 [00:40<00:06, 6.09s/it]
(Worker_TP1 pid=4122452) INFO 09-24 09:48:36 [default_loader.py:267] Loading weights took 47.44 seconds
(Worker_TP1 pid=4122452) WARNING 09-24 09:48:36 [marlin_utils_fp8.py:80] Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:47<00:00, 6.16s/it]
Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:47<00:00, 5.90s/it]
(Worker_TP0 pid=4122451)
(Worker_TP0 pid=4122451) INFO 09-24 09:48:37 [default_loader.py:267] Loading weights took 48.01 seconds
(Worker_TP0 pid=4122451) WARNING 09-24 09:48:37 [marlin_utils_fp8.py:80] Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(Worker_TP1 pid=4122452) INFO 09-24 09:48:42 [gpu_model_runner.py:2578] Loading drafter model...
(Worker_TP1 pid=4122452) INFO 09-24 09:48:42 [weight_utils.py:392] Using model weights format ['*.safetensors']
(Worker_TP0 pid=4122451) INFO 09-24 09:48:44 [gpu_model_runner.py:2578] Loading drafter model...
(Worker_TP0 pid=4122451) INFO 09-24 09:48:44 [weight_utils.py:392] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/8 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 12% Completed | 1/8 [00:00<00:04, 1.74it/s]
Loading safetensors checkpoint shards: 25% Completed | 2/8 [00:01<00:03, 1.64it/s]
Loading safetensors checkpoint shards: 38% Completed | 3/8 [00:02<00:04, 1.14it/s]
Loading safetensors checkpoint shards: 50% Completed | 4/8 [00:03<00:03, 1.21it/s]
Loading safetensors checkpoint shards: 62% Completed | 5/8 [00:03<00:02, 1.38it/s]
(Worker_TP1 pid=4122452) INFO 09-24 09:48:49 [default_loader.py:267] Loading weights took 5.42 seconds
Loading safetensors checkpoint shards: 75% Completed | 6/8 [00:04<00:01, 1.47it/s]
(Worker_TP1 pid=4122452) INFO 09-24 09:48:49 [eagle.py:840] Assuming the EAGLE head shares the same vocab embedding with the target model.
(Worker_TP1 pid=4122452) INFO 09-24 09:48:49 [eagle.py:859] Loading EAGLE LM head weights from the target model.
Loading safetensors checkpoint shards: 88% Completed | 7/8 [00:04<00:00, 1.53it/s]
(Worker_TP1 pid=4122452) INFO 09-24 09:48:50 [gpu_model_runner.py:2590] Model loading took 38.8782 GiB and 62.950298 seconds
Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:05<00:00, 1.58it/s]
Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:05<00:00, 1.46it/s]
(Worker_TP0 pid=4122451)
(Worker_TP0 pid=4122451) INFO 09-24 09:48:50 [default_loader.py:267] Loading weights took 5.62 seconds
(Worker_TP0 pid=4122451) INFO 09-24 09:48:50 [eagle.py:840] Assuming the EAGLE head shares the same vocab embedding with the target model.
(Worker_TP0 pid=4122451) INFO 09-24 09:48:50 [eagle.py:859] Loading EAGLE LM head weights from the target model.
(Worker_TP0 pid=4122451) INFO 09-24 09:48:51 [gpu_model_runner.py:2590] Model loading took 38.8782 GiB and 64.106744 seconds
(Worker_TP0 pid=4122451) INFO 09-24 09:49:04 [backends.py:548] Using cache directory: /home/ubuntu/.cache/vllm/torch_compile_cache/08694ae76f/rank_0_0/backbone for vLLM's torch.compile
(Worker_TP1 pid=4122452) INFO 09-24 09:49:04 [backends.py:548] Using cache directory: /home/ubuntu/.cache/vllm/torch_compile_cache/08694ae76f/rank_1_0/backbone for vLLM's torch.compile
(Worker_TP1 pid=4122452) INFO 09-24 09:49:04 [backends.py:559] Dynamo bytecode transform time: 12.40 s
(Worker_TP0 pid=4122451) INFO 09-24 09:49:04 [backends.py:559] Dynamo bytecode transform time: 12.42 s
(Worker_TP0 pid=4122451) INFO 09-24 09:49:07 [backends.py:164] Directly load the compiled graph(s) for dynamic shape from the cache, took 2.820 s
(Worker_TP1 pid=4122452) INFO 09-24 09:49:07 [backends.py:164] Directly load the compiled graph(s) for dynamic shape from the cache, took 2.878 s
(Worker_TP0 pid=4122451) INFO 09-24 09:49:09 [marlin_utils.py:353] You are running Marlin kernel with bf16 on GPUs before SM90. You can consider change to fp16 to achieve better performance if possible.
(Worker_TP1 pid=4122452) INFO 09-24 09:49:09 [marlin_utils.py:353] You are running Marlin kernel with bf16 on GPUs before SM90. You can consider change to fp16 to achieve better performance if possible.
(Worker_TP0 pid=4122451) INFO 09-24 09:49:09 [monitor.py:34] torch.compile takes 12.42 s in total
(Worker_TP1 pid=4122452) INFO 09-24 09:49:09 [monitor.py:34] torch.compile takes 12.40 s in total
(Worker_TP0 pid=4122451) INFO 09-24 09:49:10 [backends.py:548] Using cache directory: /home/ubuntu/.cache/vllm/torch_compile_cache/08694ae76f/rank_0_0/eagle_head for vLLM's torch.compile
(Worker_TP0 pid=4122451) INFO 09-24 09:49:10 [backends.py:559] Dynamo bytecode transform time: 0.80 s
(Worker_TP1 pid=4122452) INFO 09-24 09:49:10 [backends.py:548] Using cache directory: /home/ubuntu/.cache/vllm/torch_compile_cache/08694ae76f/rank_1_0/eagle_head for vLLM's torch.compile
(Worker_TP1 pid=4122452) INFO 09-24 09:49:10 [backends.py:559] Dynamo bytecode transform time: 0.80 s
(Worker_TP1 pid=4122452) INFO 09-24 09:49:11 [backends.py:164] Directly load the compiled graph(s) for dynamic shape from the cache, took 0.105 s
(Worker_TP0 pid=4122451) INFO 09-24 09:49:11 [backends.py:164] Directly load the compiled graph(s) for dynamic shape from the cache, took 0.105 s
(Worker_TP1 pid=4122452) INFO 09-24 09:49:11 [monitor.py:34] torch.compile takes 13.21 s in total
(Worker_TP0 pid=4122451) INFO 09-24 09:49:11 [monitor.py:34] torch.compile takes 13.21 s in total
(Worker_TP1 pid=4122452) INFO 09-24 09:49:13 [gpu_worker.py:306] Available KV cache memory: 30.54 GiB
(Worker_TP0 pid=4122451) INFO 09-24 09:49:13 [gpu_worker.py:306] Available KV cache memory: 30.54 GiB
(EngineCore_DP0 pid=4122318) WARNING 09-24 09:49:13 [kv_cache_utils.py:982] Add 3 padding layers, may waste at most 8.33% KV cache memory
(EngineCore_DP0 pid=4122318) INFO 09-24 09:49:13 [kv_cache_utils.py:1087] GPU KV cache size: 615,440 tokens
(EngineCore_DP0 pid=4122318) INFO 09-24 09:49:13 [kv_cache_utils.py:1091] Maximum concurrency for 262,144 tokens per request: 9.32x
(EngineCore_DP0 pid=4122318) INFO 09-24 09:49:13 [kv_cache_utils.py:1087] GPU KV cache size: 615,440 tokens
(EngineCore_DP0 pid=4122318) INFO 09-24 09:49:13 [kv_cache_utils.py:1091] Maximum concurrency for 262,144 tokens per request: 9.32x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:08<00:00, 8.13it/s]
Capturing CUDA graphs (decode, FULL): 0%| | 0/65 [00:00<?, ?it/s]
(Worker_TP1 pid=4122452) INFO 09-24 09:49:22 [custom_all_reduce.py:203] Registering 4087 cuda graph addresses
(Worker_TP0 pid=4122451) INFO 09-24 09:49:22 [custom_all_reduce.py:203] Registering 4087 cuda graph addresses
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] WorkerProc hit an exception.
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] Traceback (most recent call last):
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 666, in worker_busy_loop
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] output = func(*args, **kwargs)
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 352, in compile_or_warm_up_model
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] cuda_graph_memory_bytes = self.model_runner.capture_model()
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3363, in capture_model
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] self._capture_cudagraphs(
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3431, in _capture_cudagraphs
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] self._dummy_run(num_tokens,
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] return func(*args, **kwargs)
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3006, in _dummy_run
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] .build_for_cudagraph_capture(common_attn_metadata)
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/attention/backends/gdn_attn.py", line 317, in build_for_cudagraph_capture
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] assert (m.num_reqs * (self.num_spec + 1) <= m.num_actual_tokens
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] AssertionError: GDN only supports decode-only full CUDAGraph capture. Make sure all cudagraph capture sizes <= max_num_seq.
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] Traceback (most recent call last):
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 666, in worker_busy_loop
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] output = func(*args, **kwargs)
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 352, in compile_or_warm_up_model
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] cuda_graph_memory_bytes = self.model_runner.capture_model()
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3363, in capture_model
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] self._capture_cudagraphs(
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3431, in _capture_cudagraphs
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] self._dummy_run(num_tokens,
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] return func(*args, **kwargs)
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3006, in _dummy_run
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] .build_for_cudagraph_capture(common_attn_metadata)
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/attention/backends/gdn_attn.py", line 317, in build_for_cudagraph_capture
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] assert (m.num_reqs * (self.num_spec + 1) <= m.num_actual_tokens
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671] AssertionError: GDN only supports decode-only full CUDAGraph capture. Make sure all cudagraph capture sizes <= max_num_seq.
(Worker_TP1 pid=4122452) ERROR 09-24 09:49:23 [multiproc_executor.py:671]
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] WorkerProc hit an exception.
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] Traceback (most recent call last):
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 666, in worker_busy_loop
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] output = func(*args, **kwargs)
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 352, in compile_or_warm_up_model
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] cuda_graph_memory_bytes = self.model_runner.capture_model()
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3363, in capture_model
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] self._capture_cudagraphs(
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3431, in _capture_cudagraphs
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] self._dummy_run(num_tokens,
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] return func(*args, **kwargs)
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3006, in _dummy_run
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] .build_for_cudagraph_capture(common_attn_metadata)
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/attention/backends/gdn_attn.py", line 317, in build_for_cudagraph_capture
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] assert (m.num_reqs * (self.num_spec + 1) <= m.num_actual_tokens
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] AssertionError: GDN only supports decode-only full CUDAGraph capture. Make sure all cudagraph capture sizes <= max_num_seq.
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] Traceback (most recent call last):
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 666, in worker_busy_loop
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] output = func(*args, **kwargs)
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 352, in compile_or_warm_up_model
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] cuda_graph_memory_bytes = self.model_runner.capture_model()
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3363, in capture_model
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] self._capture_cudagraphs(
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3431, in _capture_cudagraphs
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] self._dummy_run(num_tokens,
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] return func(*args, **kwargs)
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3006, in _dummy_run
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] .build_for_cudagraph_capture(common_attn_metadata)
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/attention/backends/gdn_attn.py", line 317, in build_for_cudagraph_capture
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] assert (m.num_reqs * (self.num_spec + 1) <= m.num_actual_tokens
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671] AssertionError: GDN only supports decode-only full CUDAGraph capture. Make sure all cudagraph capture sizes <= max_num_seq.
(Worker_TP0 pid=4122451) ERROR 09-24 09:49:23 [multiproc_executor.py:671]
(EngineCore_DP0 pid=4122318) ERROR 09-24 09:49:23 [core.py:708] EngineCore failed to start.
(EngineCore_DP0 pid=4122318) ERROR 09-24 09:49:23 [core.py:708] Traceback (most recent call last):
(EngineCore_DP0 pid=4122318) ERROR 09-24 09:49:23 [core.py:708] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=4122318) ERROR 09-24 09:49:23 [core.py:708] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=4122318) ERROR 09-24 09:49:23 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4122318) ERROR 09-24 09:49:23 [core.py:708] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=4122318) ERROR 09-24 09:49:23 [core.py:708] super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=4122318) ERROR 09-24 09:49:23 [core.py:708] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 92, in __init__
(EngineCore_DP0 pid=4122318) ERROR 09-24 09:49:23 [core.py:708] self._initialize_kv_caches(vllm_config)
(EngineCore_DP0 pid=4122318) ERROR 09-24 09:49:23 [core.py:708] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 207, in _initialize_kv_caches
(EngineCore_DP0 pid=4122318) ERROR 09-24 09:49:23 [core.py:708] self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore_DP0 pid=4122318) ERROR 09-24 09:49:23 [core.py:708] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 75, in initialize_from_config
(EngineCore_DP0 pid=4122318) ERROR 09-24 09:49:23 [core.py:708] self.collective_rpc("compile_or_warm_up_model")
(EngineCore_DP0 pid=4122318) ERROR 09-24 09:49:23 [core.py:708] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 264, in collective_rpc
(EngineCore_DP0 pid=4122318) ERROR 09-24 09:49:23 [core.py:708] result = get_response(w, dequeue_timeout,
(EngineCore_DP0 pid=4122318) ERROR 09-24 09:49:23 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4122318) ERROR 09-24 09:49:23 [core.py:708] File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 248, in get_response
(EngineCore_DP0 pid=4122318) ERROR 09-24 09:49:23 [core.py:708] raise RuntimeError(
(EngineCore_DP0 pid=4122318) ERROR 09-24 09:49:23 [core.py:708] RuntimeError: Worker failed with error 'GDN only supports decode-only full CUDAGraph capture. Make sure all cudagraph capture sizes <= max_num_seq.', please check the stack trace above for the root cause
(EngineCore_DP0 pid=4122318) ERROR 09-24 09:49:26 [multiproc_executor.py:154] Worker proc VllmWorker-1 died unexpectedly, shutting down executor.
(EngineCore_DP0 pid=4122318) Process EngineCore_DP0:
(EngineCore_DP0 pid=4122318) Traceback (most recent call last):
(EngineCore_DP0 pid=4122318) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=4122318) self.run()
(EngineCore_DP0 pid=4122318) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=4122318) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=4122318) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 712, in run_engine_core
(EngineCore_DP0 pid=4122318) raise e
(EngineCore_DP0 pid=4122318) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=4122318) engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=4122318) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4122318) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=4122318) super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=4122318) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 92, in __init__
(EngineCore_DP0 pid=4122318) self._initialize_kv_caches(vllm_config)
(EngineCore_DP0 pid=4122318) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 207, in _initialize_kv_caches
(EngineCore_DP0 pid=4122318) self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore_DP0 pid=4122318) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 75, in initialize_from_config
(EngineCore_DP0 pid=4122318) self.collective_rpc("compile_or_warm_up_model")
(EngineCore_DP0 pid=4122318) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 264, in collective_rpc
(EngineCore_DP0 pid=4122318) result = get_response(w, dequeue_timeout,
(EngineCore_DP0 pid=4122318) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4122318) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 248, in get_response
(EngineCore_DP0 pid=4122318) raise RuntimeError(
(EngineCore_DP0 pid=4122318) RuntimeError: Worker failed with error 'GDN only supports decode-only full CUDAGraph capture. Make sure all cudagraph capture sizes <= max_num_seq.', please check the stack trace above for the root cause
(APIServer pid=4121880) Traceback (most recent call last):
(APIServer pid=4121880) File "/home/ubuntu/miniforge3/envs/yw_vllm/bin/vllm", line 7, in <module>
(APIServer pid=4121880) sys.exit(main())
(APIServer pid=4121880) ^^^^^^
(APIServer pid=4121880) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 54, in main
(APIServer pid=4121880) args.dispatch_function(args)
(APIServer pid=4121880) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 50, in cmd
(APIServer pid=4121880) uvloop.run(run_server(args))
(APIServer pid=4121880) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=4121880) return __asyncio.run(
(APIServer pid=4121880) ^^^^^^^^^^^^^^
(APIServer pid=4121880) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=4121880) return runner.run(main)
(APIServer pid=4121880) ^^^^^^^^^^^^^^^^
(APIServer pid=4121880) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=4121880) return self._loop.run_until_complete(task)
(APIServer pid=4121880) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=4121880) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=4121880) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=4121880) return await main
(APIServer pid=4121880) ^^^^^^^^^^
(APIServer pid=4121880) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1867, in run_server
(APIServer pid=4121880) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=4121880) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1885, in run_server_worker
(APIServer pid=4121880) async with build_async_engine_client(
(APIServer pid=4121880) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=4121880) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=4121880) return await anext(self.gen)
(APIServer pid=4121880) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=4121880) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client
(APIServer pid=4121880) async with build_async_engine_client_from_engine_args(
(APIServer pid=4121880) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=4121880) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=4121880) return await anext(self.gen)
(APIServer pid=4121880) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=4121880) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 223, in build_async_engine_client_from_engine_args
(APIServer pid=4121880) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=4121880) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=4121880) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/utils/__init__.py", line 1570, in inner
(APIServer pid=4121880) return fn(*args, **kwargs)
(APIServer pid=4121880) ^^^^^^^^^^^^^^^^^^^
(APIServer pid=4121880) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 207, in from_vllm_config
(APIServer pid=4121880) return cls(
(APIServer pid=4121880) ^^^^
(APIServer pid=4121880) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 134, in __init__
(APIServer pid=4121880) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=4121880) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=4121880) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
(APIServer pid=4121880) return AsyncMPClient(*client_args)
(APIServer pid=4121880) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=4121880) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 769, in __init__
(APIServer pid=4121880) super().__init__(
(APIServer pid=4121880) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 448, in __init__
(APIServer pid=4121880) with launch_core_engines(vllm_config, executor_class,
(APIServer pid=4121880) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=4121880) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=4121880) next(self.gen)
(APIServer pid=4121880) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 732, in launch_core_engines
(APIServer pid=4121880) wait_for_engine_startup(
(APIServer pid=4121880) File "/home/ubuntu/miniforge3/envs/yw_vllm/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 785, in wait_for_engine_startup
(APIServer pid=4121880) raise RuntimeError("Engine core initialization failed. "
(APIServer pid=4121880) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
I used vLLM 0.10.2. I think 0.10.2 is more recent than 0.10.2rc3, right?
There was an issue with how vLLM determined the versions of the nightly wheels, which is explained at https://github.com/vllm-project/vllm/issues/25476. It is now fixed.
AttributeError: '_OpNamespace' '_moe_C' object has no attribute 'topk_softmax'
This is commonly caused by mismatched CUDA versions between PyTorch, vLLM, and other packages. You can first try installing in a fresh environment, and if the error still occurs, you can also check the two related issues.
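A rough sketch of the fresh-environment route, assuming the nightly index used earlier in this thread (the environment name here is arbitrary):

# create and activate a clean virtual environment
python -m venv vllm-fresh && source vllm-fresh/bin/activate
# reinstall the nightly wheel
pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
# check that the CUDA build of PyTorch and the installed vLLM version look consistent
python -c "import torch; print(torch.__version__, torch.version.cuda)"
python -c "import vllm; print(vllm.__version__)"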
AssertionError: GDN only supports decode-only full CUDAGraph capture. Make sure all cudagraph capture sizes <= max_num_seq.
You can try disabling full CUDAGraph capture manually by adding -O.cudagraph_mode PIECEWISE to the start command. We're not sure why the default changes to FULL for Qwen3-Next.
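For concreteness, a minimal sketch of how the flag attaches to the serve command (the other flags from the original command are omitted here; -O.cudagraph_mode PIECEWISE keeps the piecewise CUDA graphs but skips the FULL decode capture that triggers the assertion):

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
--tensor-parallel-size 2 \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
-O.cudagraph_mode PIECEWISE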
That worked. I've added the --enforce-eager option and can confirm that I'm successfully receiving responses from the Qwen3-Next-80B model with MTP enabled. Thank you for your help!
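For reference, a sketch of the resulting command (the MTP serve command from above with --enforce-eager appended; --enforce-eager disables CUDA graph capture entirely, trading some decode throughput for a simpler execution path):

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
--api-key token-abc123 \
--tensor-parallel-size 2 \
--served-model-name chat_model \
--port 5580 \
--enable-auto-tool-choice --tool-call-parser hermes \
--max-model-len 262144 \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
--enforce-eager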