Error when running in vLLM

#1 opened by d8rt8v

I get KeyError: 'layers.31.mlp.shared_expert.down_proj.weight' when I run this quant on the latest vllm (v0.10.2rc3.dev13+gfdb09c77d) with an H100 GPU, installed via

pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

Run command

vllm serve cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit  --gpu-memory-utilization 0.9 --host 0.0.0.0 --port 8080 --max-model-len 60000

I have the same error on both this and the Thinking version.

The weight is in model-00007-of-00010.safetensors. Could you check the SHA256?

lrwxrwxrwx 1 owner owner 76 Sep 12 10:05 model-00007-of-00010.safetensors -> ../../blobs/f8bc272ecbbf035e204b83cee5d610409a3ef33811838a5b5a10fcb10f452f37

sha256sum f8bc272ecbbf035e204b83cee5d610409a3ef33811838a5b5a10fcb10f452f37
f8bc272ecbbf035e204b83cee5d610409a3ef33811838a5b5a10fcb10f452f37 f8bc272ecbbf035e204b83cee5d610409a3ef33811838a5b5a10fcb10f452f37
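
If it helps to rule out a corrupted download, here is a minimal sketch (assuming the safetensors package is installed, and that the shard path below points at your local copy of the file) that lists the shared_expert keys actually stored in that shard:

from safetensors import safe_open

# hypothetical local path; point this at the blob the symlink above resolves to
shard = "model-00007-of-00010.safetensors"

with safe_open(shard, framework="pt", device="cpu") as f:
    # print every tensor name in this shard that belongs to a shared_expert module
    print(sorted(k for k in f.keys() if "shared_expert" in k))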

Same issue here.

I get a similar error:

KeyError: 'layers.20.mlp.shared_expert.down_proj.weight'

...using the main branch of vLLM as recommended (0.10.2rc3.dev23+gb0d1213ac).

Owner

Hi everyone, I am really sorry for this.

In addition to the loading error, some important components are over-quantized, so the model outputs gibberish. The model is being re-quantized now and should be done within the next 16-18 hours.

However, the model can be loaded in the meantime by replacing /vllm/vllm/model_executor/models/qwen3_next.py with the updated qwen3_next.py.
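
(Not the exact procedure above, just a sketch, assuming vllm is importable and your build already ships a qwen3_next.py: this prints the path of the file your install actually loads, i.e. the one to swap out.)

# locate the qwen3_next.py used by the installed vllm package
import vllm.model_executor.models.qwen3_next as qwen3_next
print(qwen3_next.__file__)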

Thanks for the update. I will check back tomorrow!

Owner

Hey, I have reuploaded the weights and it works!

It turns out that ignoring shared_expert during quantization prevents the model from loading properly in vllm afterwards.

For some reason, NCCL_SHM_DISABLE=1 is required in my local environment to avoid NCCL errors; I don't know whether that applies to others. Please consider setting NCCL_SHM_DISABLE=1 if any NCCL problems occur.

Please redownload the weights, and let me know what you think!
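
(For anyone using the offline Python API rather than vllm serve, a minimal sketch of the same NCCL_SHM_DISABLE workaround; the variable has to be set before vllm spins up its workers, and the model/generation parameters here are only illustrative.)

import os
os.environ["NCCL_SHM_DISABLE"] = "1"  # must be set before vllm initializes NCCL

from vllm import LLM, SamplingParams

llm = LLM(
    model="cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit",
    tensor_parallel_size=2,  # illustrative; match your GPU count
    max_model_len=8192,
)
out = llm.generate(["Hello, how are you?"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)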

I'm able to get a response from the model via API endpoints on my build. vLLM isn't optimal for my setup because I have one 4090 and three 3090s, but I'm able to get responses! The speed isn't the best right now, but this model seems to be working in its current form.

I have no clue why, but sometimes on startup, or right after startup with the first prompt, or when loading large contexts, my CUDA device 2 crashes hard, to the point that the computer stops detecting it. It's always this device, number 2. The model is spread evenly across all four cards, so I'm chalking it up to an issue with uneven architecture distribution. Just thought it should be noted; if this happens to others, perhaps something else is up.

Startup Command:

vllm serve cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit   --tensor-parallel-size 4   --max-model-len 8192   --dtype float16 --enforce-eager

Speeds:

INFO 09-13 10:56:46 [loggers.py:123] Engine 000: Avg prompt throughput: 80.5 tokens/s, Avg generation throughput: 8.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

I'll admit, I'm an ik_llama.cpp type of guy, but I've been dying to test this model.
Great work @cpatonn! I look forward to future optimizations.

Great!
I ran the following command, and it works perfectly:

vllm serve cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit \
  --tensor-parallel-size 2 \
  --max-model-len 95000 \
  --gpu-memory-utilization 0.88 \
  --host 0.0.0.0 \
  --port 11435 \
  --dtype float16

Works for me now, although it takes quite some time to see the first output. It feels like it does reasoning. Isn't this variant a "non-thinker"? ;-) I'll have to check the vLLM issues that might pop up.

(EngineCore_DP0 pid=414811) /opt/pluski/svc/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/fla/ops/utils.py:105: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (31) < num_heads (32). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=414811)   return fn(*contiguous_args, **contiguous_kwargs)
(EngineCore_DP0 pid=414811) /opt/pluski/svc/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/fla/ops/utils.py:105: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (31) < num_heads (32). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=414811)   return fn(*contiguous_args, **contiguous_kwargs)
(APIServer pid=414728) INFO 09-13 15:48:53 [loggers.py:123] Engine 000: Avg prompt throughput: 3.1 tokens/s, Avg generation throughput: 106.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

I do not get any NCCL errors, only these warnings.

Running v0.10.2rc3.dev50+g15b8fef45 on an NVIDIA H100 NVL.

VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit --dtype float16

Thanks for all your efforts!

So it is working now, although on the first request after startup it takes about 30 seconds to start outputting tokens.
If I start with:
NCCL_SHM_DISABLE=1 CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,2,3,1 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit --port 8000 --tensor-parallel-size 2 --pipeline-parallel-size 2 --max-model-len 235000
I see this in the logs:

(Worker_PP0_TP0 pid=84334) /home/owner/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py:523: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1578.)
(Worker_PP0_TP0 pid=84334)   object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)

And if I start with:
NCCL_SHM_DISABLE=1 VLLM_PP_LAYER_PARTITION="15,9,9,15" CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,2,3,1 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit --port 8000 --tensor-parallel-size 1 --pipeline-parallel-size 4 --max-model-len 235000
I see this in the logs:

(Worker_PP0 pid=81631) /home/owner/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/fla/ops/utils.py:105: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (21) < num_heads (32). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_PP0 pid=81631)   return fn(*contiguous_args, **contiguous_kwargs)
(Worker_PP0 pid=81631) /home/owner/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py:523: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1578.)
(Worker_PP0 pid=81631)   object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)

I am using this inside WSL2, but I almost exclusively use AWQ models from cpatonn and have never seen issues like this before.

Can this run on 2 x 24 GB cards? Mine are A5000s and I can't figure out the right parameters. --cpu-offload-gb seems non-functional at the moment: when I enable offload I get AssertionError: Cannot re-initialize the input batch when CPU weight offloading is enabled. See https://github.com/vllm-project/vllm/pull/18298 for more details. And if I don't enable offloading, it always OOMs at some point, despite setting --max-model-len 1024 --gpu-memory-utilization 0.98 -tp 2 --max-num-seqs 4 (I tried many other combinations as well).

About the 30 s to get the first answer: it's happening to me too with the bf16 weights, though I'm using pipeline parallelism.

I'm unable to run this on vLLM 0.10.2 (on an RTX Pro 6000 96GB):

I see KeyError: 'layers.24.mlp.shared_expert.down_proj.weight'

Does it only run on the 0.10.2rc3 release candidate?

Owner

My apologies, the fix was merged 3 days ago and is not in the nightly build yet. Please build vllm from source to use the latest model update.

Got it, thanks. I can confirm that it works.

I seem to be getting this error on the latest vllm version (0.11.1rc4.dev71+g94666612a.d20251028). Do I still need to be on the nightly branch for this fix?

I still have problems on vllm v0.11.1-2.
1x H800, CUDA 13
docker run -d --restart=always --gpus '"device=0"' -p 58000:8000 -e HF_HOME=/root/.cache/huggingface --ipc=host --shm-size=80gb -v /data/rnd/qwen25:/root/.cache/huggingface -v /data/rnd/qwen25/start_vllm.sh:/tmp/start_vllm.sh --entrypoint /tmp/start_vllm.sh vllm/vllm-openai:v0.11.1 --model cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit --gpu-memory-utilization 0.9 --port 8000 --tensor-parallel-size 1 --seed 1337 --max-model-len 262144 --max-num-seqs 32 --trust-remote-code --enable-auto-tool-choice --tool-call-parser hermes --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

(APIServer pid=11) INFO 11-20 04:13:57 [chat_utils.py:557] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this.
(APIServer pid=11) INFO: 10.35.56.2:57654 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO 11-20 04:15:03 [loggers.py:236] Engine 000: Avg prompt throughput: 206.5 tokens/s, Avg generation throughput: 55.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=11) INFO 11-20 04:15:03 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 2.95, Accepted throughput: 0.65 tokens/s, Drafted throughput: 0.67 tokens/s, Accepted: 367 tokens, Drafted: 376 tokens, Per-position acceptance rate: 0.989, 0.963, Avg Draft acceptance rate: 97.6%
(APIServer pid=11) INFO 11-20 04:15:13 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=11) INFO 11-20 04:17:23 [loggers.py:236] Engine 000: Avg prompt throughput: 206.5 tokens/s, Avg generation throughput: 164.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.9%, Prefix cache hit rate: 0.0%
(APIServer pid=11) INFO 11-20 04:17:23 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 2.97, Accepted throughput: 7.79 tokens/s, Drafted throughput: 7.91 tokens/s, Accepted: 1091 tokens, Drafted: 1108 tokens, Per-position acceptance rate: 0.986, 0.984, Avg Draft acceptance rate: 98.5%
(APIServer pid=11) INFO 11-20 04:17:33 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 231.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.2%, Prefix cache hit rate: 0.0%
(APIServer pid=11) INFO 11-20 04:17:33 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 3.00, Accepted throughput: 154.58 tokens/s, Drafted throughput: 154.58 tokens/s, Accepted: 1546 tokens, Drafted: 1546 tokens, Per-position acceptance rate: 1.000, 1.000, Avg Draft acceptance rate: 100.0%
(APIServer pid=11) INFO: 10.35.56.2:53844 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:53844 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO 11-20 04:17:43 [loggers.py:236] Engine 000: Avg prompt throughput: 206.5 tokens/s, Avg generation throughput: 61.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=11) INFO 11-20 04:17:43 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 2.91, Accepted throughput: 40.40 tokens/s, Drafted throughput: 42.30 tokens/s, Accepted: 404 tokens, Drafted: 423 tokens, Per-position acceptance rate: 0.972, 0.934, Avg Draft acceptance rate: 95.5%
(APIServer pid=11) INFO 11-20 04:17:53 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=11) INFO: 10.35.56.2:54098 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO 11-20 04:18:33 [loggers.py:236] Engine 000: Avg prompt throughput: 413.0 tokens/s, Avg generation throughput: 55.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=11) INFO 11-20 04:18:33 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 2.94, Accepted throughput: 7.32 tokens/s, Drafted throughput: 7.56 tokens/s, Accepted: 366 tokens, Drafted: 378 tokens, Per-position acceptance rate: 0.979, 0.958, Avg Draft acceptance rate: 96.8%
(APIServer pid=11) INFO: 10.35.56.2:54098 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO 11-20 04:18:43 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 55.4 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=11) INFO 11-20 04:18:43 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 2.92, Accepted throughput: 36.50 tokens/s, Drafted throughput: 38.00 tokens/s, Accepted: 365 tokens, Drafted: 380 tokens, Per-position acceptance rate: 0.979, 0.942, Avg Draft acceptance rate: 96.1%
(APIServer pid=11) INFO 11-20 04:18:53 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO 11-20 04:19:43 [loggers.py:236] Engine 000: Avg prompt throughput: 410.1 tokens/s, Avg generation throughput: 142.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.9%, Prefix cache hit rate: 0.0%
(APIServer pid=11) INFO 11-20 04:19:43 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 2.93, Accepted throughput: 15.67 tokens/s, Drafted throughput: 16.27 tokens/s, Accepted: 940 tokens, Drafted: 976 tokens, Per-position acceptance rate: 0.990, 0.936, Avg Draft acceptance rate: 96.3%
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO 11-20 04:19:53 [loggers.py:236] Engine 000: Avg prompt throughput: 931.7 tokens/s, Avg generation throughput: 207.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=11) INFO 11-20 04:19:53 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 2.91, Accepted throughput: 136.19 tokens/s, Drafted throughput: 142.39 tokens/s, Accepted: 1362 tokens, Drafted: 1424 tokens, Per-position acceptance rate: 0.986, 0.927, Avg Draft acceptance rate: 95.6%
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO 11-20 04:20:03 [loggers.py:236] Engine 000: Avg prompt throughput: 856.4 tokens/s, Avg generation throughput: 210.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=11) INFO 11-20 04:20:03 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 2.92, Accepted throughput: 138.09 tokens/s, Drafted throughput: 144.19 tokens/s, Accepted: 1381 tokens, Drafted: 1442 tokens, Per-position acceptance rate: 0.982, 0.933, Avg Draft acceptance rate: 95.8%
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO 11-20 04:20:13 [loggers.py:236] Engine 000: Avg prompt throughput: 942.5 tokens/s, Avg generation throughput: 207.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.9%, Prefix cache hit rate: 0.0%
(APIServer pid=11) INFO 11-20 04:20:13 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 2.92, Accepted throughput: 136.19 tokens/s, Drafted throughput: 142.19 tokens/s, Accepted: 1362 tokens, Drafted: 1422 tokens, Per-position acceptance rate: 0.979, 0.937, Avg Draft acceptance rate: 95.8%
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO 11-20 04:20:23 [loggers.py:236] Engine 000: Avg prompt throughput: 664.4 tokens/s, Avg generation throughput: 210.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=11) INFO 11-20 04:20:23 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 2.90, Accepted throughput: 137.90 tokens/s, Drafted throughput: 145.00 tokens/s, Accepted: 1379 tokens, Drafted: 1450 tokens, Per-position acceptance rate: 0.974, 0.928, Avg Draft acceptance rate: 95.1%
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO 11-20 04:20:33 [loggers.py:236] Engine 000: Avg prompt throughput: 893.9 tokens/s, Avg generation throughput: 205.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=11) INFO 11-20 04:20:33 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 2.87, Accepted throughput: 133.99 tokens/s, Drafted throughput: 142.99 tokens/s, Accepted: 1340 tokens, Drafted: 1430 tokens, Per-position acceptance rate: 0.976, 0.898, Avg Draft acceptance rate: 93.7%
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO 11-20 04:20:43 [loggers.py:236] Engine 000: Avg prompt throughput: 1013.4 tokens/s, Avg generation throughput: 206.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=11) INFO 11-20 04:20:43 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 2.92, Accepted throughput: 135.70 tokens/s, Drafted throughput: 141.40 tokens/s, Accepted: 1357 tokens, Drafted: 1414 tokens, Per-position acceptance rate: 0.989, 0.931, Avg Draft acceptance rate: 96.0%
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO 11-20 04:20:53 [loggers.py:236] Engine 000: Avg prompt throughput: 1596.5 tokens/s, Avg generation throughput: 200.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.9%, Prefix cache hit rate: 0.0%
(APIServer pid=11) INFO 11-20 04:20:53 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 2.92, Accepted throughput: 131.69 tokens/s, Drafted throughput: 137.39 tokens/s, Accepted: 1317 tokens, Drafted: 1374 tokens, Per-position acceptance rate: 0.985, 0.932, Avg Draft acceptance rate: 95.9%
(APIServer pid=11) INFO 11-20 04:21:03 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 228.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.2%, Prefix cache hit rate: 0.0%
(APIServer pid=11) INFO 11-20 04:21:03 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 2.95, Accepted throughput: 150.79 tokens/s, Drafted throughput: 154.99 tokens/s, Accepted: 1508 tokens, Drafted: 1550 tokens, Per-position acceptance rate: 0.994, 0.952, Avg Draft acceptance rate: 97.3%
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO 11-20 04:21:13 [loggers.py:236] Engine 000: Avg prompt throughput: 879.1 tokens/s, Avg generation throughput: 207.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=11) INFO 11-20 04:21:13 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 2.90, Accepted throughput: 135.79 tokens/s, Drafted throughput: 142.99 tokens/s, Accepted: 1358 tokens, Drafted: 1430 tokens, Per-position acceptance rate: 0.976, 0.923, Avg Draft acceptance rate: 95.0%
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=11) INFO 11-20 04:21:23 [loggers.py:236] Engine 000: Avg prompt throughput: 812.9 tokens/s, Avg generation throughput: 215.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=11) INFO 11-20 04:21:23 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 2.92, Accepted throughput: 141.39 tokens/s, Drafted throughput: 147.39 tokens/s, Accepted: 1414 tokens, Drafted: 1474 tokens, Per-position acceptance rate: 0.982, 0.936, Avg Draft acceptance rate: 95.9%
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.11.1) with config: model='cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit', speculative_config=SpeculativeConfig(method='mtp', model='cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit', num_spec_tokens=2), tokenizer='cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=1337, served_model_name=cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '/root/.cache/vllm/torch_compile_cache/a88a910a3c', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [3, 6, 9, 18, 24, 33, 42, 48, 57], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 57, 'local_cache_dir': '/root/.cache/vllm/torch_compile_cache/a88a910a3c/rank_0_0/eagle_head'},
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-499ac363311842d587a5f34a1158e548,prompt_token_ids_len=2603,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=42, stop=[], stop_token_ids=[151643], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4096, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, structured_outputs=None, extra_args=None),block_ids=([637, 638, 639], [640, 641, 642], [643, 644, 645], [646, 647, 648, 649, 650]),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_req_ids=[], new_token_ids=[], all_token_ids={}, new_block_ids=[], num_computed_tokens=[], num_output_tokens=[]), num_scheduled_tokens={chatcmpl-499ac363311842d587a5f34a1158e548: 2603}, total_num_scheduled_tokens=2603, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0], finished_req_ids=[], free_encoder_mm_hashes=[], pending_structured_output_tokens=false, kv_connector_metadata=null, ec_connector_metadata=null)
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.008115942028985468, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={})
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] Traceback (most recent call last):
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 835, in run_engine_core
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] engine_core.run_busy_loop()
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 862, in run_busy_loop
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] self._process_engine_step()
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 891, in _process_engine_step
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 344, in step
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] model_output = self.model_executor.sample_tokens(grammar_output)
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 107, in sample_tokens
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] return self.collective_rpc(
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 75, in collective_rpc
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 479, in run_method
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] return func(*args, **kwargs)
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] return func(*args, **kwargs)
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 539, in sample_tokens
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] return self.model_runner.sample_tokens(grammar_output)
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] return func(*args, **kwargs)
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2986, in sample_tokens
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] ) = self._bookkeeping_sync(
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2480, in _bookkeeping_sync
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] valid_sampled_token_ids = self._to_list(sampled_token_ids)
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5097, in _to_list
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] self.transfer_event.synchronize()
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] File "/usr/local/lib/python3.12/dist-packages/torch/cuda/streams.py", line 231, in synchronize
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] super().synchronize()
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] torch.AcceleratorError: CUDA error: an illegal memory access was encountered
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] Search for `cudaErrorIllegalAddress` in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_DP0 pid=150) ERROR 11-20 04:21:25 [core.py:844]
(EngineCore_DP0 pid=150) Process EngineCore_DP0:
(APIServer pid=11) ERROR 11-20 04:21:25 [async_llm.py:525] AsyncLLM output_handler failed.
(APIServer pid=11) ERROR 11-20 04:21:25 [async_llm.py:525] Traceback (most recent call last):
(APIServer pid=11) ERROR 11-20 04:21:25 [async_llm.py:525] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 477, in output_handler
(APIServer pid=11) ERROR 11-20 04:21:25 [async_llm.py:525] outputs = await engine_core.get_output_async()
(APIServer pid=11) ERROR 11-20 04:21:25 [async_llm.py:525] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=11) ERROR 11-20 04:21:25 [async_llm.py:525] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 883, in get_output_async
(APIServer pid=11) ERROR 11-20 04:21:25 [async_llm.py:525] raise self._format_exception(outputs) from None
(APIServer pid=11) ERROR 11-20 04:21:25 [async_llm.py:525] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(EngineCore_DP0 pid=150) Traceback (most recent call last):
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(EngineCore_DP0 pid=150) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=150) self.run()
(EngineCore_DP0 pid=150) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=150) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=150) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 846, in run_engine_core
(EngineCore_DP0 pid=150) raise e
(EngineCore_DP0 pid=150) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 835, in run_engine_core
(EngineCore_DP0 pid=150) engine_core.run_busy_loop()
(EngineCore_DP0 pid=150) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 862, in run_busy_loop
(EngineCore_DP0 pid=150) self._process_engine_step()
(EngineCore_DP0 pid=150) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 891, in _process_engine_step
(EngineCore_DP0 pid=150) outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=150) ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=150) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 344, in step
(EngineCore_DP0 pid=150) model_output = self.model_executor.sample_tokens(grammar_output)
(EngineCore_DP0 pid=150) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=150) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 107, in sample_tokens
(EngineCore_DP0 pid=150) return self.collective_rpc(
(EngineCore_DP0 pid=150) ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=150) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 75, in collective_rpc
(EngineCore_DP0 pid=150) result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=150) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=150) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 479, in run_method
(EngineCore_DP0 pid=150) return func(*args, **kwargs)
(EngineCore_DP0 pid=150) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=150) File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=150) return func(*args, **kwargs)
(EngineCore_DP0 pid=150) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=150) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 539, in sample_tokens
(EngineCore_DP0 pid=150) return self.model_runner.sample_tokens(grammar_output)
(EngineCore_DP0 pid=150) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=150) File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=150) return func(*args, **kwargs)
(EngineCore_DP0 pid=150) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=150) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2986, in sample_tokens
(EngineCore_DP0 pid=150) ) = self._bookkeeping_sync(
(EngineCore_DP0 pid=150) ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=150) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2480, in _bookkeeping_sync
(EngineCore_DP0 pid=150) valid_sampled_token_ids = self._to_list(sampled_token_ids)
(EngineCore_DP0 pid=150) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=150) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5097, in _to_list
(EngineCore_DP0 pid=150) self.transfer_event.synchronize()
(EngineCore_DP0 pid=150) File "/usr/local/lib/python3.12/dist-packages/torch/cuda/streams.py", line 231, in synchronize
(EngineCore_DP0 pid=150) super().synchronize()
(EngineCore_DP0 pid=150) torch.AcceleratorError: CUDA error: an illegal memory access was encountered
(EngineCore_DP0 pid=150) Search for `cudaErrorIllegalAddress` in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=150) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=150) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=150) Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(EngineCore_DP0 pid=150)
[rank0]:[W1120 04:21:25.921242116 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=11) INFO: 10.35.56.2:39376 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=11) INFO: Shutting down
(APIServer pid=11) INFO: Waiting for application shutdown.
(APIServer pid=11) INFO: Application shutdown complete.
(APIServer pid=11) INFO: Finished server process [11]

I found that only vllm v0.11.0 works fine.
