Wrong output #2
by bullerwins - opened
The model doesn't produce any readable output, but vLLM appears to be generating tokens (see the response below).
Launch command:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 VLLM_PP_LAYER_PARTITION=8,6,23,6,6,6,7 vllm serve \
  /mnt/llms/models/QuantTrio/MiniMax-M2-AWQ/ \
  --served-model-name MiniMax-M2-AWQ \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --swap-space 16 \
  --max-num-seqs 32 \
  --max-model-len 32000 \
  --gpu-memory-utilization 0.9 \
  --tensor-parallel-size 1 -pp 7 \
  --enable-expert-parallel \
  --trust-remote-code \
  --disable-log-requests \
  --host 0.0.0.0 \
  --port 5000
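For reference (not from the original post): VLLM_PP_LAYER_PARTITION takes one layer count per pipeline stage, and the entries should sum to the model's num_hidden_layers (here 8+6+23+6+6+6+7 = 62). A minimal sanity check, assuming the AWQ repo ships a standard config.json at the path above:

python3 - <<'EOF'
import json

# Values copied from the launch command above.
partition = [8, 6, 23, 6, 6, 6, 7]
model_dir = "/mnt/llms/models/QuantTrio/MiniMax-M2-AWQ/"

with open(model_dir + "config.json") as f:
    cfg = json.load(f)

print("partition sum:    ", sum(partition))
print("num_hidden_layers:", cfg.get("num_hidden_layers"))
EOF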
System:
CUDA0=5090
CUDA1=3090
CUDA2=rtx6000
CUDA3=3090
CUDA4=3090
CUDA5=3090
CUDA6=5090
curl http://192.168.10.115:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MiniMax-M2-AWQ",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Summarize the benefits of distributed training."}
],
"max_tokens": 300,
"temperature": 0.7
}'
Response:
{"id":"chatcmpl-182ff17cc60b4e1fa9269406320996b3","object":"chat.completion","created":1761677032,"model":"MiniMax-M2-AWQ","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":30,"total_tokens":330,"completion_tokens":300,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
Mixing Blackwell cards with older Ampere/Ada ones in a single run is pretty tough; vLLM's Blackwell support is still maturing. At this point, the most direct path is to report it upstream to the vLLM team.
Your GLM-4.6 quant worked perfectly with this same setup, though, so I was wondering whether it's something specific to this model.