ooof this fits in 4x96gb, can we get this for the new 3.2 Speciale as well please :)
ooof this fits in 4x96gb, can we get this for the new 3.2 Speciale as well please :)
you ran it on RTX PRO 6000? please share
you ran it on RTX PRO 6000? please share
oooh wait, no, I thought it was this one but it's actually the Intel AutoRound one
VLLM_SLEEP_WHEN_IDLE=1 CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve Intel/DeepSeek-V3.1-Terminus-int4-mixed-AutoRound/ --tensor-parallel-size 4 --served-model-name deepseek --max-model-len 58776 --tool-call-parser deepseek_v31 --enable-auto-tool-choice --enable-expert-parallel --gpu-memory-utilization 0.945 --trust-remote-code --port 8080 --enable-chunked-prefill --max-num-batched-tokens 2048 --block-size 8 --max-num-seqs 2 --chat-template examples/tool_chat_template_deepseekv31.jinja
you need this chat template to make tool calling fly seamlessly: https://github.com/vllm-project/vllm/blob/main/examples/tool_chat_template_deepseekv31.jinja
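in case it helps, a minimal sketch of how you could grab that template locally before launching (the raw URL is just the GitHub link above converted; adjust the output path to whatever suits your setup):
# fetch the DeepSeek V3.1 tool-calling chat template from the vLLM repo
wget -O tool_chat_template_deepseekv31.jinja https://raw.githubusercontent.com/vllm-project/vllm/main/examples/tool_chat_template_deepseekv31.jinja
# then point the server at the local copy via --chat-template ./tool_chat_template_deepseekv31.jinja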
you can probably experiment with the settings in the launch command, not sure if those settings were what I finally landed on
The DeepSeek-V3.2 series, including the Speciale version, is currently on our roadmap.
Yay amazing looking forward to it!
appreciate the lite versions as they fit in 384GB of VRAM (4x96GB). Now I can't remember if I got this one working with -tp 4 on RTX 6000s, but it's about the same size (~360GB) as the Intel AutoRound one. I'll have to test this again. Or do you know off the bat what the difference might be between the two in terms of compatibility with RTX Blackwell cards?
I’m not entirely sure about this. What we currently know is that Ampere-architecture GPUs cannot run the DeepSeek-V3.2 series models (DeepSeek-V3.2-Exp).
Started messing with Intel AutoRound yesterday to see if I could quantize it using their library, but Transformers was complaining that deepseek-v32 isn't supported yet, which I found weird since Exp has been out for some time already
👀
--max-model-len 58776
Very odd number 😂
Started messing with Intel AutoRound yesterday to see if I could quantize it using their library, but Transformers was complaining that deepseek-v32 isn't supported yet, which I found weird since Exp has been out for some time already
Same. I got the same error while quantizing it.
you ran it on RTX PRO 6000? please share
this AWQ runs in vLLM now too and I get a bigger KV cache and a bigger context window :)
GPU KV cache size: 107,104 tokens
VLLM_MARLIN_USE_ATOMIC_ADD=1 VLLM_SLEEP_WHEN_IDLE=1 CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve QuantTrio/DeepSeek-V3.1-AWQ-Lite --tensor-parallel-size 4 --served-model-name deepseek --max-model-len 85000 --tool-call-parser deepseek_v31 --enable-auto-tool-choice --enable-expert-parallel --gpu-memory-utilization 0.96 --enable-chunked-prefill --max-num-batched-tokens 4096 --block-size 8 --max-num-seqs 8 --chat-template vllm/examples/tool_chat_template_deepseekv31.jinja
Avg prompt throughput: 1287.2 tokens/s, Avg generation throughput: 117.0 tokens/s, Running: 7 reqs, Waiting: 0 reqs, GPU KV cache usage: 18.8%, Prefix cache hit rate: 42.9%
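if anyone wants a quick sanity check once the server is up, here's a rough sketch, assuming the default port 8000 (the command above doesn't set --port) and the served model name deepseek:
# minimal smoke test against vLLM's OpenAI-compatible chat endpoint
curl -s http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "deepseek", "messages": [{"role": "user", "content": "hello"}], "max_tokens": 32}'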
@Fernanda24
same command? it didn't crash? on 0.12.0?
yes, it crashed on a regular uv pip install -U vllm for 0.12.0, but it works if you use the command above inside this Docker container:
docker run -it --rm -v /models_dir:/models_dir/ --ipc=host --shm-size=8g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all --network host nvcr.io/nvidia/vllm:25.11-py3 bash
I think it's 0.11.2dev-something in the NVIDIA container, but it works (nvcr.io/nvidia/vllm:25.11-py3)
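a quick sketch of how to confirm which vLLM build the container ships, assuming python3 is on the PATH inside the NGC image, before running the same serve command as above against the mounted /models_dir:
# inside the container from the docker run above, check the bundled vLLM version
python3 -c "import vllm; print(vllm.__version__)"
# then launch with the same vllm serve command as above, pointing at the model under /models_dir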