ooof this fits in 4x96gb, can we get this for the new 3.2 Speciale as well please :)
ooof this fits in 4x96gb, can we get this for the new 3.2 Speciale as well please :)
you ran it on RTX PRO 6000? please share
you ran it on RTX PRO 6000? please share
oooh wait, no, I thought it was this one but it's actually the Intel AutoRound one
VLLM_SLEEP_WHEN_IDLE=1 CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve Intel/DeepSeek-V3.1-Terminus-int4-mixed-AutoRound/ --tensor-parallel-size 4 --served-model-name deepseek --max-model-len 58776 --tool-call-parser deepseek_v31 --enable-auto-tool-choice --enable-expert-parallel --gpu-memory-utilization 0.945 --trust-remote-code --port 8080 --enable-chunked-prefill --max-num-batched-tokens 2048 --block-size 8 --max-num-seqs 2 --chat-template examples/tool_chat_template_deepseekv31.jinja
you need this chat template to make tool calling fly seamlessly: https://github.com/vllm-project/vllm/blob/main/examples/tool_chat_template_deepseekv31.jinja
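in case it helps, a minimal sketch of how you could grab that template locally before launching (the raw URL is just the GitHub link above converted; adjust the output path to whatever suits your setup):
# fetch the DeepSeek V3.1 tool-calling chat template from the vLLM repo
wget -O tool_chat_template_deepseekv31.jinja https://raw.githubusercontent.com/vllm-project/vllm/main/examples/tool_chat_template_deepseekv31.jinja
# then point the server at the local copy via --chat-template ./tool_chat_template_deepseekv31.jinja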
you can probably experiment with the settings in the launch command, not sure if those settings were what I finally landed on
The DeepSeek-V3.2 series, including the Speciale version, is currently on our roadmap.
Yay amazing looking forward to it!
appreciate the lite versions as they fit in 384GB of VRAM (4x96GB). Now I can't remember if I got this one working with -tp 4 on RTX 6000s, but it's about the same size (~360GB) as the Intel AutoRound one. I'll have to test this again. Or do you know off the bat what the difference might be between the two in terms of compatibility with RTX Blackwell cards?
I’m not entirely sure about this. What we currently know is that Ampere-architecture GPUs cannot run the DeepSeek-V3.2 series models (DeepSeek-V3.2-Exp).
Started messing with Intel AutoRound yesterday to see if I could quantize it using their library, but Transformers was complaining that deepseek-v32 isn't supported yet, which I found weird since Exp has been out for some time already
👀
--max-model-len 58776
Very odd number 😂
Started messing with Intel AutoRound yesterday to see if I could quantize it using their library, but Transformers was complaining that deepseek-v32 isn't supported yet, which I found weird since Exp has been out for some time already
Same. I got the same error while quantizing it.
you ran it on RTX PRO 6000? please share
this AWQ runs in vLLM now too and I get a bigger KV cache and a bigger context window :)
GPU KV cache size: 107,104 tokens
VLLM_MARLIN_USE_ATOMIC_ADD=1 VLLM_SLEEP_WHEN_IDLE=1 CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve QuantTrio/DeepSeek-V3.1-AWQ-Lite --tensor-parallel-size 4 --served-model-name deepseek --max-model-len 85000 --tool-call-parser deepseek_v31 --enable-auto-tool-choice --enable-expert-parallel --gpu-memory-utilization 0.96 --enable-chunked-prefill --max-num-batched-tokens 4096 --block-size 8 --max-num-seqs 8 --chat-template vllm/examples/tool_chat_template_deepseekv31.jinja
Avg prompt throughput: 1287.2 tokens/s, Avg generation throughput: 117.0 tokens/s, Running: 7 reqs, Waiting: 0 reqs, GPU KV cache usage: 18.8%, Prefix cache hit rate: 42.9%
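if anyone wants a quick sanity check once the server is up, here's a rough sketch, assuming the default port 8000 (the command above doesn't set --port) and the served model name deepseek:
# minimal smoke test against vLLM's OpenAI-compatible chat endpoint
curl -s http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "deepseek", "messages": [{"role": "user", "content": "hello"}], "max_tokens": 32}'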
@Fernanda24
same command? it didn't crash? on 0.12.0?
yes, it crashed on a regular uv pip install -U vllm for 0.12.0, but it works if you use the command above inside this Docker container:
docker run -it --rm -v /models_dir:/models_dir/ --ipc=host --shm-size=8g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all --network host nvcr.io/nvidia/vllm:25.11-py3 bash
I think it's 0.11.2dev-something in the NVIDIA container, but it works (nvcr.io/nvidia/vllm:25.11-py3)
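a quick sketch of how to confirm which vLLM build the container ships, assuming python3 is on the PATH inside the NGC image, before running the same serve command as above against the mounted /models_dir:
# inside the container from the docker run above, check the bundled vLLM version
python3 -c "import vllm; print(vllm.__version__)"
# then launch with the same vllm serve command as above, pointing at the model under /models_dir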