How much VRAM is needed to run this model? 8x RTX 3090 = 192 GB isn't enough for the full context.

#12
by kq - opened

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 && vllm serve /models/tclf90/Qwen3-VL-235B-A22B-Thinking-AWQ --enable-expert-parallel --api-key token-deaf --port 12303 --gpu-memory-utilization 0.98 --max-num-seqs 16 --max-model-len 131072 --tensor-parallel-size 8 --enable-auto-tool-choice --tool-call-parser hermes --served-model-name qwen3-vl-thinking

Minimal vLLM Failure Log
Pre-Failure Process
Compilation Time: Workers spend approximately 37 seconds on torch.compile (Dynamo bytecode transform) and then load the compiled graphs in ~13 seconds.
Note: The main process logged a warning about processes hanging/compiling for 60 seconds.
KV Cache Allocation Check: Before the crash, the workers report the available KV cache memory:
Available KV Cache Memory per Worker: 2.55 GiB
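To put the 2.55 GiB in context, here is a rough per-GPU budget sketch. It assumes each RTX 3090 exposes about 24 GiB (the value the driver actually reports is slightly lower) and that everything inside the 0.98 budget that is not KV cache goes to weights, activations, and compiled-graph overhead. The function name and the breakdown are illustrative, not something vLLM prints.

```python
def per_gpu_budget(total_gib: float, utilization: float, kv_cache_gib: float) -> dict:
    """Split one GPU's memory into vLLM's budget, the KV cache, and everything else."""
    budget = total_gib * utilization  # what --gpu-memory-utilization allows vLLM to use
    return {
        "budget_gib": budget,
        "kv_cache_gib": kv_cache_gib,               # reported in the log
        "non_kv_gib": budget - kv_cache_gib,        # weights + activations + overhead
        "outside_budget_gib": total_gib - budget,   # left untouched by vLLM
    }


# Assumed 24 GiB per card; 0.98 and 2.55 GiB come from the command and the log.
b = per_gpu_budget(total_gib=24.0, utilization=0.98, kv_cache_gib=2.55)
for name, value in b.items():
    print(f"{name}: {value:.2f} GiB")
```

Under these assumptions roughly 21 GiB per card is already taken before any KV cache is allocated, so raising the utilization further cannot recover much.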
Critical Failure (VRAM Insufficiency) 🛑
Error Reason: EngineCore failed to start.
Root Cause: ValueError: To serve at least one request with the models's max seq len (131072), (5.88 GiB KV cache is needed, which is larger than the available KV cache memory (2.55 GiB).
Diagnosis: The engine attempted to initialize the KV cache but found the required memory (5.88 GiB per worker) was more than double the available memory (2.55 GiB per worker).
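The 5.88 GiB figure can be reproduced with a back-of-the-envelope KV cache calculation. The sketch below assumes the text backbone uses 94 layers, 4 KV heads, and a head dimension of 128 (my reading of the published Qwen3-235B-A22B config, not something shown in this log), a 16-bit KV cache, and that KV heads are replicated so each tensor-parallel rank holds at least one head when there are more ranks than heads.

```python
GIB = 1024 ** 3


def kv_bytes_per_token_per_worker(num_layers: int, num_kv_heads: int,
                                  head_dim: int, dtype_bytes: int,
                                  tp_size: int) -> int:
    """Bytes of KV cache one token occupies on a single tensor-parallel worker."""
    # With more TP ranks than KV heads, each rank still keeps one (replicated) head.
    heads_per_worker = max(1, num_kv_heads // tp_size)
    # Factor 2 covers the separate key and value tensors.
    return 2 * num_layers * heads_per_worker * head_dim * dtype_bytes


# Assumed architecture values (see lead-in); tp_size=8 and 131072 come from the command.
per_token = kv_bytes_per_token_per_worker(
    num_layers=94, num_kv_heads=4, head_dim=128, dtype_bytes=2, tp_size=8
)
needed_gib = per_token * 131072 / GIB
tokens_that_fit = int(2.55 * GIB // per_token)

print(f"KV cache per token per worker: {per_token} bytes")
print(f"Needed for 131072 tokens: {needed_gib:.3f} GiB per worker")
print(f"Tokens that fit in 2.55 GiB: {tokens_that_fit}")
```

If those architecture values are right, the result lands on the 5.88 GiB in the error message, and the number of tokens that fit in 2.55 GiB comes out near the engine's own estimate (the engine works from unrounded memory values and whole cache blocks, so a small difference is expected).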
Recommended Action:
Increase gpu_memory_utilization (if possible, though it was already at 0.98).
Decrease max_model_len.
Estimated Max Length: Based on the available memory, the estimated maximum model length is 56800 (down from the requested 131072).
Final Status: The API server received a RuntimeError: Engine core initialization failed and the process shut down.
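Using only the numbers in the log, a simple proportional estimate shows both how far the context has to shrink and how much extra memory the full 131072-token context would need at this parallel layout. This is a rough sketch: the helper names are made up for illustration, and it only covers a single full-length request, not 16 concurrent ones as allowed by --max-num-seqs.

```python
def scaled_max_len(requested_len: int, needed_gib: float, available_gib: float) -> int:
    """Scale the requested context length by the fraction of KV cache that fits."""
    return int(requested_len * available_gib / needed_gib)


def shortfall_gib(needed_gib: float, available_gib: float, num_workers: int) -> float:
    """Extra KV cache memory required across all workers for one full-length request."""
    return (needed_gib - available_gib) * num_workers


# 131072, 5.88 GiB, and 2.55 GiB come from the error message; 8 is the TP size.
est_len = scaled_max_len(131072, needed_gib=5.88, available_gib=2.55)
gap = shortfall_gib(needed_gib=5.88, available_gib=2.55, num_workers=8)

print(f"Estimated max model length: about {est_len} tokens")
print(f"Missing KV cache across 8 GPUs: about {gap:.2f} GiB")
```

The length estimate agrees with the 56800 reported by the engine to within rounding. The shortfall of roughly 27 GiB in total suggests that 192 GB falls short by a bit more than one additional 24 GB card's worth of memory for a single 131072-token request, assuming the per-GPU overhead stays the same.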

