RTX 5090
#1
by hdnminh - opened
Thanks for this contribution.
Have you tested it yet? Is it running? I'm hosting it with vLLM on two RTX 5090 cards (32 GB each), but I'm getting CUDA out of memory.
Could you share your compose file, please?
pip show vllm
Name: vllm
Version: 0.16.0rc2.dev465+g8a685be8d
Summary: A high-throughput and memory-efficient inference and serving engine for LLMs
Home-page: https://github.com/vllm-project/vllm
Author: vLLM Team
---
pip show transformers
Name: transformers
Version: 5.3.0.dev0
Summary: Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
Home-page: https://github.com/huggingface/transformers
vllm serve Sehyo/Qwen3.5-35B-A3B-NVFP4 \
--reasoning-parser qwen3 \
--gpu-memory-utilization 0.85 \
--async-scheduling \
--max-num-seqs 2 \
--limit-mm-per-prompt.video 0 \
--mm-processor-cache-gb 0
It works on my local 5090 desktop.
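No compose file was posted in the thread, but the serve command above could be translated into a docker-compose sketch along these lines. This is an assumption on my part, not the author's setup: it uses the official `vllm/vllm-openai` image, and adds `--tensor-parallel-size 2` to shard the model across both GPUs, which is the usual first thing to try for CUDA OOM on a two-card host:

```yaml
# Hypothetical compose sketch — not the author's actual file; adjust to your environment.
services:
  vllm:
    image: vllm/vllm-openai:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    ports:
      - "8000:8000"
    volumes:
      # Reuse the host's Hugging Face cache so the model isn't re-downloaded.
      - ~/.cache/huggingface:/root/.cache/huggingface
    command: >
      --model Sehyo/Qwen3.5-35B-A3B-NVFP4
      --tensor-parallel-size 2
      --reasoning-parser qwen3
      --gpu-memory-utilization 0.85
      --async-scheduling
      --max-num-seqs 2
      --limit-mm-per-prompt.video 0
      --mm-processor-cache-gb 0
```

If OOM persists even with tensor parallelism, lowering `--gpu-memory-utilization` or `--max-model-len` are the common next knobs to try.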
FYI, I just updated the model to include MTP.