RTX 5090

#1
by hdnminh - opened

Thanks for this contribution.

Have you tested it yet? Is it running? I'm hosting it on two RTX 5090 cards (32 GB each) with vLLM, but I got a CUDA out-of-memory error.

Owner

Share your compose file please

pip show vllm 
Name: vllm
Version: 0.16.0rc2.dev465+g8a685be8d
Summary: A high-throughput and memory-efficient inference and serving engine for LLMs
Home-page: https://github.com/vllm-project/vllm
Author: vLLM Team
---
pip show transformers
Name: transformers
Version: 5.3.0.dev0
Summary: Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
Home-page: https://github.com/huggingface/transformers

vllm serve Sehyo/Qwen3.5-35B-A3B-NVFP4 \
  --reasoning-parser qwen3 \
  --gpu-memory-utilization 0.85 \
  --async-scheduling \
  --max-num-seqs 2 \
  --limit-mm-per-prompt.video 0 \
  --mm-processor-cache-gb 0
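The invocation above runs on a single GPU; on a two-card setup, the OOM may simply be vLLM trying to fit the whole model on one device. A hedged sketch of the same command sharded across both cards with tensor parallelism (the `--tensor-parallel-size` flag is standard vLLM CLI, but this variant is untested with this model):

```shell
# Hypothetical two-GPU variant: split the model across both 5090s
# via tensor parallelism; other flags kept from the working command.
vllm serve Sehyo/Qwen3.5-35B-A3B-NVFP4 \
  --reasoning-parser qwen3 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.85 \
  --async-scheduling \
  --max-num-seqs 2 \
  --limit-mm-per-prompt.video 0 \
  --mm-processor-cache-gb 0
```

If it still OOMs, lowering `--gpu-memory-utilization` or `--max-model-len` is the usual next step.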

It works on my local 5090 desktop.

Owner

FYI I have updated the model to include MTP just now.
