RTX 5090
#1
by hdnminh - opened
Thanks for this contribution.
Have you tested it yet? Is it running? I'm hosting it with vLLM on two RTX 5090 cards (32 GB each), but I'm getting CUDA out of memory.
Could you share your compose file, please?
pip show vllm
Name: vllm
Version: 0.16.0rc2.dev465+g8a685be8d
Summary: A high-throughput and memory-efficient inference and serving engine for LLMs
Home-page: https://github.com/vllm-project/vllm
Author: vLLM Team
---
pip show transformers
Name: transformers
Version: 5.3.0.dev0
Summary: Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
Home-page: https://github.com/huggingface/transformers
vllm serve Sehyo/Qwen3.5-35B-A3B-NVFP4 \
--reasoning-parser qwen3 \
--gpu-memory-utilization 0.85 \
--async-scheduling \
--max-num-seqs 2 \
--limit-mm-per-prompt.video 0 \
--mm-processor-cache-gb 0
It works on my local 5090 desktop.
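No compose file was posted in the thread, but the serve command above could be translated into a docker-compose sketch along these lines. This is an assumption on my part, not the author's setup: it uses the official `vllm/vllm-openai` image, and adds `--tensor-parallel-size 2` to shard the model across both GPUs, which is the usual first thing to try for CUDA OOM on a two-card host:

```yaml
# Hypothetical compose sketch — not the author's actual file; adjust to your environment.
services:
  vllm:
    image: vllm/vllm-openai:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    ports:
      - "8000:8000"
    volumes:
      # Reuse the host's Hugging Face cache so the model isn't re-downloaded.
      - ~/.cache/huggingface:/root/.cache/huggingface
    command: >
      --model Sehyo/Qwen3.5-35B-A3B-NVFP4
      --tensor-parallel-size 2
      --reasoning-parser qwen3
      --gpu-memory-utilization 0.85
      --async-scheduling
      --max-num-seqs 2
      --limit-mm-per-prompt.video 0
      --mm-processor-cache-gb 0
```

If OOM persists even with tensor parallelism, lowering `--gpu-memory-utilization` or `--max-model-len` are the common next knobs to try.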
FYI, I just updated the model to include MTP.