A little confused on memory usage (vLLM newbie)

#12
by x-polyglot-x - opened

Hi all,

Thanks for producing these models, which seem very interesting! I am completely new to vLLM, but something is puzzling me.

I used vllm serve and could get the model working as follows:
vllm serve cerebras/GLM-4.5-Air-REAP-82B-A12B \
  --max-num-batched-tokens 16384 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.95

But it was extremely slow (0.3 tps). I then checked memory usage: I had 118 GB utilized, with another ~50 GB in swap. No wired memory was used (I am on an M4 Max with 128 GB).

~120 GB + 50 GB swap = 170 GB of usage. Why is that the case when the model is ~82 GB in size? Why does it essentially double in size?

I assume this has something to do with the bit width and the model not being quantized, but can someone explain this? Are there any ways to keep it at ~82 GB?

Thank you!

Cerebras org

@x-polyglot-x Thanks for trying our model out. These weights are in BF16, i.e. 2 bytes per parameter, so the 82B-parameter checkpoint takes roughly 164 GB for the weights alone, which is why memory use is about double what you expected. We are planning to release an FP8 checkpoint soon as well, which will help in your case. For now, you can also try playing with other vLLM settings such as --max-num-seqs 32, and/or do in-flight quantization with --quantization bitsandbytes and --load-format bitsandbytes (see here: https://docs.vllm.ai/en/latest/features/quantization/bnb.html)
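As a rough sketch, the in-flight quantization route could look something like the following. This is untested on your setup, and the length/utilization values are simply carried over from your original command rather than tuned recommendations:

vllm serve cerebras/GLM-4.5-Air-REAP-82B-A12B \
  --quantization bitsandbytes \
  --load-format bitsandbytes \
  --max-num-seqs 32 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.95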

Greetings!

Thank you for that detailed explanation. I imagined it was related to FP8 :).

Thanks for that reference on BitsAndBytes - that's interesting. I look forward to testing more of your models in the future!

x-polyglot-x changed discussion status to closed
Cerebras org

@x-polyglot-x we've just uploaded the FP8 version; check it out: https://hf.co/cerebras/GLM-4.5-Air-REAP-82B-A12B-FP8
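If you want to reuse your earlier settings, serving it should look roughly like this (an untested sketch that just swaps in the FP8 repo name and keeps the flags from your original command):

vllm serve cerebras/GLM-4.5-Air-REAP-82B-A12B-FP8 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.95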

@lazarevich Thank you very much! :)
