A little confused on memory usage (vLLM newbie)
Hi all,
Thanks for producing these models, which seem very interesting! I am completely new to vLLM, but something is puzzling me.
I used vllm serve and could get the model working as follows:
vllm serve cerebras/GLM-4.5-Air-REAP-82B-A12B \
    --max-num-batched-tokens 16384 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.95
But it was extremely slow (0.3 tps). I then checked memory usage: I had 118 GB utilized, with another ~50 GB in swap. No wired memory was used (I am on an M4 Max with 128 GB).
~120 GB + 50 GB swap = 170 GB of usage. Why is that the case when the model is ~82 GB in size? Why does it essentially double in size?
I am sure this has something to do with bits and it not being quantized, but can someone explain this? Are there any ways to keep it at ~82 GB?
Thank you!
@x-polyglot-x
Thanks for trying our model out. These weights are in BF16, so it's 2 bytes per parameter: 82B parameters × 2 bytes ≈ 164 GB for the weights alone (plus KV cache on top), which is why the footprint is roughly double the number you were expecting. We are planning to release an FP8 checkpoint soon as well, which will help in your case. For now, you can also try playing with other vLLM settings such as --max-num-seqs 32, and/or do in-flight quantization with --quantization bitsandbytes and --load-format bitsandbytes (see here: https://docs.vllm.ai/en/latest/features/quantization/bnb.html). A sketch of what that could look like is below.
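For reference, a minimal sketch of the in-flight quantization invocation, assuming your platform and vLLM build support bitsandbytes; the --max-num-seqs and context-length values here are just placeholders to tune for your machine:

# sketch: bitsandbytes in-flight quantization with a reduced batch size
vllm serve cerebras/GLM-4.5-Air-REAP-82B-A12B \
    --quantization bitsandbytes \
    --load-format bitsandbytes \
    --max-num-seqs 32 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.95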
Greetings!
Thank you for that detailed explanation. I suspected it was related to the precision and the lack of an FP8 checkpoint :).
Thanks for that reference on BitsAndBytes - that's interesting. I look forward to testing more of your models in the future!
@x-polyglot-x we've just uploaded the FP8 version, check it out: https://hf.co/cerebras/GLM-4.5-Air-REAP-82B-A12B-FP8
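In case it's useful, a minimal sketch of serving the FP8 checkpoint with the same settings as in the original post; whether FP8 weights load efficiently will depend on your hardware and vLLM build:

# sketch: serving the FP8 checkpoint (roughly half the weight memory of BF16)
vllm serve cerebras/GLM-4.5-Air-REAP-82B-A12B-FP8 \
    --max-num-batched-tokens 16384 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.95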