Offloading layers is not working for me

#9
by tnuvkeg - opened

I can run Unsloth UD-Q4_K_XL (245 GB) by offloading some layers to CPU with ik_llama (or llama.cpp), but no matter which layers I try to offload, I can't get IQ2_XS (132 GB) to load because of memory issues.
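For context, the kind of partial-offload invocation being discussed looks roughly like this. This is only a sketch: the model filename and the tensor-name regex are placeholders, and it assumes llama.cpp's `-ot`/`--override-tensor` flag for pinning matching tensors to a backend.

```shell
# Illustrative only: model path and regex are placeholders.
# -ngl 99 offloads all layers to the GPU by default; -ot then
# overrides the matching tensors (here, MoE expert FFN weights)
# back onto the CPU to fit within VRAM.
llama-server -m model-IQ2_XS.gguf -ngl 99 \
  -ot "ffn_.*_exps\.=CPU"
```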

Does the "new fused Up + Gate conversion" affect offloading in any way?

Even if you engage with "--fit on"?

Hi @tnuvkeg , have you pulled and compiled llama.cpp recently? There was a PR that just went in to fix mixed CPU + GPU offloading for the fused up + gate conversion: https://github.com/ggml-org/llama.cpp/pull/20910

Even if you engage with "--fit on"?

If in ik_llama I try with "--fit", I get: "llama_model_load: error loading model: Manual tensor overrides cannot be used with --fit"
(With "--fit on" I get "error: unknown argument: on".)

In llama.cpp I get the same memory error.

Hi @tnuvkeg , have you pulled and compiled llama.cpp recently? There was a PR that just went in to fix mixed CPU + GPU offloading for the fused up + gate conversion: https://github.com/ggml-org/llama.cpp/pull/20910

Hi,
yes, I always keep both ik_llama and llama.cpp updated; I just tried again with the latest versions and still get the same result... anyway, I guess it's only me, so no need to worry...

Ah, ik_llama does not have the --fit flag. The fused up + gate is the same amount of weights overall, but since they are fused into one tensor it is a bit trickier to get it to load into the available space. The only PR I'm aware of is 20910, linked previously, but if you still have a problem in llama.cpp, maybe open an issue with some of the logs from your server load? It sounds like another bug that may need fixing.
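The "trickier to fit" point can be made concrete with a toy sketch (this is not llama.cpp's actual placement logic, and all sizes are made up): with separate up and gate tensors, a planner can put each half on a different backend, but a fused tensor is all-or-nothing, so the same total weight becomes harder to pack into a fixed VRAM budget.

```python
def plan(tensor_sizes_mib, vram_mib):
    """Toy greedy first-fit: place tensors on GPU until VRAM runs out."""
    gpu, cpu, used = [], [], 0
    for name, size in tensor_sizes_mib:
        if used + size <= vram_mib:
            gpu.append(name)
            used += size
        else:
            cpu.append(name)
    return gpu, cpu

# Same total weights, different placement granularity (sizes in MiB, made up):
separate = [("ffn_up", 300), ("ffn_gate", 300)]
fused = [("ffn_up_gate", 600)]

# With 400 MiB free, the separate layout gets one half onto the GPU,
# while the fused tensor doesn't fit on the GPU at all.
print(plan(separate, 400))  # → (['ffn_up'], ['ffn_gate'])
print(plan(fused, 400))     # → ([], ['ffn_up_gate'])
```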

I need to look further into this. I could run big models just fine until yesterday, when I updated ik_llama; after that I needed to "downgrade" ik_llama to run the other models.
Although when I tried to run your IQ2_XS, I could load the Unsloth one just fine (and other big models like Kimi or DeepSeek)... I'll keep an eye on it and try again...

It might be related to how the layers are different sizes now? The bpw-selection algorithm from @eaddario can lead to uneven layer sizes, and maybe that's causing some form of loading issue?

Using llama.cpp and the regular --fit worked for me, since I loaded all of them to test PPL / KLD, so I think it might be a balancing issue of some kind?
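A small sketch of the suspected balancing issue (all numbers made up for illustration): a split that divides layers evenly by count implicitly assumes roughly uniform layer sizes, so per-layer bpw selection, which makes layer sizes vary, can overload one device even when the total would fit.

```python
def bytes_per_device(layer_sizes, n_devices):
    """Split layers evenly by count and sum the bytes landing on each device."""
    per = len(layer_sizes) // n_devices
    return [sum(layer_sizes[i * per:(i + 1) * per]) for i in range(n_devices)]

uniform = [100] * 8  # every layer the same size
uneven = [60, 60, 80, 100, 120, 140, 120, 120]  # same 800 total, varied bpw

print(bytes_per_device(uniform, 2))  # → [400, 400], balanced
print(bytes_per_device(uneven, 2))   # → [300, 500], second device overloaded
```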

Ah, ik_llama does not have the --fit flag.

Actually, it does... a dear developer at ik_llama (whom we all know very well) has blessed us with:
--fit-margin N --> safety margin in MiB when auto-fitting model offloading
--fit --> automatically determine which tensors to offload to the GPU(s)
As an extra bonus, per https://github.com/ikawrakow/ik_llama.cpp/pull/1540, "-ts"-based manual splits are now honored.
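Putting those flags together, an ik_llama invocation might look roughly like this. This is only a sketch: the model path, margin value, and split ratio are placeholders, and it assumes the flags behave as described in the quoted help text and in PR 1540.

```shell
# Illustrative only: model path and values are placeholders.
# --fit auto-selects which tensors to offload to the GPU(s),
# --fit-margin reserves 1024 MiB of headroom per device, and
# -ts 3,2 is a manual split ratio between two GPUs, which PR 1540
# says is now honored alongside --fit.
llama-server -m model-IQ2_XS.gguf \
  --fit --fit-margin 1024 \
  -ts 3,2
```

Note that, per the error quoted earlier in the thread, --fit cannot be combined with manual tensor overrides (-ot), so pick one approach or the other.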
