Offloading layers is not working for me

#9
by tnuvkeg - opened

I can run Unsloth UD-Q4_K_XL (245 GB) by offloading some layers to CPU with ik_llama (or llama.cpp), but no matter which layers I try to offload, I can't get IQ2_XS (132 GB) to load because of memory issues.
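For context, the kind of partial-offload invocation being discussed looks roughly like this. This is only a sketch: the model filename and the tensor-name regex are placeholders, and it assumes llama.cpp's `-ot`/`--override-tensor` flag for pinning matching tensors to a backend.

```shell
# Illustrative only: model path and regex are placeholders.
# -ngl 99 offloads all layers to the GPU by default; -ot then
# overrides the matching tensors (here, MoE expert FFN weights)
# back onto the CPU to fit within VRAM.
llama-server -m model-IQ2_XS.gguf -ngl 99 \
  -ot "ffn_.*_exps\.=CPU"
```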

Does the "new fused Up + Gate conversion" affect offloading in any way?

Even if you engage with "--fit on"?

Hi @tnuvkeg , have you pulled and compiled llama.cpp recently? There was a PR that just went in to fix mixed CPU + GPU offloading for the fused up + gate conversion: https://github.com/ggml-org/llama.cpp/pull/20910

Even if you engage with "--fit on"?

If in ik_llama I try with "--fit", I get: "llama_model_load: error loading model: Manual tensor overrides cannot be used with --fit"
(With "--fit on" I get "error: unknown argument: on".)

In llama.cpp I get the same memory error.

Hi @tnuvkeg , have you pulled and compiled llama.cpp recently? There was a PR that just went in to fix mixed CPU + GPU offloading for the fused up + gate conversion: https://github.com/ggml-org/llama.cpp/pull/20910

Hi,
yes, I always keep both ik_llama and llama.cpp updated; I just tried again with the latest versions and still get the same result... anyway, I guess it's only me, so no need to worry...

Ah, ik_llama does not have the --fit flag. The fused up + gate is the same amount of weights overall, but since they are fused into one tensor it is a bit trickier to get it to load into the available space. The only PR I'm aware of is 20910, linked previously, but if you still have a problem in llama.cpp, maybe open an issue with some of the logs from your server load? It sounds like another bug that may need fixing.
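The "trickier to fit" point can be made concrete with a toy sketch (this is not llama.cpp's actual placement logic, and all sizes are made up): with separate up and gate tensors, a planner can put each half on a different backend, but a fused tensor is all-or-nothing, so the same total weight becomes harder to pack into a fixed VRAM budget.

```python
def plan(tensor_sizes_mib, vram_mib):
    """Toy greedy first-fit: place tensors on GPU until VRAM runs out."""
    gpu, cpu, used = [], [], 0
    for name, size in tensor_sizes_mib:
        if used + size <= vram_mib:
            gpu.append(name)
            used += size
        else:
            cpu.append(name)
    return gpu, cpu

# Same total weights, different placement granularity (sizes in MiB, made up):
separate = [("ffn_up", 300), ("ffn_gate", 300)]
fused = [("ffn_up_gate", 600)]

# With 400 MiB free, the separate layout gets one half onto the GPU,
# while the fused tensor doesn't fit on the GPU at all.
print(plan(separate, 400))  # → (['ffn_up'], ['ffn_gate'])
print(plan(fused, 400))     # → ([], ['ffn_up_gate'])
```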

I need to look further into this. I could run big models just fine until yesterday, when I updated ik_llama; after that I needed to "downgrade" ik_llama to run the other models.
Although when I tried to run your IQ2_XS, I could load the Unsloth one just fine (and other big models like Kimi or DeepSeek)... I'll keep an eye on it and try again...

It might be related to how the layers are different sizes now? The bpw-selection algorithm from @eaddario can lead to uneven layer sizes, and maybe that's causing some form of loading issue?

Using llama.cpp and the regular --fit worked for me, since I loaded all of them to test PPL / KLD, so I think it might be a balancing issue of some kind?
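A small sketch of the suspected balancing issue (all numbers made up for illustration): a split that divides layers evenly by count implicitly assumes roughly uniform layer sizes, so per-layer bpw selection, which makes layer sizes vary, can overload one device even when the total would fit.

```python
def bytes_per_device(layer_sizes, n_devices):
    """Split layers evenly by count and sum the bytes landing on each device."""
    per = len(layer_sizes) // n_devices
    return [sum(layer_sizes[i * per:(i + 1) * per]) for i in range(n_devices)]

uniform = [100] * 8  # every layer the same size
uneven = [60, 60, 80, 100, 120, 140, 120, 120]  # same 800 total, varied bpw

print(bytes_per_device(uniform, 2))  # → [400, 400], balanced
print(bytes_per_device(uneven, 2))   # → [300, 500], second device overloaded
```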

Ah, ik_llama does not have the --fit flag.

Actually, it does... a dear developer at ik_llama (whom we all know very well) has blessed us with:
--fit-margin N --> safety margin in MiB when auto-fitting model offloading
--fit --> automatically determine which tensors to offload to the GPU(s)
As an extra bonus, per https://github.com/ikawrakow/ik_llama.cpp/pull/1540, "-ts"-based manual splits are now honored.
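Putting those flags together, an ik_llama invocation might look roughly like this. This is only a sketch: the model path, margin value, and split ratio are placeholders, and it assumes the flags behave as described in the quoted help text and in PR 1540.

```shell
# Illustrative only: model path and values are placeholders.
# --fit auto-selects which tensors to offload to the GPU(s),
# --fit-margin reserves 1024 MiB of headroom per device, and
# -ts 3,2 is a manual split ratio between two GPUs, which PR 1540
# says is now honored alongside --fit.
llama-server -m model-IQ2_XS.gguf \
  --fit --fit-margin 1024 \
  -ts 3,2
```

Note that, per the error quoted earlier in the thread, --fit cannot be combined with manual tensor overrides (-ot), so pick one approach or the other.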
