Will it perform better on bfloat16?

#14 opened by madmax0404

Please tell me

Probably depends on your GPUs? There are a lot of variables in optimizing both speed and quality. Generally speaking, a smaller quantization will be faster, assuming there is a good backend implementation of the kernels for your exact hardware.

Here is what I'm getting on two older CUDA GPUs: https://www.reddit.com/r/LocalLLaMA/comments/1pj9r93/now_40_faster_ik_llamacpp_sm_graph_on_2x_cuda_gpus/
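If you want numbers for your own setup, a quick throughput check is easy to script. Here is a minimal sketch using Hugging Face transformers with a placeholder model id (not necessarily this repo); it just times greedy decoding on a CUDA GPU and prints tokens per second, so you can rerun it with different dtypes or quantized builds and compare.

```python
# Minimal sketch for comparing decode speed across dtypes on your own GPU.
# Assumes a CUDA GPU; the model id below is a placeholder, not necessarily this repo.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/some-model"  # placeholder: swap in the checkpoint you want to test
dtype = torch.bfloat16            # try torch.float16 or a quantized build and compare

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype, device_map="auto")

inputs = tok("Explain bfloat16 in one sentence.", return_tensors="pt").to(model.device)

torch.cuda.synchronize()
t0 = time.time()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - t0

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/s at {dtype}")
```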

This was trained at fp8, so no: bf16 would just be upcasting the same values and would therefore perform the same.

(unless you meant speed-performance like @ubergarm seemed to assume)
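To make the "upcasting adds nothing" point concrete, here is a small sketch (assuming PyTorch 2.1 or later with float8 support): casting fp8 values up to bf16 and back recovers exactly the same values, so a bf16 copy of fp8-trained weights carries no extra information and just takes twice the memory.

```python
# Sketch: upcasting fp8 -> bf16 is exact, so it cannot add precision or quality.
# Assumes PyTorch >= 2.1 (float8_e4m3fn support); the random tensor stands in
# for fp8-trained weights.
import torch

w_fp8 = torch.randn(4, 4).to(torch.float8_e4m3fn)  # stand-in for fp8 weights
w_bf16 = w_fp8.to(torch.bfloat16)                   # every e4m3 value is exactly representable in bf16

# Round-tripping back down recovers identical values: the upcast added nothing.
roundtrip = w_bf16.to(torch.float8_e4m3fn)
assert torch.equal(w_fp8.to(torch.float32), roundtrip.to(torch.float32))

print("bytes per weight:", w_fp8.element_size(), "(fp8) vs", w_bf16.element_size(), "(bf16)")
```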

> This was trained at fp8, so no: bf16 would just be upcasting the same values and would therefore perform the same.

thanks bro
