Will it perform better on bfloat16?
#14 opened by madmax0404
Please tell me
It probably depends on your GPUs. There are a lot of variables when optimizing for both speed and quality. Generally speaking, a smaller quantization will be faster, assuming there is a good backend implementation of the kernels for your exact hardware.
Here is what I'm getting on two older CUDA GPUs: https://www.reddit.com/r/LocalLLaMA/comments/1pj9r93/now_40_faster_ik_llamacpp_sm_graph_on_2x_cuda_gpus/
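As a rough rule of thumb, single-stream decode tends to be memory-bandwidth bound, so you can sketch an upper bound on tokens/s yourself. Here's a minimal back-of-envelope Python sketch; the bandwidth and parameter-count numbers are hypothetical placeholders, not measurements from my setup:

```python
# Back-of-envelope sketch (not a benchmark): single-stream decode is usually
# memory-bandwidth bound, so tokens/s is roughly bandwidth / bytes-read-per-token.
# All numbers below are hypothetical placeholders.

GPU_BANDWIDTH_GB_S = 900   # hypothetical ~900 GB/s card
PARAMS_B = 30              # hypothetical 30B-parameter model

for name, bits in [("bf16", 16), ("fp8", 8), ("q4", 4)]:
    model_gb = PARAMS_B * bits / 8          # GB of weights read per generated token
    tok_s = GPU_BANDWIDTH_GB_S / model_gb   # upper bound; ignores KV cache and overhead
    print(f"{name}: ~{model_gb:.0f} GB of weights -> <= {tok_s:.0f} tok/s")
```

That's why the smaller quantization wins in practice, as long as the kernels for your hardware are good.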
This was trained at FP8, so no: BF16 would just be upcasting, and it would therefore run the same.
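If you want to convince yourself, here is a minimal PyTorch sketch (assuming the released weights use the float8_e4m3fn format, and PyTorch >= 2.1 for the float8 dtypes): upcasting FP8 to BF16 is exact, so casting back reproduces the same bits and no precision is gained.

```python
import torch

# Hypothetical example: assume the checkpoint stores weights as float8_e4m3fn.
w_fp8 = torch.randn(4, 4).to(torch.float8_e4m3fn)

# Upcasting to BF16 only re-expresses the same FP8 values in a wider format;
# no precision is recovered.
w_bf16 = w_fp8.to(torch.bfloat16)

# Round-tripping back to FP8 reproduces the exact same bit patterns,
# confirming the upcast added no information.
assert torch.equal(w_bf16.to(torch.float8_e4m3fn).view(torch.uint8),
                   w_fp8.view(torch.uint8))
```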
thanks bro