Request for BF16 mmproj
The base model seems to use BF16 rather than FP16, so I think it makes sense for users to use a BF16 mmproj file if possible, because I can observe differences between BF16 and FP16 for vision (especially for transcription tasks).
F16 is 8 times more precise than BF16, so every BF16 value can be represented exactly in F16 unless it exceeds the F16 range, which almost never happens unless the model is flawed, in which case llama.cpp errors out with inf/-inf as far as I'm aware. While I have not checked this particular model, it is very likely that all the numeric values are exactly the same and there would be no numeric difference between an F16 and a BF16 version, which would obviously make spotting any difference impossible. It seems far more likely that the llama.cpp implementation of the vision stack slightly deviates from the one you tried using transformers or vLLM.
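To make the range argument concrete, here is a minimal numpy sketch (not the conversion code llama.cpp actually uses) showing that BF16 values inside the normal F16 range cast exactly, while values above ~65504 overflow to inf:

```python
import numpy as np

def bf16_bits_to_f32(bits: np.ndarray) -> np.ndarray:
    # BF16 is simply the top 16 bits of an IEEE-754 float32.
    return (bits.astype(np.uint32) << 16).view(np.float32)

# Three example BF16 bit patterns: 1.0, a value near the F16 maximum, and the BF16 maximum.
vals = bf16_bits_to_f32(np.array([0x3F80, 0x477F, 0x7F7F], dtype=np.uint16))
print(vals)                     # [1.0, 65280.0, ~3.39e+38]
# The first two cast to F16 exactly; the BF16 maximum exceeds the F16 range
# (~65504) and overflows to inf (numpy may emit an overflow warning here).
print(vals.astype(np.float16))  # [1.0, 65280.0, inf]
```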
We will still keep your feedback in mind and might upload the source mmproj in the future once we implement support for the recently merged https://github.com/ggml-org/llama.cpp/pull/16592
Thanks for the reply, but I almost always stick to llama.cpp as my execution backend, so I am not referring to any vision implementation other than the one in llama.cpp. I can observe consistent differences in transcription output in LM Studio (sadly, I just realized it automatically shrinks all input images down to a pitiful size) with an A/B comparison of the FP16, BF16 and FP32 mmproj files. To me, it feels like BF16 gives output that is closer to FP32 (if my memory serves me right, exactly identical) than FP16 does.
I think it might not be so simple to say that FP16 is more precise than BF16, because the extra exponent bits of BF16 should allow it to represent smaller non-zero values than FP16, not just larger ones. BF16 uses an exponent bias/excess of 127, while FP16 seems to use a bias of 15, so the smallest possible exponent for BF16 is a much more negative number. I am not sure how subnormal values come into play, but I tried using the following online converter to explore the possible values: https://flop.evanau.dev/brainfloat-converter. It does seem that I can obtain a smaller non-zero number with BF16 by setting only the last mantissa bit to 1.
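For what it's worth, the same bit patterns can be checked with numpy instead of the web tool (a quick sketch assuming the standard IEEE-754-style layouts of the two formats):

```python
import numpy as np

# Bit pattern: sign 0, exponent field all zeros, only the last mantissa bit set.
bits = np.array([0x0001], dtype=np.uint16)

# BF16 occupies the top 16 bits of an IEEE-754 float32, so shift left and reinterpret.
bf16_val = (bits.astype(np.uint32) << 16).view(np.float32)
# FP16 maps directly onto numpy's float16.
fp16_val = bits.view(np.float16)

print(bf16_val)  # ~9.18e-41 (2**-133), smallest positive BF16 subnormal
print(fp16_val)  # ~5.96e-08 (2**-24),  smallest positive FP16 subnormal
```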
To me, it feels like BF16 gives output that is closer to FP32 (if my memory serves me right, exactly identical) than FP16 does.
nanonets/Nanonets-OCR2-3B is uploaded in BF16. FP32 models are rare, as nobody wants to spend twice the money training a model without any real benefit. Most go for BF16 for training stability (no risk of overflowing) unless they need the precision, in which case they go FP16. FP32 is total overkill for LLMs.
It does seem that I can obtain a smaller non-zero number with BF16 by setting only the last mantissa bit to 1.
Are you sure about that? I get the exact opposite result on the very same webpage you linked. Setting just the last mantissa bit to 1, I'm getting:
0.0078125 for BF16
0.0009765625 for FP16
So the smallest possible non-zero number for FP16 seems to be 8 times smaller than for BF16. This makes sense, as FP16 has a larger mantissa, offering 8 times more precision.
I wouldn't expect numbers so close to zero, or larger than 65504, to matter for LLMs, but maybe MMPROJ tensors are different. I might modify llama.cpp to print them to check what range of numbers we are dealing with; I have never really looked at an MMPROJ, but I don't expect the numeric range to be much different than for LLM tensors.
Are you sure about that? I get the exact opposite result on the very same webpage you linked. Setting just the last mantissa bit to 1, I'm getting:
0.0078125 for BF16
0.0009765625 for FP16
Yes, I am actually sure about that. The numbers you quoted are just the mantissa portion; you need to multiply them by the exponent factor (and apply the sign) to get the actual value. Setting the notation to scientific and then clicking and dragging to the far right of the "value stored" field (since the number is too long to fit visually), it shows e-41 for BF16.
Yet if I do the same for FP16, it shows e-8 instead.
This is because setting all the bits of the exponent field to 0 does not mean the exponent is actually 0: due to the exponent bias/excess, an all-zero field represents the smallest possible exponent (i.e. the negative value with the largest magnitude), which is the subnormal case.
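A small worked example of that decoding (the standard IEEE-754-style subnormal formula, which is also what the converter appears to show):

```python
# Subnormal encoding: value = mantissa_fraction * 2**(1 - bias)
# BF16: 7 mantissa bits, bias 127      FP16: 10 mantissa bits, bias 15
print(2**-7  * 2**(1 - 127))  # ~9.18e-41 -> the "e-41" shown for BF16
print(2**-10 * 2**(1 - 15))   # ~5.96e-08 -> the "e-8" shown for FP16
# Note that 2**-7 = 0.0078125 and 2**-10 = 0.0009765625 are exactly the
# mantissa-only numbers quoted earlier.
```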
I wouldn't expect numbers so close to zero, or larger than 65504, to matter for LLMs, but maybe MMPROJ tensors are different.
Yeah, I have a strong suspicion that the vision components are more sensitive than the text portion.
You are right. It makes sense, as BF16 has a far larger exponent range, so if there are numbers smaller than 0.000000059604644775390625 (the smallest FP16 subnormal) that actually matter, then BF16 is indeed the better choice. I have a hard time believing that such tiny numbers would matter, though. I'm quite curious now what kind of numbers are inside an MMPROJ. I will see if there is an easy way to have convert_hf_to_gguf.py print them.
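If it helps, the value ranges can also be read straight out of an existing mmproj GGUF with the gguf Python package that ships with llama.cpp. A rough sketch (the filename is just a placeholder, and quantized tensors are skipped):

```python
import numpy as np
from gguf import GGUFReader

reader = GGUFReader("mmproj-model-f16.gguf")  # placeholder path
for t in reader.tensors:
    data = np.asarray(t.data)
    if data.dtype.kind != "f":  # skip quantized tensors for this quick check
        continue
    a = np.abs(data.astype(np.float32))
    nonzero = a[a > 0]
    if nonzero.size:
        print(f"{t.name:48s} min|x|={nonzero.min():.3e}  max|x|={a.max():.3e}")
```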
Yeah, I have a strong suspicion that the vision components are more sensitive than the text portion.
It for sure is, which is why until yesterday you couldn't quant it further than Q8_0.
