How to make MXFP4 Quants

#1
by DatToad - opened

Thanks for all your efforts on MXFP4 MLX quants. Do you have a script for making the quants, or commands I could use? I wouldn't mind making updated versions of other models.

Hey @DatToad , happy to help! You can actually do this in a one-liner. This is the exact command I ran to quantize and upload this model: `mlx_lm.convert --hf-path "zerofata/GLM-4.5-Iceblink-v2-106B-A12B" -q --q-mode mxfp4 --q-group-size 32 --upload-repo "beezu/zerofata_GLM-4.5-Iceblink-v2-106B-A12B-MLX-MXFP4"`
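Here's the same invocation split across lines with the flags annotated, as a sketch. It assumes `mlx-lm` is installed and that you're logged in to the Hub for the upload step; if memory serves, omitting `--upload-repo` keeps the result local instead of pushing it.

```shell
# Assumes mlx-lm is installed (pip install mlx-lm) and you're logged in to
# Hugging Face (huggingface-cli login) so the upload step can push.
# -q enables quantization, --q-mode mxfp4 selects the MXFP4 format, and
# --q-group-size 32 sets the quantization group size.
# Drop the --upload-repo line to keep the quant local (it lands in
# ./mlx_model by default, I believe).
mlx_lm.convert \
  --hf-path "zerofata/GLM-4.5-Iceblink-v2-106B-A12B" \
  -q \
  --q-mode mxfp4 \
  --q-group-size 32 \
  --upload-repo "beezu/zerofata_GLM-4.5-Iceblink-v2-106B-A12B-MLX-MXFP4"
```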

Note that this is more RAM- and GPU-intensive than normal quants, and it will take longer as well. I'm on a 128GB M3 Max with the VRAM limit set to 116GB, and I still filled the VRAM and used about 10GB of swap at points during this model's quantization. I hit GPU timeouts (which failed the run mid-quantization) a couple of times, likely due to slow swap speeds. What ultimately worked for me was rebooting my Mac, launching the terminal and nothing else, and starting the quantization. Those steps are probably overkill for making MXFP4 quants of smaller models, but I figured I'd mention them to save you some headache if you run into the same errors.
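For reference, the VRAM limit mentioned above can be raised on Apple Silicon macOS with a `sysctl`. The setting resets on reboot, and the exact value here is my assumption for a 116GB cap (116 × 1024 MB), not a number from the post:

```shell
# Raise the GPU wired-memory limit on Apple Silicon macOS.
# 116 GB expressed in MB: 116 * 1024 = 118784.
# This resets on reboot, so run it before starting the quantization.
sudo sysctl iogpu.wired_limit_mb=118784
```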
