---
base_model:
- meta-llama/Llama-3.1-8B-Instruct
library_name: transformers
license: llama3.1
pipeline_tag: text-generation
tags:
- int4
- vllm
- llmcompressor
---

# Llama-3.1-8B-Instruct-MR-GPTQ-nvfp

This model was presented in the paper [Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization](https://huggingface.co/papers/2509.23202). The official implementation code can be found on GitHub: [https://github.com/IST-DASLab/FP-Quant](https://github.com/IST-DASLab/FP-Quant).

## Model Overview

This model was obtained by quantizing the weights of [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) to the NVFP4 data type. This optimization reduces the number of bits per parameter from 16 to 4.5, cutting disk size and GPU memory requirements by approximately 72%.

## Usage

*MR-GPTQ*-quantized models with [QuTLASS](https://github.com/IST-DASLab/qutlass) kernels are supported in the following integrations:

- `transformers` with these features:
  - Available in `main` ([Documentation](https://huggingface.co/docs/transformers/main/en/quantization/fp_quant#fp-quant)).
  - RTN on-the-fly quantization (a sketch follows the inference examples below).
  - Pseudo-quantization QAT.
- `vLLM` with these features:
  - Available in [this PR](https://github.com/vllm-project/vllm/pull/24440).
  - Compatible with real-quantized models produced by `FP-Quant` and by the `transformers` integration.

### Example of quantized model inference with Hugging Face Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-nvfp"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# The quantization configuration ships with the checkpoint, so no extra arguments are needed.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

prompt = "Explain quantization for neural networks in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.inference_mode():
    output_tokens = model.generate(**inputs, max_new_tokens=150)

generated_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
print(generated_text)
```

### Example of quantized model inference with the vLLM engine

```python
from vllm import LLM, SamplingParams

model_name = "ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-nvfp"

llm = LLM(model=model_name, dtype="bfloat16", gpu_memory_utilization=0.9)

sampling_params = SamplingParams(
    temperature=0.7,  # creativity
    top_p=0.9,        # nucleus sampling
    max_tokens=150,   # number of new tokens to generate
)

prompt = "Explain quantization for neural networks in simple terms."

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
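### On-the-fly RTN quantization with Hugging Face Transformers (sketch)

In addition to loading the pre-quantized checkpoint as shown above, the `transformers` integration can quantize the BF16 base model on the fly with RTN. The snippet below is a minimal sketch of that path, assuming `FPQuantConfig`'s default settings; the defaults may target a different FP4 format than this checkpoint's NVFP4, so consult the [FP-Quant documentation](https://huggingface.co/docs/transformers/main/en/quantization/fp_quant) for the exact format and pseudo-quantization options.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, FPQuantConfig
import torch

# Quantize the BF16 base model on the fly with RTN.
# FPQuantConfig() is used with its default settings here (an assumption);
# see the FP-Quant documentation for the available format and mode options.
model_name = "meta-llama/Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=FPQuantConfig(),
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

prompt = "Explain quantization for neural networks in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.inference_mode():
    output_tokens = model.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(output_tokens[0], skip_special_tokens=True))
```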
## Evaluation

This model was evaluated on a subset of the OpenLLM v1 benchmarks and on Platinum bench. Model outputs were generated with the `vLLM` engine. Recovery denotes the quantized model's score relative to the BF16 baseline, expressed as a percentage.

*OpenLLM v1 results*

| Model | MMLU-CoT | GSM8k | Hellaswag | Winogrande | **Average** | **Recovery (%)** |
|-------|---------:|------:|----------:|-----------:|------------:|-----------------:|
| `meta-llama/Llama-3.1-8B-Instruct` | 0.7276 | 0.8506 | 0.8001 | 0.7790 | 0.7893 | – |
| `ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-nvfp` | 0.6917 | 0.8089 | 0.7850 | 0.7545 | 0.7600 | 96.29 |

*Platinum bench results*

Below we report recoveries on individual tasks as well as the average recovery.

**Recovery by Task**

| Task | Recovery (%) |
|------|-------------:|
| SingleOp | 100.00 |
| SingleQ | 98.99 |
| MultiArith | 99.41 |
| SVAMP | 97.54 |
| GSM8K | 96.64 |
| MMLU-Math | 92.43 |
| BBH-LogicalDeduction-3Obj | 87.34 |
| BBH-ObjectCounting | 98.80 |
| BBH-Navigate | 92.00 |
| TabFact | 86.92 |
| HotpotQA | 103.18 |
| SQuAD | 101.54 |
| DROP | 103.77 |
| Winograd-WSC | 89.47 |
| **Average** | **96.29** |

## Citation

If you find this project useful, please cite our paper:

```bibtex
@misc{egiazarian2025bridginggappromiseperformance,
  title={Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization},
  author={Vage Egiazarian and Roberto L. Castro and Denis Kuznedelev and Andrei Panferov and Eldar Kurtic and Shubhra Pandit and Alexandre Marques and Mark Kurtz and Saleh Ashkboos and Torsten Hoefler and Dan Alistarh},
  year={2025},
  eprint={2509.23202},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2509.23202},
}
```