---
base_model:
- meta-llama/Llama-3.1-8B-Instruct
library_name: transformers
license: llama3.1
pipeline_tag: text-generation
tags:
- int4
- vllm
- llmcompressor
---

# Llama-3.1-8B-Instruct-MR-GPTQ-nvfp

This model was presented in the paper [Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization](https://huggingface.co/papers/2509.23202). The official implementation code can be found on GitHub: [https://github.com/IST-DASLab/FP-Quant](https://github.com/IST-DASLab/FP-Quant).

## Model Overview

This model was obtained by quantizing the weights of [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) to the NVFP4 data type. This optimization reduces the number of bits per parameter from 16 to 4.5, cutting disk size and GPU memory requirements by approximately 72%.

## Usage

*MR-GPTQ*-quantized models with [QuTLASS](https://github.com/IST-DASLab/qutlass) kernels are supported in the following integrations:

- `transformers` with these features:
  - Available in `main` ([Documentation](https://huggingface.co/docs/transformers/main/en/quantization/fp_quant#fp-quant)).
  - RTN on-the-fly quantization (a sketch follows the inference examples below).
  - Pseudo-quantization QAT.
- `vLLM` with these features:
  - Available in [this PR](https://github.com/vllm-project/vllm/pull/24440).
  - Compatible with real-quantized models produced by `FP-Quant` and by the `transformers` integration.

### Example of quantized model inference with Hugging Face Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-nvfp"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# The quantization configuration ships with the checkpoint, so no extra arguments are needed.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

prompt = "Explain quantization for neural networks in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.inference_mode():
    output_tokens = model.generate(**inputs, max_new_tokens=150)

generated_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
print(generated_text)
```

### Example of quantized model inference with the vLLM engine

```python
from vllm import LLM, SamplingParams

model_name = "ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-nvfp"

llm = LLM(model=model_name, dtype="bfloat16", gpu_memory_utilization=0.9)

sampling_params = SamplingParams(
    temperature=0.7,  # creativity
    top_p=0.9,        # nucleus sampling
    max_tokens=150,   # number of new tokens to generate
)

prompt = "Explain quantization for neural networks in simple terms."

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
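### On-the-fly RTN quantization with Hugging Face Transformers (sketch)

In addition to loading the pre-quantized checkpoint as shown above, the `transformers` integration can quantize the BF16 base model on the fly with RTN. The snippet below is a minimal sketch of that path, assuming `FPQuantConfig`'s default settings; the defaults may target a different FP4 format than this checkpoint's NVFP4, so consult the [FP-Quant documentation](https://huggingface.co/docs/transformers/main/en/quantization/fp_quant) for the exact format and pseudo-quantization options.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, FPQuantConfig
import torch

# Quantize the BF16 base model on the fly with RTN.
# FPQuantConfig() is used with its default settings here (an assumption);
# see the FP-Quant documentation for the available format and mode options.
model_name = "meta-llama/Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=FPQuantConfig(),
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

prompt = "Explain quantization for neural networks in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.inference_mode():
    output_tokens = model.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(output_tokens[0], skip_special_tokens=True))
```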
## Evaluation

This model was evaluated on a subset of the OpenLLM v1 benchmarks and on Platinum bench. Model outputs were generated with the `vLLM` engine. Recovery denotes the quantized model's score relative to the BF16 baseline, expressed as a percentage.

*OpenLLM v1 results*

| Model | MMLU-CoT | GSM8k | Hellaswag | Winogrande | **Average** | **Recovery (%)** |
|-------|---------:|------:|----------:|-----------:|------------:|-----------------:|
| `meta-llama/Llama-3.1-8B-Instruct` | 0.7276 | 0.8506 | 0.8001 | 0.7790 | 0.7893 | – |
| `ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-nvfp` | 0.6917 | 0.8089 | 0.7850 | 0.7545 | 0.7600 | 96.29 |

*Platinum bench results*

Below we report recoveries on individual tasks as well as the average recovery.

**Recovery by Task**

| Task | Recovery (%) |
|------|-------------:|
| SingleOp | 100.00 |
| SingleQ | 98.99 |
| MultiArith | 99.41 |
| SVAMP | 97.54 |
| GSM8K | 96.64 |
| MMLU-Math | 92.43 |
| BBH-LogicalDeduction-3Obj | 87.34 |
| BBH-ObjectCounting | 98.80 |
| BBH-Navigate | 92.00 |
| TabFact | 86.92 |
| HotpotQA | 103.18 |
| SQuAD | 101.54 |
| DROP | 103.77 |
| Winograd-WSC | 89.47 |
| **Average** | **96.29** |

## Citation

If you find this project useful, please cite our paper:

```bibtex
@misc{egiazarian2025bridginggappromiseperformance,
  title={Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization},
  author={Vage Egiazarian and Roberto L. Castro and Denis Kuznedelev and Andrei Panferov and Eldar Kurtic and Shubhra Pandit and Alexandre Marques and Mark Kurtz and Saleh Ashkboos and Torsten Hoefler and Dan Alistarh},
  year={2025},
  eprint={2509.23202},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2509.23202},
}
```