Improve metadata and add usage examples, paper/code links, and citation

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +84 -5
README.md CHANGED
@@ -1,17 +1,21 @@
  ---
- license: llama3.1
+ base_model:
+ - meta-llama/Llama-3.1-8B-Instruct
  library_name: transformers
- pipeline_tag: image-text-to-text
+ license: llama3.1
+ pipeline_tag: text-generation
  tags:
  - int4
  - vllm
  - llmcompressor
  ---

  # Llama-3.1-8B-Instruct-MR-GPTQ-nvfp

+ This model was presented in the paper [Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization](https://huggingface.co/papers/2509.23202).
+
+ The official implementation code can be found on GitHub: [https://github.com/IST-DASLab/FP-Quant](https://github.com/IST-DASLab/FP-Quant).
+
  ## Model Overview

  This model was obtained by quantizing the weights of [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) to the NVFP4 data type. This optimization reduces the number of bits per parameter from 16 to 4.5, reducing the disk size and GPU memory requirements by approximately 72%.
@@ -27,6 +31,67 @@ This model was obtained by quantizing the weights of [Llama-3.1-8B-Instruct](htt
  - Available in [this PR](https://github.com/vllm-project/vllm/pull/24440).
  - Compatible with real quantization models from `FP-Quant` and the `transformers` integration.

+ ### Example of quantized model inference with Hugging Face Transformers
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ model_name = "ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-nvfp"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     device_map="cuda",
+     torch_dtype=torch.bfloat16,  # dtype for activations; weights remain in the quantized NVFP4 format
+ )
+ prompt = "Explain quantization for neural networks in simple terms."
+ inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
+ with torch.inference_mode():
+     output_tokens = model.generate(**inputs, max_new_tokens=150)
+ generated_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
+ print(generated_text)
+ ```
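+ Since this is an instruction-tuned model, chat-style prompts are best formatted with the tokenizer's chat template rather than passed as raw text. A minimal sketch, reusing the model and tokenizer loaded above (prompt and generation settings are illustrative):
+ ```python
+ messages = [{"role": "user", "content": "Explain quantization for neural networks in simple terms."}]
+ # apply_chat_template inserts the Llama 3.1 chat special tokens and the assistant header
+ input_ids = tokenizer.apply_chat_template(
+     messages, add_generation_prompt=True, return_tensors="pt"
+ ).to(model.device)
+ with torch.inference_mode():
+     output_tokens = model.generate(input_ids, max_new_tokens=150)
+ # decode only the newly generated tokens
+ print(tokenizer.decode(output_tokens[0][input_ids.shape[-1]:], skip_special_tokens=True))
+ ```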
+ ### Example of quantized model inference with the vLLM engine
+ ```python
+ from vllm import LLM, SamplingParams
+
+ model_name = "ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-nvfp"
+ llm = LLM(model=model_name, dtype="bfloat16", gpu_memory_utilization=0.9)
+ sampling_params = SamplingParams(
+     temperature=0.7,  # creativity
+     top_p=0.9,        # nucleus sampling
+     max_tokens=150,   # number of new tokens to generate
+ )
+ prompt = "Explain quantization for neural networks in simple terms."
+ outputs = llm.generate([prompt], sampling_params)
+ print(outputs[0].outputs[0].text)
+ ```
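+ The model can also be served through vLLM's OpenAI-compatible server. A minimal sketch, assuming the server was started with `vllm serve ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-nvfp` on the default port, a vLLM build that includes the NVFP4 support referenced above, and the `openai` Python client:
+ ```python
+ from openai import OpenAI
+
+ # Any non-empty API key works for a local vLLM server.
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+ response = client.chat.completions.create(
+     model="ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-nvfp",
+     messages=[{"role": "user", "content": "Explain quantization for neural networks in simple terms."}],
+     max_tokens=150,
+ )
+ print(response.choices[0].message.content)
+ ```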
+
  ## Evaluation

  This model was evaluated on a subset of OpenLLM v1 benchmarks and Platinum bench. Model outputs were generated with the `vLLM` engine.
@@ -45,7 +110,7 @@ Below we report recoveries on individual tasks as well as the average recovery.
  **Recovery by Task**

  | Task | Recovery (%) |
- |------------------------------------|--------------:|
+ |------------------------------------|--------------:|
  | SingleOp | 100.00 |
  | SingleQ | 98.99 |
  | MultiArith | 99.41 |
@@ -62,3 +127,17 @@ Below we report recoveries on individual tasks as well as the average recovery.
  | Winograd‑WSC | 89.47 |
  | **Average** | **96.29** |

+ ## Citation
+ If you find this project useful, please cite our paper:
+
+ ```bibtex
+ @misc{egiazarian2025bridginggappromiseperformance,
+       title={Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization},
+       author={Vage Egiazarian and Roberto L. Castro and Denis Kuznedelev and Andrei Panferov and Eldar Kurtic and Shubhra Pandit and Alexandre Marques and Mark Kurtz and Saleh Ashkboos and Torsten Hoefler and Dan Alistarh},
+       year={2025},
+       eprint={2509.23202},
+       archivePrefix={arXiv},
+       primaryClass={cs.LG},
+       url={https://arxiv.org/abs/2509.23202},
+ }
+ ```