Improve metadata and add usage examples, paper/code links, and citation

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +84 -5
README.md CHANGED
@@ -1,17 +1,21 @@
  ---
- license: llama3.1
+ base_model:
+ - meta-llama/Llama-3.1-8B-Instruct
  library_name: transformers
- pipeline_tag: image-text-to-text
+ license: llama3.1
+ pipeline_tag: text-generation
  tags:
  - int4
  - vllm
  - llmcompressor
  ---

  # Llama-3.1-8B-Instruct-MR-GPTQ-nvfp

+ This model was presented in the paper [Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization](https://huggingface.co/papers/2509.23202).
+
+ The official implementation code can be found on GitHub: [https://github.com/IST-DASLab/FP-Quant](https://github.com/IST-DASLab/FP-Quant).
+
  ## Model Overview

  This model was obtained by quantizing the weights of [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) to the NVFP4 data type. This optimization reduces the number of bits per parameter from 16 to 4.5, reducing the disk size and GPU memory requirements by approximately 72%.
@@ -27,6 +31,67 @@ This model was obtained by quantizing the weights of [Llama-3.1-8B-Instruct](htt
  - Available in [this PR](https://github.com/vllm-project/vllm/pull/24440).
  - Compatible with real quantization models from `FP-Quant` and the `transformers` integration.

+ ### Example of quantized model inference with Hugging Face Transformers
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ model_name = "ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-nvfp"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     device_map="cuda",
+     torch_dtype=torch.bfloat16,  # dtype for activations; weights remain in the quantized NVFP4 format
+ )
+ prompt = "Explain quantization for neural networks in simple terms."
+ inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
+ with torch.inference_mode():
+     output_tokens = model.generate(**inputs, max_new_tokens=150)
+ generated_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
+ print(generated_text)
+ ```
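+ Since this is an instruction-tuned model, chat-style prompts are best formatted with the tokenizer's chat template rather than passed as raw text. A minimal sketch, reusing the model and tokenizer loaded above (prompt and generation settings are illustrative):
+ ```python
+ messages = [{"role": "user", "content": "Explain quantization for neural networks in simple terms."}]
+ # apply_chat_template inserts the Llama 3.1 chat special tokens and the assistant header
+ input_ids = tokenizer.apply_chat_template(
+     messages, add_generation_prompt=True, return_tensors="pt"
+ ).to(model.device)
+ with torch.inference_mode():
+     output_tokens = model.generate(input_ids, max_new_tokens=150)
+ # decode only the newly generated tokens
+ print(tokenizer.decode(output_tokens[0][input_ids.shape[-1]:], skip_special_tokens=True))
+ ```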
+ ### Example of quantized model inference with the vLLM engine
+ ```python
+ from vllm import LLM, SamplingParams
+
+ model_name = "ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-nvfp"
+ llm = LLM(model=model_name, dtype="bfloat16", gpu_memory_utilization=0.9)
+ sampling_params = SamplingParams(
+     temperature=0.7,  # creativity
+     top_p=0.9,        # nucleus sampling
+     max_tokens=150,   # number of new tokens to generate
+ )
+ prompt = "Explain quantization for neural networks in simple terms."
+ outputs = llm.generate([prompt], sampling_params)
+ print(outputs[0].outputs[0].text)
+ ```
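+ The model can also be served through vLLM's OpenAI-compatible server. A minimal sketch, assuming the server was started with `vllm serve ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-nvfp` on the default port, a vLLM build that includes the NVFP4 support referenced above, and the `openai` Python client:
+ ```python
+ from openai import OpenAI
+
+ # Any non-empty API key works for a local vLLM server.
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+ response = client.chat.completions.create(
+     model="ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-nvfp",
+     messages=[{"role": "user", "content": "Explain quantization for neural networks in simple terms."}],
+     max_tokens=150,
+ )
+ print(response.choices[0].message.content)
+ ```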
+
  ## Evaluation

  This model was evaluated on a subset of OpenLLM v1 benchmarks and Platinum bench. Model outputs were generated with the `vLLM` engine.
@@ -45,7 +110,7 @@ Below we report recoveries on individual tasks as well as the average recovery.
  **Recovery by Task**

  | Task | Recovery (%) |
- |------------------------------------|--------------:|
+ |------------------------------------|--------------:|
  | SingleOp | 100.00 |
  | SingleQ | 98.99 |
  | MultiArith | 99.41 |
@@ -62,3 +127,17 @@ Below we report recoveries on individual tasks as well as the average recovery.
  | Winograd‑WSC | 89.47 |
  | **Average** | **96.29** |

+ ## Citation
+ If you find this project useful, please cite our paper:
+
+ ```bibtex
+ @misc{egiazarian2025bridginggappromiseperformance,
+       title={Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization},
+       author={Vage Egiazarian and Roberto L. Castro and Denis Kuznedelev and Andrei Panferov and Eldar Kurtic and Shubhra Pandit and Alexandre Marques and Mark Kurtz and Saleh Ashkboos and Torsten Hoefler and Dan Alistarh},
+       year={2025},
+       eprint={2509.23202},
+       archivePrefix={arXiv},
+       primaryClass={cs.LG},
+       url={https://arxiv.org/abs/2509.23202},
+ }
+ ```