nielsr (HF Staff) committed
Commit 39c2608 · verified · 1 parent: 4133e63

Add pipeline tag and library name; include extended description from GitHub README

This PR adds the `pipeline_tag` and `library_name` metadata to the model card, ensuring that the model is discoverable on the Hugging Face Hub.
It also includes a more detailed description from the GitHub README to give users a better understanding of the model, along with links to the paper and the GitHub repo.

Files changed (1)
  1. README.md +321 -1
README.md CHANGED
@@ -1,6 +1,326 @@
  ---
  license: mit
+ library_name: transformers
+ pipeline_tag: feature-extraction
  ---
+
+ # FlatQuant: Flatness Matters for LLM Quantization
+
+ [![arXiv](https://img.shields.io/badge/FlatQuant-2410.09426-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2410.09426)
+
  This repository contains the model zoo of [FlatQuant: Flatness Matters for LLM Quantization](https://arxiv.org/abs/2410.09426).

- The official code can be found at [https://github.com/ruikangliu/FlatQuant](https://github.com/ruikangliu/FlatQuant).
+ The official code can be found at [https://github.com/ruikangliu/FlatQuant](https://github.com/ruikangliu/FlatQuant).
+
+ ---
+
+ FlatQuant leverages Fast and Learnable Affine Transformations tailored to each linear layer to alleviate outliers in LLMs. As the name suggests, it also yields notably flat weight and activation distributions that are friendly to quantization. FlatQuant significantly improves quantization accuracy in low-bit settings (e.g., W4A4) while introducing little inference overhead, which may help promote the deployment of W4A4-quantized LLMs.
+
+ ![method](figures/FlatQuant.jpg)
+
+ ## News 🔥
+
+ - [2025/05] We now support **fake quantized inference** in **vLLM**.
+ - [2025/05] FlatQuant for **DeepSeek V3/R1** is now available!
+ - [2025/05] Our paper has been **accepted to ICML 2025**! 🎉
+ - [2024/11] Pre-trained transformation matrices of FlatQuant are now available at [modelzoo](#model-zoo).
+ - [2024/10] FlatQuant is **publicly released**! Check our paper [here](https://arxiv.org/abs/2410.09426).
+
+ ## Contents
+
+ - [Preparations](#preparations)
+ - [Usage](#usage)
+ - [Model Zoo](#model-zoo)
+ - [Results](#results)
+ - [Acknowledgements](#acknowledgements)
+ - [References](#references)
+
+ ## Preparations
+
+ ### Installation
+
+ ```bash
+ conda create -n flatquant python=3.10 -y
+ conda activate flatquant
+ pip install -r requirements.txt && pip install -e . && pip install triton==3.0.0
+ ```
+
+ **Note:** To run models like LLaMA-3.1 or Qwen-2.5, we use `transformers==4.45.0` instead.
+
+ ### Data Preparation
+
+ Download the datasets into `./datasets`; see the sketch after the table below for one way to do this.
+
+ **Calibration set or PPL evaluation**
+
+ | Dataset | Local Dir | URL |
+ | --------- | -------------------------- | --------------------------------------------------------------------------------------------------------------------- |
+ | WikiText2 | ./datasets/wikitext | [https://huggingface.co/datasets/wikitext](https://huggingface.co/datasets/wikitext) |
+ | C4 | ./datasets/allenai/c4 | [https://huggingface.co/datasets/allenai/c4](https://huggingface.co/datasets/allenai/c4) |
+ | Pile | ./datasets/pile-val-backup | [https://huggingface.co/datasets/mit-han-lab/pile-val-backup](https://huggingface.co/datasets/mit-han-lab/pile-val-backup) |
+
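+ As one way to fetch these datasets (and the models listed under Model Preparation) into the expected local directories, here is a minimal, untested sketch using `huggingface_hub`; the `allow_patterns` filter for C4 is an assumption you may need to adjust, since the full C4 snapshot is very large.
+
+ ```python
+ # Hypothetical helper (not part of the repo): download the calibration / PPL datasets.
+ from huggingface_hub import snapshot_download
+
+ DATASETS = {
+     "wikitext": "./datasets/wikitext",
+     "allenai/c4": "./datasets/allenai/c4",
+     "mit-han-lab/pile-val-backup": "./datasets/pile-val-backup",
+ }
+
+ for repo_id, local_dir in DATASETS.items():
+     snapshot_download(
+         repo_id=repo_id,
+         repo_type="dataset",
+         local_dir=local_dir,
+         # C4 is huge; restrict the download to the shards you actually need.
+         allow_patterns=["en/c4-validation*"] if repo_id == "allenai/c4" else None,
+     )
+ ```
+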
+ **Commonsense QA evaluation**
+
+ For QA evaluation, we use local config files to specify the paths to the local datasets. First, copy the dataset config files under `~/anaconda3/envs/flatquant/lib/python3.10/site-packages/lm_eval/tasks` to `./datasets/lm_eval_configs/tasks`. Next, modify the config item `dataset_path` in each QA dataset's config file to the local directory listed in the following table (see the sketch after the table for one way to automate this).
+
+ | Dataset | Local Dir | URL |
+ | --------------- | ------------------------- | ----------------------------------------------------------------------------------------------------------------- |
+ | ARC-E and ARC-C | ./datasets/ai2_arc | [https://huggingface.co/datasets/ai2_arc](https://huggingface.co/datasets/ai2_arc) |
+ | HellaSwag | ./datasets/hellaswag | [https://huggingface.co/datasets/hellaswag](https://huggingface.co/datasets/hellaswag) |
+ | LAMBADA | ./datasets/lambada_openai | [https://huggingface.co/datasets/EleutherAI/lambada_openai](https://huggingface.co/datasets/EleutherAI/lambada_openai) |
+ | PIQA | ./datasets/piqa | [https://huggingface.co/datasets/ybisk/piqa](https://huggingface.co/datasets/ybisk/piqa) |
+ | WinoGrande | ./datasets/winogrande | [https://huggingface.co/datasets/winogrande](https://huggingface.co/datasets/winogrande) |
+
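+ One way to automate the `dataset_path` edits is a plain text substitution over the copied config files, as in the untested sketch below; the repo-id-to-directory mapping is an assumption based on the table above, so adjust it to the configs you actually copied.
+
+ ```python
+ # Hypothetical helper: point the copied lm_eval task configs at local datasets.
+ from pathlib import Path
+
+ REPLACEMENTS = {
+     "dataset_path: ai2_arc": "dataset_path: ./datasets/ai2_arc",
+     "dataset_path: hellaswag": "dataset_path: ./datasets/hellaswag",
+     "dataset_path: EleutherAI/lambada_openai": "dataset_path: ./datasets/lambada_openai",
+     "dataset_path: ybisk/piqa": "dataset_path: ./datasets/piqa",
+     "dataset_path: winogrande": "dataset_path: ./datasets/winogrande",
+ }
+
+ for cfg in Path("./datasets/lm_eval_configs/tasks").rglob("*.yaml"):
+     text = cfg.read_text()
+     for old, new in REPLACEMENTS.items():
+         text = text.replace(old, new)
+     cfg.write_text(text)
+ ```
+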
+ ### Model Preparation
+
+ Download the models into `./modelzoo`.
+
+ | Model | Local Dir | URL |
+ | ----------- | ------------------------------ | --------------------------------------------------------------------------------------------------- |
+ | LLaMA-2-7B | ./modelzoo/llama-2/llama-2-7b | [https://huggingface.co/meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b) |
+ | LLaMA-2-13B | ./modelzoo/llama-2/llama-2-13b | [https://huggingface.co/meta-llama/Llama-2-13b](https://huggingface.co/meta-llama/Llama-2-13b) |
+ | LLaMA-2-70B | ./modelzoo/llama-2/llama-2-70b | [https://huggingface.co/meta-llama/Llama-2-70b](https://huggingface.co/meta-llama/Llama-2-70b) |
+ | LLaMA-3-8B | ./modelzoo/llama-3/llama-3-8b | [https://huggingface.co/meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) |
+ | LLaMA-3-70B | ./modelzoo/llama-3/llama-3-70b | [https://huggingface.co/meta-llama/Meta-Llama-3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B) |
+
+ ## Usage
+
+ ### Calibration
+
+ We provide full scripts to run FlatQuant in `./scripts/`. We use LLaMA-3-8B as an example here:
+
+ 1. Weight-Activation-KV Cache Quantization
+
+ ```bash
+ # W4A4KV4
+ python ./main.py \
+     --model ./modelzoo/llama-3/llama-3-8b \
+     --w_bits 4 --a_bits 4 \
+     --k_bits 4 --k_asym --k_groupsize 128 \
+     --v_bits 4 --v_asym --v_groupsize 128 \
+     --cali_bsz 4 --epoch 15 --flat_lr 5e-3 \
+     --lwc --lac --cali_trans --add_diag \
+     --output_dir ./outputs --save_matrix \
+     --lm_eval --lm_eval_batch_size 16
+ ```
+
+ 2. Weight-Only Quantization
+
+ ```bash
+ # W4A16
+ python ./main.py \
+     --model ./modelzoo/llama-3/llama-3-8b \
+     --w_bits 4 \
+     --cali_bsz 4 --epoch 15 --flat_lr 5e-3 \
+     --lwc --lac --cali_trans --add_diag \
+     --output_dir ./outputs --exp_name wonly --save_matrix \
+     --lm_eval --lm_eval_batch_size 16
+ ```
+
+ 3. Reproduce Evaluation Results of Our Paper
+
+ 1\) Download the pretrained FlatQuant parameters you want from the [model zoo](#model-zoo).
+
+ 2\) Run inference with `--reload_matrix` and `--matrix_path PATH_TO_XXX`. Take LLaMA-3-8B with W4A4KV4 quantization as an example:
+
+ ```bash
+ python ./main.py \
+     --model ./modelzoo/llama-3/llama-3-8b \
+     --w_bits 4 --a_bits 4 \
+     --k_bits 4 --k_asym --k_groupsize 128 \
+     --v_bits 4 --v_asym --v_groupsize 128 \
+     --cali_bsz 4 --epoch 15 --flat_lr 5e-3 \
+     --lwc --lac --cali_trans --add_diag \
+     --output_dir ./outputs --save_matrix \
+     --lm_eval --lm_eval_batch_size 16 \
+     --reload_matrix --matrix_path PATH_TO_XXX
+ ```
+
+ #### DeepSeek V3/R1 Scripts
+
+ For scripts related to **DeepSeek V3/R1**, see [`scripts/deepseek/`](./scripts/deepseek). Make sure to use the package versions specified in the DeepSeek V3 repository.
+
+ **Note**: We observed that the **last two layers of DeepSeek V3/R1** are difficult to quantize effectively.
+ Use the flag `--v3_not_last` to skip calibration for these layers during quantization.
+
+ ### Inference Latency
+
+ To measure the speedup of FlatQuant and our efficient kernel, run the corresponding benchmark commands below:
+
+ ```bash
+ # Run end-to-end latency benchmark
+ python ./benchmarks/layer_benchmark.py
+ ```
+
+ ```bash
+ # Run kernel latency benchmark
+ python ./benchmarks/kernel_benchmark.py
+ ```
+
+ ```bash
+ # Run linear layer latency benchmark
+ python ./benchmarks/qlinear_benchmark.py
+ ```
+
+ ```bash
+ # Run attention latency benchmark
+ python ./benchmarks/qattention_benchmark.py
+ ```
+
+ ### Apply to Other Models
+
+ To apply FlatQuant to your own models, some modifications are required in the model's forward pass, particularly within the attention and MLP modules. You can refer to [flatquant/model_tools](flatquant/model_tools) for our implementations for LLaMA-2, LLaMA-3, LLaMA-3.1, and Qwen-2.5; an illustrative sketch follows.
+
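+ As an illustration of the kind of change involved (this is **not** the repository's API; the class below is hypothetical and omits the Kronecker factorization, per-channel scaling, and real quantizers), a linear layer can be wrapped so that its input passes through a learnable invertible transform whose inverse is folded into the weight, leaving the full-precision output unchanged:
+
+ ```python
+ # Illustrative sketch only: a learnable invertible transform in front of a linear
+ # layer, with its inverse folded into the weight. Quantizers are stubbed out;
+ # FlatQuant's actual implementation lives in flatquant/model_tools.
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+ def fake_quant(t):  # placeholder for a real (de)quantization op
+     return t
+
+ class FlatLinear(nn.Module):
+     def __init__(self, linear: nn.Linear):
+         super().__init__()
+         self.linear = linear
+         self.trans = nn.Parameter(torch.eye(linear.in_features))  # init to identity
+
+     def forward(self, x):
+         inv = torch.linalg.inv(self.trans)
+         x_flat = fake_quant(x @ self.trans)              # flattened activations
+         w_flat = fake_quant(self.linear.weight @ inv.T)  # (x P) (W P^{-T})^T == x W^T
+         return F.linear(x_flat, w_flat, self.linear.bias)
+
+ # Hypothetical usage inside an attention module: attn.q_proj = FlatLinear(attn.q_proj)
+ ```
+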
+ ### Efficient Kernel
+
+ The detailed implementations of our efficient kernels can be found in [deploy/kernels/kron_matmul.py](deploy/kernels/kron_matmul.py) and [deploy/kernels/block_matmul.py](deploy/kernels/block_matmul.py). A conceptual sketch of the underlying Kronecker trick is shown below.
+
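+ As a conceptual sketch of the Kronecker trick these kernels build on (plain PyTorch here, not the Triton implementation): a transform decomposed as a Kronecker product of two small matrices is never materialized, but applied through two small matrix multiplications on a reshaped input.
+
+ ```python
+ # Check the identity: kron(P1, P2) @ x == (P1 @ X @ P2.T).flatten(), with X = x.reshape(m, n)
+ import torch
+
+ m, n = 8, 16                                      # hidden size d = m * n = 128
+ P1 = torch.randn(m, m, dtype=torch.float64)
+ P2 = torch.randn(n, n, dtype=torch.float64)
+ x = torch.randn(m * n, dtype=torch.float64)
+
+ dense = torch.kron(P1, P2) @ x                    # O(d^2) reference
+ fast = (P1 @ x.reshape(m, n) @ P2.T).reshape(-1)  # O(d * (m + n)) equivalent
+
+ print(torch.allclose(dense, fast))                # True
+ ```
+
+ With `m` and `n` near `sqrt(d)`, applying the transform costs roughly `2 * d * sqrt(d)` multiply-adds instead of `d^2`, which is why the online transforms add little overhead.
+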
+ ### Plot Flatness
+
+ Run the following command to plot the flatness of weights and activations after different pre-quantization transformations, including FlatQuant, the Hadamard transformation, and per-channel scaling. We use LLaMA-3-8B as an example here. The flag `--matrix_path` specifies the path to the pre-trained transformation matrices of FlatQuant.
+
+ ```bash
+ python ./plot_flatness.py \
+     --model ./modelzoo/llama-3/llama-3-8b \
+     --distribute_model --add_diag \
+     --matrix_path ./modelzoo/flatquant/llama-3-8b/w4a4
+ ```
+
+ ![flatness](figures/flatness.jpg)
+
+ ### Fake Quantization with vLLM
+
+ To enable fake-quantized inference with FlatQuant in a custom vLLM build, register the quantized models in your `inference.py` by adding:
+
+ ```python
+ from vllm_custom.model_executor.fake_quantized_models.registry import register_fake_quantized_models
+ register_fake_quantized_models()  # Register fake-quantized models in vLLM
+ ```
+
+ All code related to fake quantization is located in **`vllm_custom/`**.
+
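+ Once registered, generation can go through vLLM's usual offline-inference entry points. The sketch below is minimal and untested, and assumes the custom build keeps vLLM's standard `LLM` API; the FlatQuant-specific arguments (for example, where the transformation matrices are loaded from) are not shown, so check `vllm_custom/` for the actual entry points.
+
+ ```python
+ # Minimal sketch, assuming the custom build keeps vLLM's standard LLM API.
+ from vllm import LLM, SamplingParams
+ from vllm_custom.model_executor.fake_quantized_models.registry import register_fake_quantized_models
+
+ register_fake_quantized_models()
+ llm = LLM(model="./modelzoo/llama-3/llama-3-8b")  # model path from Model Preparation
+ outputs = llm.generate(["FlatQuant is"], SamplingParams(max_tokens=32))
+ print(outputs[0].outputs[0].text)
+ ```
+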
+ ## Model Zoo
+
+ We provide the pre-trained transformation matrices of FlatQuant at [https://huggingface.co/ruikangliu/FlatQuant](https://huggingface.co/ruikangliu/FlatQuant). The supported models are listed in the following table; a download sketch follows the table. For detailed implementations of each model, please refer to the code in `./flatquant/model_tools`.
+
+ | Model | W4A4KV4 | W4A16KV16 |
+ | ----------------- | ---------------- | --------- |
+ | LLaMA-2 | ✅ 7B / 13B / 70B | |
+ | LLaMA-3 | ✅ 8B / 70B | ✅ 8B |
+ | Qwen-2.5-Instruct | ✅ 7B / 32B | |
+ | DeepSeek | ✅ V3 / R1 | |
+
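+ A minimal download sketch (the subdirectory layout, e.g. `llama-3-8b/w4a4`, is assumed from the `--matrix_path` example above):
+
+ ```python
+ # Fetch the pre-trained FlatQuant transformation matrices from the Hub (sketch).
+ from huggingface_hub import snapshot_download
+
+ snapshot_download(repo_id="ruikangliu/FlatQuant", local_dir="./modelzoo/flatquant")
+ # then, e.g.: python ./main.py ... --reload_matrix --matrix_path ./modelzoo/flatquant/llama-3-8b/w4a4
+ ```
+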
+ ## Results
+
+ ### Accuracy Results
+
+ **Table 1: WikiText-2 perplexity of 4-bit weight & activation quantized LLaMA models.**
+
+ | **Method** | **W Quantizer** | **2-7B** | **2-13B** | **2-70B** | **3-8B** | **3-70B** |
+ | ------------- | --------------- | -------- | --------- | --------- | -------- | --------- |
+ | FP16 | - | 5.47 | 4.88 | 3.32 | 6.14 | 2.86 |
+ | SmoothQuant | RTN | 83.12 | 35.88 | 26.01 | 210.19 | 9.60 |
+ | OmniQuant | RTN | 14.74 | 12.28 | - | - | - |
+ | AffineQuant | RTN | 12.69 | 11.45 | - | - | - |
+ | QuaRot | RTN | 8.56 | 6.10 | 4.14 | 10.60 | 55.44 |
+ | SpinQuant | RTN | 6.14 | 5.44 | 3.82 | 7.96 | 7.58 |
+ | **FlatQuant** | RTN | **5.79** | **5.12** | **3.55** | **6.98** | **3.78** |
+ | QUIK-4B | GPTQ | 8.87 | 7.78 | 6.91 | - | - |
+ | QuaRot | GPTQ | 6.10 | 5.40 | 3.79 | 8.16 | 6.60 |
+ | SpinQuant | GPTQ | 5.96 | 5.24 | 3.70 | 7.39 | 6.21 |
+ | **FlatQuant** | GPTQ | **5.78** | **5.11** | **3.54** | **6.90** | **3.77** |
+
+ **Table 2: C4 perplexity of 4-bit weight & activation quantized LLaMA models.**
+
+ | **Method** | **W Quantizer** | **2-7B** | **2-13B** | **2-70B** | **3-8B** | **3-70B** |
+ | ------------- | --------------- | -------- | --------- | --------- | --------- | --------- |
+ | FP16 | - | 7.26 | 6.73 | 5.71 | 9.45 | 7.17 |
+ | SmoothQuant | RTN | 77.27 | 43.19 | 34.61 | 187.93 | 16.90 |
+ | OmniQuant | RTN | 21.40 | 16.24 | - | - | - |
+ | AffineQuant | RTN | 15.76 | 13.97 | - | - | - |
+ | QuaRot | RTN | 11.86 | 8.67 | 6.42 | 17.19 | 79.48 |
+ | SpinQuant | RTN | 9.19 | 8.11 | 6.26 | 13.45 | 15.39 |
+ | **FlatQuant** | RTN | **7.79** | **7.09** | **5.91** | **11.13** | **7.86** |
+ | QUIK-4B | GPTQ | - | - | - | - | - |
+ | QuaRot | GPTQ | 8.32 | 7.54 | 6.12 | 13.38 | 12.87 |
+ | SpinQuant | GPTQ | 8.28 | 7.48 | 6.07 | 12.19 | 12.82 |
+ | **FlatQuant** | GPTQ | **7.86** | **7.11** | **5.92** | **11.21** | **7.93** |
+
+ **Table 3: Zero-shot QA task results of 4-bit weight & activation quantized LLaMA models.**
+
+ | **Method** | **W Quantizer** | **2-7B** | **2-13B** | **2-70B** | **3-8B** | **3-70B** |
+ | ------------- | --------------- | --------- | --------- | --------- | --------- | --------- |
+ | FP16 | - | 69.79 | 72.55 | 77.05 | 73.23 | 79.95 |
+ | QuaRot | RTN | 57.73 | 66.25 | 73.47 | 61.34 | 35.36 |
+ | SpinQuant | RTN | 63.52 | 68.56 | 75.09 | 66.98 | 65.66 |
+ | **FlatQuant** | RTN | **67.96** | **71.42** | **76.62** | **71.23** | **79.01** |
+ | QuaRot | GPTQ | 65.01 | 68.91 | 75.68 | 65.79 | 70.45 |
+ | SpinQuant | GPTQ | 66.23 | 70.93 | 76.06 | 68.70 | 71.66 |
+ | **FlatQuant** | GPTQ | **67.47** | **71.64** | **76.53** | **71.33** | **78.58** |
+
+ **Table 4: Results of 4-bit weight & activation quantized Qwen-2.5-Instruct models.**
+
+ | **Method** | **W Quantizer** | **7B PPL (WikiText-2 / C4)** | **7B QA Avg.** | **32B PPL (WikiText-2 / C4)** | **32B QA Avg.** |
+ | ------------- | --------------- | ---------------------------- | -------------- | ----------------------------- | --------------- |
+ | BF16 | - | 8.36 / 14.37 | 70.75 | 5.32 / 10.45 | 75.10 |
+ | QuaRot | RTN | - | - | 6.95 / 12.17 | 70.24 |
+ | QuaRot | GPTQ | - | - | 6.54 / 11.65 | 72.25 |
+ | **FlatQuant** | RTN | **8.46 / 13.94** | **68.62** | **5.80 / 10.86** | **74.89** |
+
+ **Table 5: Results of 4-bit weight & activation quantized DeepSeek V3/R1 models.**
+
+ | **Model** | **Quantization** | **C-Eval** | **MMLU** | **AIME2024** |
+ | ---------------- | ---------------- | ---------- | -------- | ------------ |
+ | DeepSeek V3-Base | FP8 | 90.10 | 87.10 | - |
+ | | W4A4 | 89.59 | 86.32 | - |
+ | DeepSeek R1 | FP8 | - | - | 79.8 |
+ | | W4A4 | - | - | 73.3 |
+
+ ### Latency Results
+
+ **Table 6: Prefill speedup of LLaMA-2-7B model across different batch sizes on one RTX3090 GPU. We decode 256 tokens after the prefill on a sequence length of 2048.**
+
+ | **Batch Size** | **Int4** | **QuaRot** | **FlatQuant** |
+ | -------------- | -------- | ---------- | ------------- |
+ | 1 | 2.17 | 1.97 | 2.12 |
+ | 2 | 2.21 | 1.99 | 2.16 |
+ | 4 | 2.25 | 2.04 | 2.21 |
+ | 8 | 2.28 | 2.05 | 2.23 |
+ | 16 | 2.32 | 2.08 | 2.27 |
+ | 32 | 2.35 | 2.09 | 2.28 |
+ | 64 | 2.37 | 2.11 | 2.30 |
+
+ **Table 7: Decoding speedup of LLaMA-2-7B model across different batch sizes on one RTX3090 GPU. We decode 256 tokens after the prefill on a sequence length of 2048.**
+
+ | **Batch Size** | **Int4** | **QuaRot** | **FlatQuant** |
+ | -------------- | -------- | ---------- | ------------- |
+ | 1 | 0.81 | 0.70 | 0.71 |
+ | 2 | 0.78 | 0.66 | 0.69 |
+ | 4 | 0.82 | 0.74 | 0.73 |
+ | 8 | 0.97 | 0.83 | 0.83 |
+ | 16 | 1.18 | 1.01 | 1.05 |
+ | 32 | 1.50 | 1.38 | 1.43 |
+ | 64 | 1.83 | 1.75 | 1.76 |
+
+ ## Acknowledgements
+
+ This project is based on the following projects:
+
+ - [QuaRot](https://github.com/spcl/QuaRot)
+ - [OmniQuant](https://github.com/OpenGVLab/OmniQuant)
+ - [IntactKV](https://github.com/ruikangliu/IntactKV)
+
+ We are grateful for the contributions of these projects.
+
+ ## References
+
+ If you find FlatQuant helpful, please cite our paper:
+
+ ```bibtex
+ @article{sun2024flatquant,
+   title={FlatQuant: Flatness Matters for LLM Quantization},
+   author={Sun, Yuxuan and Liu, Ruikang and Bai, Haoli and Bao, Han and Zhao, Kang and Li, Yuening and Hu, Jiaxin and Yu, Xianzhi and Hou, Lu and Yuan, Chun and others},
+   journal={arXiv preprint arXiv:2410.09426},
+   year={2024}
+ }
+ ```