arxiv:2404.14047

How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study

Published on Apr 22, 2024 · Submitted by AK on Apr 23, 2024
#2 Paper of the day

Abstract

LLaMA3 performance degrades significantly when quantized to low bit-width, highlighting challenges in low-bit quantization for large language models.

AI-generated summary

Meta's LLaMA family has become one of the most powerful open-source Large Language Model (LLM) series. Notably, LLaMA3 models have recently been released and achieve impressive performance across various tasks with super-large-scale pre-training on over 15T tokens of data. Given the wide application of low-bit quantization for LLMs in resource-limited scenarios, we explore LLaMA3's capabilities when quantized to low bit-width. This exploration has the potential to unveil new insights and challenges for low-bit quantization of LLaMA3 and other forthcoming LLMs, especially in addressing the performance degradation that LLM compression suffers from. Specifically, we evaluate 10 existing post-training quantization and LoRA-finetuning methods on LLaMA3 at 1-8 bits and on diverse datasets to comprehensively reveal LLaMA3's low-bit quantization performance. Our experimental results indicate that LLaMA3 still suffers non-negligible degradation in these scenarios, especially at ultra-low bit-widths. This highlights the significant performance gap under low bit-width that needs to be bridged in future developments. We expect that this empirical study will prove valuable in advancing future models, pushing LLMs to lower bit-widths with higher accuracy so that they are practical. Our project is released at https://github.com/Macaronlin/LLaMA3-Quantization and quantized LLaMA3 models are released at https://huggingface.co/LLMQ.
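To give a rough feel for why accuracy drops as bit-width shrinks, here is a minimal sketch of symmetric round-to-nearest quantization applied to a synthetic weight vector. It is an illustration only, not the paper's code: the function names, the per-tensor scaling, and the Gaussian stand-in weights are all assumptions, and the methods the paper evaluates are considerably more sophisticated than this baseline.

```python
import random


def quantize_rtn(weights, bits):
    """Symmetric per-tensor round-to-nearest quantization (illustrative baseline).

    Returns the dequantized floats so the rounding error can be measured directly.
    """
    if bits < 2:
        raise ValueError("this symmetric sketch needs at least 2 bits")
    qmax = 2 ** (bits - 1) - 1          # largest integer level, e.g. 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    scale = max_abs / qmax              # float value represented by one integer step
    dequantized = []
    for w in weights:
        q = round(w / scale)            # nearest integer level
        q = max(-qmax, min(qmax, q))    # clamp to the representable range
        dequantized.append(q * scale)
    return dequantized


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic stand-in for a weight tensor (assumption: roughly Gaussian values).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

errors = {
    bits: mean_squared_error(weights, quantize_rtn(weights, bits))
    for bits in (8, 4, 3, 2)
}
for bits, err in errors.items():
    print(f"{bits}-bit reconstruction MSE: {err:.3e}")
```

In this toy setting the reconstruction error grows at each step down from 8 to 2 bits, because the same value range has to be covered by far fewer levels. Weight error is only a proxy, though; the paper measures the effect on actual task performance.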

Community

I don't get why they used QuIP instead of QuIP#; it has been around for quite a while.

Paper author

Thanks for your attention and kind reminder! Due to time constraints, some quantization methods could not be fully evaluated before this preprint. We have not forgotten them; more work is on the way! 😊

They should of course also include the imatrix method of llama.cpp. How could they miss that?

For 2-bit Mistral Instruct IQ2_XS, compare the fp16 numbers with the imatrix numbers for MMLU and HellaSwag:

https://github.com/ggerganov/llama.cpp/discussions/5263

We're talking about a 7B model here.


Great work!
May I ask which library you used to get W8A16 AWQ quantization? As far as I know, AutoAWQ and llm-awq only support 4-bit quantization.

