Update README.md

README.md CHANGED

@@ -35,7 +35,8 @@ In order to use the current quantized model, support is offered for different so
 In order to run the inference with Gemma2 9B Instruct AWQ in INT4, you need to install the following packages:
 
 ```bash
-pip install -q --upgrade transformers
+pip install -q --upgrade "transformers>=4.45.0" accelerate
+INSTALL_KERNELS=1 pip install -q git+https://github.com/casper-hansen/AutoAWQ.git@79547665bdb27768a9b392ef375776b020acbf0c
 ```
 
 To run the inference on top of Gemma2 9B Instruct AWQ in INT4 precision, the AWQ model can be instantiated as any other causal language modeling model via `AutoModelForCausalLM` and run the inference normally.
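
For reference, inference with `AutoModelForCausalLM` on this model could look roughly like the sketch below. This is not the snippet from the README itself; the prompt, generation settings, and dtype/device choices are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hugging-quants/gemma-2-9b-it-AWQ-INT4"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision, as is typical for AWQ checkpoints
    device_map="auto",
)

# Gemma2 is instruction-tuned, so build the prompt with the chat template
messages = [{"role": "user", "content": "What is AWQ quantization in one sentence?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0, input_ids.shape[1]:], skip_special_tokens=True))
```
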
@@ -81,7 +82,8 @@ print(tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_spe
 In order to run the inference with Gemma2 9B Instruct AWQ in INT4, you need to install the following packages:
 
 ```bash
-pip install -q --upgrade transformers
+pip install -q --upgrade "transformers>=4.45.0" accelerate
+INSTALL_KERNELS=1 pip install -q git+https://github.com/casper-hansen/AutoAWQ.git@79547665bdb27768a9b392ef375776b020acbf0c
 ```
 
 Alternatively, one may want to run that via `AutoAWQ` even though it's built on top of 🤗 `transformers`, which is the recommended approach instead as described above.
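
For the `AutoAWQ` route mentioned above, a rough sketch is shown below. It is not the README's exact example (that one is adapted from `AutoAWQ/examples/generate.py`); `fuse_layers=False` and the prompt are illustrative assumptions.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "hugging-quants/gemma-2-9b-it-AWQ-INT4"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# fuse_layers is disabled here because fused-kernel support varies by architecture
model = AutoAWQForCausalLM.from_quantized(model_id, fuse_layers=False)

messages = [{"role": "user", "content": "Explain INT4 weight quantization briefly."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

# AutoAWQ forwards generate() to the underlying 🤗 transformers model
outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0, input_ids.shape[1]:], skip_special_tokens=True))
```
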
@@ -119,24 +121,18 @@ The AutoAWQ script has been adapted from [`AutoAWQ/examples/generate.py`](https:
 
 ### 🤗 Text Generation Inference (TGI)
 
-To run the `text-generation-launcher` with Gemma2 9B Instruct AWQ in INT4 with Marlin kernels for optimized inference speed, you will need to have Docker installed (see [installation notes](https://docs.docker.com/engine/install/))
-
-```bash
-pip install -q --upgrade huggingface_hub
-huggingface-cli login
-```
+To run the `text-generation-launcher` with Gemma2 9B Instruct AWQ in INT4 with Marlin kernels for optimized inference speed, you will need to have Docker installed (see [installation notes](https://docs.docker.com/engine/install/)).
 
-Then you just need to run the TGI v2.
+Then you just need to run the TGI v2.3.0 (or higher) Docker container as follows:
 
 ```bash
 docker run --gpus all --shm-size 1g -ti -p 8080:80 \
     -v hf_cache:/data \
     -e MODEL_ID=hugging-quants/gemma-2-9b-it-AWQ-INT4 \
     -e QUANTIZE=awq \
-    -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
     -e MAX_INPUT_LENGTH=4000 \
     -e MAX_TOTAL_TOKENS=4096 \
-    ghcr.io/huggingface/text-generation-inference:2.
+    ghcr.io/huggingface/text-generation-inference:2.3.0
 ```
 
 > [!NOTE]
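
As a quick smoke test of the container started above (not part of the diff itself), the OpenAI-compatible `/v1/chat/completions` route exposed by TGI can be called directly; the use of `requests`, the prompt, and the `max_tokens` value are illustrative assumptions.

```python
import requests

# Query the locally running TGI container via its OpenAI-compatible chat route
response = requests.post(
    "http://0.0.0.0:8080/v1/chat/completions",
    json={
        "model": "tgi",
        "messages": [{"role": "user", "content": "What is quantization?"}],
        "max_tokens": 128,
    },
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])
```
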
@@ -166,7 +162,7 @@ Or programatically via the `huggingface_hub` Python client as follows:
 import os
 from huggingface_hub import InferenceClient
 
-client = InferenceClient(base_url="http://0.0.0.0:8080", api_key=
+client = InferenceClient(base_url="http://0.0.0.0:8080", api_key="-")
 
 chat_completion = client.chat.completions.create(
     model="hugging-quants/gemma-2-9b-it-AWQ-INT4",
@@ -183,7 +179,7 @@ Alternatively, the OpenAI Python client can also be used (see [installation note
 import os
 from openai import OpenAI
 
-client = OpenAI(base_url="http://0.0.0.0:8080/v1", api_key=
+client = OpenAI(base_url="http://0.0.0.0:8080/v1", api_key="-")
 
 chat_completion = client.chat.completions.create(
     model="tgi",
@@ -243,16 +239,25 @@ chat_completion = client.chat.completions.create(
 
 ## Quantization Reproduction
 
-> [!
+> [!IMPORTANT]
 > In order to quantize Gemma2 9B Instruct using AutoAWQ, you will need to use an instance with at least enough CPU RAM to fit the whole model i.e. ~20GiB, and an NVIDIA GPU with 16GiB of VRAM to quantize it.
+>
+> Additionally, you also need to accept the Gemma2 access conditions, as it is a gated model that requires accepting those first.
 
 In order to quantize Gemma2 9B Instruct, first install the following packages:
 
 ```bash
-pip install -q --upgrade "torch==2.3.0" transformers accelerate
+pip install -q --upgrade "torch==2.3.0" "transformers>=4.45.0" accelerate
 INSTALL_KERNELS=1 pip install -q git+https://github.com/casper-hansen/AutoAWQ.git@79547665bdb27768a9b392ef375776b020acbf0c
 ```
 
+Then you need to install the `huggingface_hub` Python SDK and login to the Hugging Face Hub.
+
+```bash
+pip install -q --upgrade huggingface_hub
+huggingface-cli login
+```
+
 Then run the following script, adapted from [`AutoAWQ/examples/quantize.py`](https://github.com/casper-hansen/AutoAWQ/blob/main/examples/quantize.py):
 
 ```python