Update README.md

README.md CHANGED

@@ -35,7 +35,8 @@ In order to use the current quantized model, support is offered for different so
 In order to run the inference with Gemma2 9B Instruct AWQ in INT4, you need to install the following packages:
 
 ```bash
-pip install -q --upgrade transformers
+pip install -q --upgrade "transformers>=4.45.0" accelerate
+INSTALL_KERNELS=1 pip install -q git+https://github.com/casper-hansen/AutoAWQ.git@79547665bdb27768a9b392ef375776b020acbf0c
 ```
 
 To run the inference on top of Gemma2 9B Instruct AWQ in INT4 precision, the AWQ model can be instantiated as any other causal language modeling model via `AutoModelForCausalLM` and run the inference normally.
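
For reference, inference with `AutoModelForCausalLM` on this model could look roughly like the sketch below. This is not the snippet from the README itself; the prompt, generation settings, and dtype/device choices are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hugging-quants/gemma-2-9b-it-AWQ-INT4"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision, as is typical for AWQ checkpoints
    device_map="auto",
)

# Gemma2 is instruction-tuned, so build the prompt with the chat template
messages = [{"role": "user", "content": "What is AWQ quantization in one sentence?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0, input_ids.shape[1]:], skip_special_tokens=True))
```
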
@@ -81,7 +82,8 @@ print(tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_spe
 In order to run the inference with Gemma2 9B Instruct AWQ in INT4, you need to install the following packages:
 
 ```bash
-pip install -q --upgrade transformers
+pip install -q --upgrade "transformers>=4.45.0" accelerate
+INSTALL_KERNELS=1 pip install -q git+https://github.com/casper-hansen/AutoAWQ.git@79547665bdb27768a9b392ef375776b020acbf0c
 ```
 
 Alternatively, one may want to run that via `AutoAWQ` even though it's built on top of 🤗 `transformers`, which is the recommended approach instead as described above.
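
For the `AutoAWQ` route mentioned above, a rough sketch is shown below. It is not the README's exact example (that one is adapted from `AutoAWQ/examples/generate.py`); `fuse_layers=False` and the prompt are illustrative assumptions.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "hugging-quants/gemma-2-9b-it-AWQ-INT4"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# fuse_layers is disabled here because fused-kernel support varies by architecture
model = AutoAWQForCausalLM.from_quantized(model_id, fuse_layers=False)

messages = [{"role": "user", "content": "Explain INT4 weight quantization briefly."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

# AutoAWQ forwards generate() to the underlying 🤗 transformers model
outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0, input_ids.shape[1]:], skip_special_tokens=True))
```
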
@@ -119,24 +121,18 @@ The AutoAWQ script has been adapted from [`AutoAWQ/examples/generate.py`](https:
 
 ### 🤗 Text Generation Inference (TGI)
 
-To run the `text-generation-launcher` with Gemma2 9B Instruct AWQ in INT4 with Marlin kernels for optimized inference speed, you will need to have Docker installed (see [installation notes](https://docs.docker.com/engine/install/))
-
-```bash
-pip install -q --upgrade huggingface_hub
-huggingface-cli login
-```
+To run the `text-generation-launcher` with Gemma2 9B Instruct AWQ in INT4 with Marlin kernels for optimized inference speed, you will need to have Docker installed (see [installation notes](https://docs.docker.com/engine/install/)).
 
-Then you just need to run the TGI v2.
+Then you just need to run the TGI v2.3.0 (or higher) Docker container as follows:
 
 ```bash
 docker run --gpus all --shm-size 1g -ti -p 8080:80 \
     -v hf_cache:/data \
     -e MODEL_ID=hugging-quants/gemma-2-9b-it-AWQ-INT4 \
     -e QUANTIZE=awq \
-    -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
     -e MAX_INPUT_LENGTH=4000 \
     -e MAX_TOTAL_TOKENS=4096 \
-    ghcr.io/huggingface/text-generation-inference:2.
+    ghcr.io/huggingface/text-generation-inference:2.3.0
 ```
 
 > [!NOTE]
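
As a quick smoke test of the container started above (not part of the diff itself), the OpenAI-compatible `/v1/chat/completions` route exposed by TGI can be called directly; the use of `requests`, the prompt, and the `max_tokens` value are illustrative assumptions.

```python
import requests

# Query the locally running TGI container via its OpenAI-compatible chat route
response = requests.post(
    "http://0.0.0.0:8080/v1/chat/completions",
    json={
        "model": "tgi",
        "messages": [{"role": "user", "content": "What is quantization?"}],
        "max_tokens": 128,
    },
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])
```
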
@@ -166,7 +162,7 @@ Or programatically via the `huggingface_hub` Python client as follows:
 import os
 from huggingface_hub import InferenceClient
 
-client = InferenceClient(base_url="http://0.0.0.0:8080", api_key=
+client = InferenceClient(base_url="http://0.0.0.0:8080", api_key="-")
 
 chat_completion = client.chat.completions.create(
     model="hugging-quants/gemma-2-9b-it-AWQ-INT4",
@@ -183,7 +179,7 @@ Alternatively, the OpenAI Python client can also be used (see [installation note
 import os
 from openai import OpenAI
 
-client = OpenAI(base_url="http://0.0.0.0:8080/v1", api_key=
+client = OpenAI(base_url="http://0.0.0.0:8080/v1", api_key="-")
 
 chat_completion = client.chat.completions.create(
     model="tgi",
@@ -243,16 +239,25 @@ chat_completion = client.chat.completions.create(
 
 ## Quantization Reproduction
 
-> [!
+> [!IMPORTANT]
 > In order to quantize Gemma2 9B Instruct using AutoAWQ, you will need to use an instance with at least enough CPU RAM to fit the whole model i.e. ~20GiB, and an NVIDIA GPU with 16GiB of VRAM to quantize it.
+>
+> Additionally, you also need to accept the Gemma2 access conditions, as it is a gated model that requires accepting those first.
 
 In order to quantize Gemma2 9B Instruct, first install the following packages:
 
 ```bash
-pip install -q --upgrade "torch==2.3.0" transformers accelerate
+pip install -q --upgrade "torch==2.3.0" "transformers>=4.45.0" accelerate
 INSTALL_KERNELS=1 pip install -q git+https://github.com/casper-hansen/AutoAWQ.git@79547665bdb27768a9b392ef375776b020acbf0c
 ```
 
+Then you need to install the `huggingface_hub` Python SDK and login to the Hugging Face Hub.
+
+```bash
+pip install -q --upgrade huggingface_hub
+huggingface-cli login
+```
+
 Then run the following script, adapted from [`AutoAWQ/examples/quantize.py`](https://github.com/casper-hansen/AutoAWQ/blob/main/examples/quantize.py):
 
 ```python