Initial GPTQ model commit
README.md CHANGED

license: llama2
model_creator: OpenAssistant
model_link: https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10
model_name: CodeLlama 13B SFT v10
model_type: llama
quantized_by: TheBloke
---

<hr style="margin-top: 1.0em; margin-bottom: 1.0em;">
<!-- header end -->

# CodeLlama 13B SFT v10 - GPTQ
- Model creator: [OpenAssistant](https://huggingface.co/OpenAssistant)
- Original model: [CodeLlama 13B SFT v10](https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10)

<!-- description start -->
## Description

This repo contains GPTQ model files for [OpenAssistant's CodeLlama 13B SFT v10](https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10).

Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.

<!-- description end -->
<!-- repositories-available start -->
## Repositories available

* [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/CodeLlama-13B-OASST-SFT-v10-GPTQ)
* [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/CodeLlama-13B-OASST-SFT-v10-GGUF)
* [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference (deprecated)](https://huggingface.co/TheBloke/CodeLlama-13B-OASST-SFT-v10-GGML)
* [OpenAssistant's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10)
<!-- repositories-available end -->

<!-- prompt-template start -->
## Prompt template: ChatML

```
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```

<!-- prompt-template end -->
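
The model expects every prompt to be wrapped in this ChatML format. As a convenience, a small helper along these lines (an illustrative sketch, not something shipped with this repo; the default system message is only an example) builds the full prompt string:

```python
# Illustrative helper that fills the ChatML template shown above.
def build_chatml_prompt(prompt: str, system_message: str = "You are a helpful assistant.") -> str:
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

print(build_chatml_prompt("Write a Python function that reverses a string."))
```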

<!-- README_GPTQ.md-provided-files start -->
## Provided files and GPTQ parameters

Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements.

- Bits: The bit size of the quantised model.
- GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
- Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
- Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is default, but 0.1 results in slightly better accuracy.
- GPTQ dataset: The dataset used for quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
- Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.

| Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
| ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
| main | 4 | 128 | No | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 8192 | 7.26 GB | Yes | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
| gptq-4bit-32g-actorder_True | 4 | 32 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 8192 | 8.00 GB | Yes | 4-bit, with Act Order and group size 32g. Gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
| gptq-4bit-64g-actorder_True | 4 | 64 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 8192 | 7.51 GB | Yes | 4-bit, with Act Order and group size 64g. Uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
| gptq-4bit-128g-actorder_True | 4 | 128 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 8192 | 7.26 GB | Yes | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
| gptq-8bit--1g-actorder_True | 8 | None | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 8192 | 13.36 GB | No | 8-bit, with Act Order. No group size, to lower VRAM requirements and to improve AutoGPTQ speed. |
| gptq-8bit-128g-actorder_True | 8 | 128 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) | 8192 | 13.65 GB | No | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. Poor AutoGPTQ CUDA speed. |

<!-- README_GPTQ.md-provided-files end -->
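
The Bits, GS, Act Order and Damp % values for each branch are recorded in that branch's `quantize_config.json`, so they can also be inspected programmatically. A minimal sketch using `huggingface_hub` (the branch chosen below is just one example from the table):

```python
# Download quantize_config.json for one branch and print its GPTQ parameters.
import json
from huggingface_hub import hf_hub_download

config_path = hf_hub_download(
    repo_id="TheBloke/CodeLlama-13B-OASST-SFT-v10-GPTQ",
    filename="quantize_config.json",
    revision="gptq-4bit-32g-actorder_True",  # or "main", or any other branch in the table
)
with open(config_path) as f:
    print(json.dumps(json.load(f), indent=2))
```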

<!-- README_GPTQ.md-download-from-branches start -->
## How to download from branches

- In text-generation-webui, you can add `:branch` to the end of the download name, eg `TheBloke/CodeLlama-13B-OASST-SFT-v10-GPTQ:gptq-4bit-32g-actorder_True`
- With Git, you can clone a branch with:
```
git clone --single-branch --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/CodeLlama-13B-OASST-SFT-v10-GPTQ
```
- In Python Transformers code, the branch is the `revision` parameter; see below.
<!-- README_GPTQ.md-download-from-branches end -->
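
If you would rather download a branch from Python than with Git, `huggingface_hub` can fetch it in the same way; a minimal sketch, where the local directory name is only an example:

```python
# Download a single branch of this repo with huggingface_hub instead of git.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="TheBloke/CodeLlama-13B-OASST-SFT-v10-GPTQ",
    revision="gptq-4bit-32g-actorder_True",  # branch name from the Provided Files table
    local_dir="CodeLlama-13B-OASST-SFT-v10-GPTQ",  # example destination folder
)
```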

<!-- README_GPTQ.md-text-generation-webui start -->
## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).

Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).

It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.

1. Click the **Model tab**.
2. Under **Download custom model or LoRA**, enter `TheBloke/CodeLlama-13B-OASST-SFT-v10-GPTQ`.
    - To download from a specific branch, enter for example `TheBloke/CodeLlama-13B-OASST-SFT-v10-GPTQ:gptq-4bit-32g-actorder_True`
    - see Provided Files above for the list of branches for each option.
3. Click **Download**.
4. The model will start downloading. Once it's finished it will say "Done".
5. In the top left, click the refresh icon next to **Model**.
6. In the **Model** dropdown, choose the model you just downloaded: `CodeLlama-13B-OASST-SFT-v10-GPTQ`
7. The model will automatically load, and is now ready for use!
8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
    * Note that you do not need to set GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
9. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
<!-- README_GPTQ.md-text-generation-webui end -->

<!-- README_GPTQ.md-use-from-python start -->
## How to use this GPTQ model from Python code

### Install the necessary packages

Requires: Transformers 4.32.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later.

```shell
pip3 install "transformers>=4.32.0" "optimum>=1.12.0"
pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/  # Use cu117 if on CUDA 11.7
```

If you have problems installing AutoGPTQ using the pre-built wheels, install it from source instead:

```shell
pip3 uninstall -y auto-gptq
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
pip3 install .
```

### For CodeLlama models only: you must use Transformers 4.33.0 or later

If 4.33.0 is not yet released when you read this, you will need to install Transformers from source:

```shell
pip3 uninstall -y transformers
pip3 install git+https://github.com/huggingface/transformers.git
```
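
To confirm that the installed versions meet the requirements above before loading the model, a quick check such as the following (purely illustrative) can be used:

```python
# Print the installed versions of the packages the instructions above rely on.
from importlib.metadata import version, PackageNotFoundError

requirements = [
    ("transformers", "4.33.0"),  # 4.33.0 for CodeLlama models; 4.32.0 otherwise
    ("optimum", "1.12.0"),
    ("auto-gptq", "0.4.2"),
]
for package, minimum in requirements:
    try:
        print(f"{package}: {version(package)} installed (need >= {minimum})")
    except PackageNotFoundError:
        print(f"{package}: not installed")
```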

### You can then use the following code

```python
import torch  # needed for torch.bfloat16 below
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = "TheBloke/CodeLlama-13B-OASST-SFT-v10-GPTQ"
# To use a different branch, change revision
# For example: revision="gptq-4bit-32g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             torch_dtype=torch.bfloat16,
                                             device_map="auto",
                                             revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

prompt = "Tell me about AI"
system_message = "You are a helpful assistant."  # any ChatML system message
prompt_template=f'''<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
'''

# Generate directly with model.generate() (sampling settings are illustrative)
print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, do_sample=True, temperature=0.7, top_p=0.95, top_k=40, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1
)

print(pipe(prompt_template)[0]['generated_text'])
```
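
For interactive use it is often nicer to stream tokens as they are generated. The short sketch below reuses the `model`, `tokenizer` and `prompt_template` objects from the block above; the streaming settings are illustrative rather than part of the original instructions:

```python
# Stream tokens to stdout as they are generated, reusing the objects defined above.
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
input_ids = tokenizer(prompt_template, return_tensors="pt").input_ids.cuda()
model.generate(inputs=input_ids, streamer=streamer, max_new_tokens=512, do_sample=True, temperature=0.7)
```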

<!-- README_GPTQ.md-use-from-python end -->

<!-- README_GPTQ.md-compatibility start -->
## Compatibility

The files provided are tested to work with AutoGPTQ, both via Transformers and using AutoGPTQ directly. They should also work with [Occ4m's GPTQ-for-LLaMa fork](https://github.com/0cc4m/KoboldAI).

[ExLlama](https://github.com/turboderp/exllama) is compatible with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility.

[Huggingface Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) is compatible with all GPTQ models.
<!-- README_GPTQ.md-compatibility end -->
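
For example, once a TGI server has been started with this model loaded, it can be queried over HTTP; a minimal sketch, assuming a server already running locally on port 8080 (the URL, port and generation parameters are assumptions, not part of the original card):

```python
# Query a locally running Text Generation Inference server that has this GPTQ model loaded.
import requests

chatml_prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nTell me about AI<|im_end|>\n"
    "<|im_start|>assistant\n"
)
response = requests.post(
    "http://localhost:8080/generate",  # assumed local TGI endpoint
    json={"inputs": chatml_prompt, "parameters": {"max_new_tokens": 256}},
    timeout=120,
)
print(response.json()["generated_text"])
```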

<!-- footer start -->
<!-- 200823 -->

<!-- footer end -->

# Original model card: OpenAssistant's CodeLlama 13B SFT v10

# Open-Assistant CodeLlama 13B SFT v10