stablelm-tuned-alpha-3b-gptq-4bit-128g
This is a quantized model saved with auto-gptq. At the time of writing, you cannot load these models directly from the hub; you will need to clone this repo and load it locally:
git lfs install
git clone https://huggingface.co/ethzanalytics/stablelm-tuned-alpha-3b-gptq-4bit-128g
See the excerpt from the tutorial below for instructions.
Auto-GPTQ Quick Start
Quick Installation
Starting from v0.0.4, you can install auto-gptq directly from PyPI using pip:
pip install auto-gptq
AutoGPTQ supports using triton to speed up inference, but triton currently only supports Linux. To integrate with triton, use:
pip install auto-gptq[triton]
If you want to try the newly supported llama-type models in 🤗 Transformers without updating it to the latest version, use:
pip install auto-gptq[llama]
By default, the CUDA extension will be built at installation time if CUDA and PyTorch are already installed.
To disable building the CUDA extension, you can use the following commands:
For Linux
BUILD_CUDA_EXT=0 pip install auto-gptq
For Windows
set BUILD_CUDA_EXT=0 && pip install auto-gptq
Basic Usage
The full script for the basic usage demonstrated here is examples/quantization/basic_usage.py.
The two main classes currently used in AutoGPTQ are AutoGPTQForCausalLM and BaseQuantizeConfig.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
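The excerpt below only covers loading an already-quantized model. For context, here is a minimal quantization sketch using BaseQuantizeConfig; the facebook/opt-125m model, the single calibration sentence, and the output directory name are illustrative assumptions, not part of this repo's workflow:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "facebook/opt-125m"  # assumed small demo model, not this repo
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)

# calibration data: a list of tokenized examples used during quantization
examples = [tokenizer("auto-gptq is an easy-to-use model quantization library.")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)  # 4-bit weights, group size 128

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized("opt-125m-4bit-128g", use_safetensors=True)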
Load quantized model and do inference
Instead of .from_pretrained, you should use .from_quantized to load a quantized model.
device = "cuda:0"
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, use_triton=False, use_safetensors=True)
This will first read and load quantize_config.json from the quantized model directory, then, based on the bits and group_size values in it, load the gptq_model-4bit-128g model file onto the first GPU.
Then you can initialize 🤗 Transformers' TextGenerationPipeline and do inference.
from transformers import AutoTokenizer, TextGenerationPipeline

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device=device)
print(pipeline("auto-gptq is")[0]["generated_text"])
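If you prefer not to use the pipeline (this is not covered in the original excerpt), you can also call model.generate directly; a minimal sketch, assuming the tokenizer and model loaded above:

inputs = tokenizer("auto-gptq is", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))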
Conclusion
Congrats! You have learned how to quickly install auto-gptq and integrate with it. In the next chapter, you will learn about advanced loading strategies for pretrained or quantized models and some best practices for different situations.
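Putting it together for this repository: a minimal end-to-end sketch that loads the directory cloned above and generates a reply. The <|USER|>/<|ASSISTANT|> prompt tokens follow the upstream stablelm-tuned-alpha model card and are an assumption here, not something verified against this quantized checkpoint.

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM

model_dir = "stablelm-tuned-alpha-3b-gptq-4bit-128g"  # local clone from the git command above
device = "cuda:0"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoGPTQForCausalLM.from_quantized(model_dir, use_safetensors=True, use_triton=False)

# chat-style prompt format assumed from the upstream stablelm-tuned-alpha card
prompt = "<|USER|>Explain GPTQ quantization in one sentence.<|ASSISTANT|>"

pipe = TextGenerationPipeline(model=model, tokenizer=tokenizer, device=device)
print(pipe(prompt, max_new_tokens=64)[0]["generated_text"])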