# Qwen3-8B-4bit-SINQ
This is a 4-bit quantized version of the Qwen/Qwen3-8B model, produced with SINQ (Sinkhorn-Normalized Quantization).
## Model Details
- Base Model: Qwen/Qwen3-8B
- Quantization Method: SINQ
- Bit-width: 4-bit
- Group Size: 128
- Tiling Mode: 1D
- Compute Dtype: bfloat16
- Compression Ratio: ~4x
- Expected Memory Reduction: ~75% (see the rough estimate after this list)
- Quantized on: October 6, 2025
- Hardware: NVIDIA A100 (80GB)
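
As a rough sanity check on the compression and memory numbers above, the sketch below estimates the weight footprint of an ~8B-parameter model at bf16 versus 4-bit with group size 128. The per-group overhead model (one 16-bit scale and one 16-bit shift per group) is an illustrative assumption, not SINQ's exact storage layout:

```python
def estimate_weight_gb(n_params: float, bits_per_weight: float, group_size: int,
                       scale_bits: int = 16, zero_bits: int = 16) -> float:
    """Rough weight-storage estimate for group-wise quantization.

    Assumes one scale and one shift per group of weights, each stored
    at 16 bits -- an illustrative assumption, not SINQ's exact layout.
    """
    overhead = (scale_bits + zero_bits) / group_size  # extra bits per weight
    return n_params * (bits_per_weight + overhead) / 8 / 1e9

bf16 = estimate_weight_gb(8e9, 16, group_size=128, scale_bits=0, zero_bits=0)
q4 = estimate_weight_gb(8e9, 4, group_size=128)
print(f"bf16: {bf16:.1f} GB, 4-bit: {q4:.2f} GB, ratio: {bf16 / q4:.1f}x")
# bf16: 16.0 GB, 4-bit: 4.25 GB, ratio: 3.8x  (~4x compression, ~75% reduction)
```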
## Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the quantized model and tokenizer from the Hub
model = AutoModelForCausalLM.from_pretrained(
    "avinashhm/Qwen3-8B-4bit-SINQ",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "avinashhm/Qwen3-8B-4bit-SINQ",
    trust_remote_code=True,
)

# Run a short generation
prompt = "Describe the future of artificial intelligence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
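
Qwen3 is a chat-tuned model with an optional "thinking" mode, so for conversational prompts you would normally format input through the tokenizer's chat template rather than passing raw text. A minimal sketch following the upstream Qwen/Qwen3-8B model card (the `enable_thinking` flag is Qwen3-specific; verify it against your `transformers` version):

```python
messages = [
    {"role": "user", "content": "Describe the future of artificial intelligence."}
]
# Build the chat-formatted prompt; enable_thinking toggles Qwen3's reasoning mode
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```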
## Notes
- Quantized using the SINQ library: https://github.com/huawei-csl/SINQ
- Suitable for inference on GPUs with at least 16 GB of VRAM (e.g., NVIDIA T4, A100); see the footprint check below.
- Ensure `transformers` is up to date: `pip install git+https://github.com/huggingface/transformers.git`
- Pushed to the Hugging Face Hub on October 6, 2025, at 10:33 PM IST.
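
To confirm the memory reduction on your own hardware, `get_memory_footprint()` (a standard helper on `transformers` models) reports the size of the loaded weights; expect a result well below the ~16 GB of the bf16 baseline:

```python
# Size of the loaded (quantized) weights in GB
print(f"Model weights: {model.get_memory_footprint() / 1e9:.2f} GB")
```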