# Qwen3-8B-4bit-SINQ
This is a 4-bit quantized version of the Qwen/Qwen3-8B model, produced with SINQ (Sinkhorn-Normalized Quantization).
## Model Details
- Base Model: Qwen/Qwen3-8B
- Quantization Method: SINQ
- Bit-width: 4-bit
- Group Size: 128
- Tiling Mode: 1D
- Compute Dtype: bfloat16
- Compression Ratio: ~4x
- Expected Memory Reduction: ~75% (see the rough estimate after this list)
- Quantized on: October 6, 2025
- Hardware: NVIDIA A100 (80GB)
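
As a rough sanity check on the compression and memory numbers above, the sketch below estimates the weight footprint of an ~8B-parameter model at bf16 versus 4-bit with group size 128. The per-group overhead model (one 16-bit scale and one 16-bit shift per group) is an illustrative assumption, not SINQ's exact storage layout:

```python
def estimate_weight_gb(n_params: float, bits_per_weight: float, group_size: int,
                       scale_bits: int = 16, zero_bits: int = 16) -> float:
    """Rough weight-storage estimate for group-wise quantization.

    Assumes one scale and one shift per group of weights, each stored
    at 16 bits -- an illustrative assumption, not SINQ's exact layout.
    """
    overhead = (scale_bits + zero_bits) / group_size  # extra bits per weight
    return n_params * (bits_per_weight + overhead) / 8 / 1e9

bf16 = estimate_weight_gb(8e9, 16, group_size=128, scale_bits=0, zero_bits=0)
q4 = estimate_weight_gb(8e9, 4, group_size=128)
print(f"bf16: {bf16:.1f} GB, 4-bit: {q4:.2f} GB, ratio: {bf16 / q4:.1f}x")
# bf16: 16.0 GB, 4-bit: 4.25 GB, ratio: 3.8x  (~4x compression, ~75% reduction)
```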
## Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the quantized model and tokenizer from the Hub
model = AutoModelForCausalLM.from_pretrained(
    "avinashhm/Qwen3-8B-4bit-SINQ",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "avinashhm/Qwen3-8B-4bit-SINQ",
    trust_remote_code=True,
)

# Run a short generation
prompt = "Describe the future of artificial intelligence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
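
Qwen3 is a chat-tuned model with an optional "thinking" mode, so for conversational prompts you would normally format input through the tokenizer's chat template rather than passing raw text. A minimal sketch following the upstream Qwen/Qwen3-8B model card (the `enable_thinking` flag is Qwen3-specific; verify it against your `transformers` version):

```python
messages = [
    {"role": "user", "content": "Describe the future of artificial intelligence."}
]
# Build the chat-formatted prompt; enable_thinking toggles Qwen3's reasoning mode
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```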
## Notes
- Quantized using the SINQ library: https://github.com/huawei-csl/SINQ
- Suitable for inference on GPUs with at least 16 GB of VRAM (e.g., NVIDIA T4, A100); see the footprint check below.
- Ensure `transformers` is up to date: `pip install git+https://github.com/huggingface/transformers.git`
- Pushed to the Hugging Face Hub on October 6, 2025, at 10:33 PM IST.
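
To confirm the memory reduction on your own hardware, `get_memory_footprint()` (a standard helper on `transformers` models) reports the size of the loaded weights; expect a result well below the ~16 GB of the bf16 baseline:

```python
# Size of the loaded (quantized) weights in GB
print(f"Model weights: {model.get_memory_footprint() / 1e9:.2f} GB")
```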