
Qwen3-8B-4bit-SINQ

This is a 4-bit quantized version of the Qwen/Qwen3-8B model using SINQ (Sinkhorn-Normalized Quantization).

Model Details

  • Base Model: Qwen/Qwen3-8B
  • Quantization Method: SINQ (see the quantization sketch after this list)
  • Bit-width: 4-bit
  • Group Size: 128
  • Tiling Mode: 1D
  • Compute Dtype: bfloat16
  • Compression Ratio: ~4x
  • Expected Memory Reduction: ~75%
  • Quantized on: October 6, 2025
  • Hardware: NVIDIA A100 (80GB)
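
For context on the numbers above: 8B parameters in bfloat16 take roughly 8e9 × 2 bytes ≈ 16 GB, so 4-bit weights plus per-group scale metadata land in the 4-5 GB range, which is where the ~4x ratio and ~75% reduction come from. The sketch below shows how such a checkpoint can be produced with the SINQ library; the module paths and call signature (sinq.patch_model.AutoSINQHFModel, sinq.sinqlinear.BaseQuantizeConfig) follow the examples in the SINQ repository and are assumptions that may differ across versions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Assumed import paths, per the SINQ repo's examples; verify against your version.
from sinq.patch_model import AutoSINQHFModel
from sinq.sinqlinear import BaseQuantizeConfig

base = "Qwen/Qwen3-8B"
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base)

# Settings mirroring this card: 4-bit, group size 128, 1D tiling, SINQ method.
cfg = BaseQuantizeConfig(nbits=4, group_size=128, tiling_mode="1D", method="sinq")

# Quantize in place on a single GPU (this card used an A100 80GB).
AutoSINQHFModel.quantize_model(
    model,
    tokenizer=tokenizer,
    quant_config=cfg,
    compute_dtype=torch.bfloat16,
    device="cuda:0",
)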

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True loads the custom quantized-layer code shipped with
# the checkpoint; device_map="auto" places the weights on available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "avinashhm/Qwen3-8B-4bit-SINQ",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "avinashhm/Qwen3-8B-4bit-SINQ",
    trust_remote_code=True,
)

prompt = "Describe the future of artificial intelligence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
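
Qwen3 is a chat-tuned model, so instruction-style prompts usually work better through the chat template. The snippet below is a sketch that reuses the model and tokenizer from above; enable_thinking is the Qwen3-specific template switch for the model's reasoning traces (it defaults to True in the official tokenizer config).

messages = [{"role": "user", "content": "Describe the future of artificial intelligence."}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # set True to keep Qwen3's reasoning traces
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))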

Notes

  • Quantized using the SINQ library: https://github.com/huawei-csl/SINQ
  • Suitable for inference on GPUs with at least 16GB VRAM (e.g., NVIDIA T4, A100); a quick memory check follows these notes.
  • Ensure transformers is up to date: pip install git+https://github.com/huggingface/transformers.git
  • Pushed to Hugging Face Hub on October 6, 2025, at 10:33 PM IST.
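
To sanity-check the memory savings on your own machine, transformers models expose get_memory_footprint(); with the model loaded as in the Usage section, a quick check looks like this (exact numbers vary with the SINQ packing and any non-quantized layers):

# Rough weight-memory check; ~16 GB is the bf16 baseline for 8B parameters.
gb = model.get_memory_footprint() / 1024**3
print(f"Quantized footprint: {gb:.1f} GB")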