# Qwen3-8B-4bit-SINQ

This is a 4-bit quantized version of the `Qwen/Qwen3-8B` model, produced with SINQ (Sinkhorn-Normalized Quantization).

## Model Details

- **Base Model**: Qwen/Qwen3-8B
- **Quantization Method**: SINQ
- **Bit-width**: 4-bit
- **Group Size**: 128
- **Tiling Mode**: 1D
- **Compute Dtype**: bfloat16
- **Compression Ratio**: ~4x
- **Expected Memory Reduction**: ~75% (a rough estimate is sketched at the end of this card)
- **Quantized on**: October 6, 2025
- **Hardware**: NVIDIA A100 (80GB)

## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "avinashhm/Qwen3-8B-4bit-SINQ",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "avinashhm/Qwen3-8B-4bit-SINQ",
    trust_remote_code=True
)

prompt = "Describe the future of artificial intelligence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Notes

- Quantized using the SINQ library: https://github.com/huawei-csl/SINQ
- Suitable for inference on GPUs with at least 16 GB of VRAM (e.g., NVIDIA T4, A100).
- Ensure `transformers` is up to date: `pip install git+https://github.com/huggingface/transformers.git`
- Pushed to the Hugging Face Hub on October 6, 2025, at 10:33 PM IST.
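
## Chat-Style Usage

Qwen3-8B is a chat-tuned model, so prompts are usually built with the tokenizer's chat template rather than passed as a raw string. The sketch below reuses the `model` and `tokenizer` objects from the Usage section above and relies only on the standard `transformers` chat-template API; the message content is illustrative.

```python
# Chat-style generation via the tokenizer's built-in chat template.
# Assumes `model` and `tokenizer` were loaded as in the Usage section above.
messages = [
    {"role": "user", "content": "Summarize the benefits of 4-bit quantization."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # append the assistant-turn header
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, not the prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```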
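
## Memory Footprint Estimate

The ~4x compression ratio and ~75% memory reduction quoted above follow directly from the quantization settings. The sketch below is a back-of-the-envelope estimate, not a measurement: the ~8.2B parameter count for Qwen3-8B and the assumption of one bf16 scale plus one bf16 zero-point per group of 128 weights are assumptions, not values reported by the SINQ library.

```python
# Rough weight-memory estimate for 4-bit quantization with group size 128.
# Assumptions: ~8.2e9 parameters; one bf16 scale + one bf16 zero-point per group.
n_params = 8.2e9
group_size = 128

bf16_gb = n_params * 2 / 1e9                          # 2 bytes per weight in bf16
packed_gb = n_params * 0.5 / 1e9                      # 4 bits = 0.5 byte per weight
overhead_gb = (n_params / group_size) * 2 * 2 / 1e9   # scale + zero-point, 2 bytes each

quant_gb = packed_gb + overhead_gb
print(f"bf16 weights : ~{bf16_gb:.1f} GB")
print(f"4-bit weights: ~{quant_gb:.1f} GB")
print(f"compression  : ~{bf16_gb / quant_gb:.1f}x")   # ~3.8x, i.e. roughly a 75% reduction
```

Activations, the KV cache, and framework overhead add to this at inference time, which is why a 16 GB GPU is suggested in the Notes rather than a card sized to the weights alone.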