# Qwen3-8B-4bit-SINQ

This is a 4-bit quantized version of the `Qwen/Qwen3-8B` model, produced with SINQ (Sinkhorn-Normalized Quantization).

## Model Details

- **Base Model**: Qwen/Qwen3-8B
- **Quantization Method**: SINQ
- **Bit-width**: 4-bit
- **Group Size**: 128
- **Tiling Mode**: 1D
- **Compute Dtype**: bfloat16
- **Compression Ratio**: ~4x
- **Expected Memory Reduction**: ~75% (a rough estimate is sketched at the end of this card)
- **Quantized on**: October 6, 2025
- **Hardware**: NVIDIA A100 (80GB)

## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "avinashhm/Qwen3-8B-4bit-SINQ",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "avinashhm/Qwen3-8B-4bit-SINQ",
    trust_remote_code=True
)

prompt = "Describe the future of artificial intelligence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Notes

- Quantized using the SINQ library: https://github.com/huawei-csl/SINQ
- Suitable for inference on GPUs with at least 16 GB of VRAM (e.g., NVIDIA T4, A100).
- Ensure `transformers` is up to date: `pip install git+https://github.com/huggingface/transformers.git`
- Pushed to the Hugging Face Hub on October 6, 2025, at 10:33 PM IST.
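
## Chat-Style Usage

Qwen3-8B is a chat-tuned model, so prompts are usually built with the tokenizer's chat template rather than passed as a raw string. The sketch below reuses the `model` and `tokenizer` objects from the Usage section above and relies only on the standard `transformers` chat-template API; the message content is illustrative.

```python
# Chat-style generation via the tokenizer's built-in chat template.
# Assumes `model` and `tokenizer` were loaded as in the Usage section above.
messages = [
    {"role": "user", "content": "Summarize the benefits of 4-bit quantization."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # append the assistant-turn header
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, not the prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```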
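
## Memory Footprint Estimate

The ~4x compression ratio and ~75% memory reduction quoted above follow directly from the quantization settings. The sketch below is a back-of-the-envelope estimate, not a measurement: the ~8.2B parameter count for Qwen3-8B and the assumption of one bf16 scale plus one bf16 zero-point per group of 128 weights are assumptions, not values reported by the SINQ library.

```python
# Rough weight-memory estimate for 4-bit quantization with group size 128.
# Assumptions: ~8.2e9 parameters; one bf16 scale + one bf16 zero-point per group.
n_params = 8.2e9
group_size = 128

bf16_gb = n_params * 2 / 1e9                          # 2 bytes per weight in bf16
packed_gb = n_params * 0.5 / 1e9                      # 4 bits = 0.5 byte per weight
overhead_gb = (n_params / group_size) * 2 * 2 / 1e9   # scale + zero-point, 2 bytes each

quant_gb = packed_gb + overhead_gb
print(f"bf16 weights : ~{bf16_gb:.1f} GB")
print(f"4-bit weights: ~{quant_gb:.1f} GB")
print(f"compression  : ~{bf16_gb / quant_gb:.1f}x")   # ~3.8x, i.e. roughly a 75% reduction
```

Activations, the KV cache, and framework overhead add to this at inference time, which is why a 16 GB GPU is suggested in the Notes rather than a card sized to the weights alone.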