---
license: apache-2.0
base_model: Qwen/Qwen3-Reranker-0.6B
base_model_relation: quantized
tags:
- gguf
- quantized
- llama.cpp
- text-ranking
model_type: qwen3
quantized_by: Jonathan Middleton
revision: 602838d # Aug 19 2025
---

# Qwen3-Reranker-0.6B-GGUF

**🚨 REQUIRED llama.cpp build:** https://github.com/ngxson/llama.cpp/tree/xsn/qwen3_embd_rerank

**This unmerged fix branch is mandatory** for running Qwen3 reranking models. Other GGUF quantizations of the 0.6B reranker on Hugging Face typically fail under mainline `llama.cpp` because they were not produced with this build. **This quantization was produced with the above build and works.**

## Purpose

Multilingual **text-reranking** model in **GGUF** format for efficient CPU/GPU inference with *llama.cpp*-compatible back-ends. Parameters ≈ **0.6 B**.

**Note:** The token-embedding matrix and output tensors are **left at FP16** across all quantizations.

## Files

| Filename                          | Quant  | Size (bytes / MiB)           | Est. quality Δ vs FP16 |
|-----------------------------------|--------|------------------------------|------------------------|
| `Qwen3-Reranker-0.6B-F16.gguf`    | FP16   | 1,197,634,048 B (1142.2 MiB) | 0 (reference)          |
| `Qwen3-Reranker-0.6B-Q4_K_M.gguf` | Q4_K_M | 396,476,032 B (378.1 MiB)    | TBD                    |
| `Qwen3-Reranker-0.6B-Q5_K_M.gguf` | Q5_K_M | 444,186,496 B (423.6 MiB)    | TBD                    |
| `Qwen3-Reranker-0.6B-Q6_K.gguf`   | Q6_K   | 494,878,880 B (472.0 MiB)    | TBD                    |
| `Qwen3-Reranker-0.6B-Q8_0.gguf`   | Q8_0   | 639,153,088 B (609.5 MiB)    | TBD                    |

## Upstream Source

* **Repo:** `Qwen/Qwen3-Reranker-0.6B`
* **Commit:** `f16fc5d` (2025-06-09)
* **License:** Apache-2.0

## Conversion & Quantization

```bash
# Convert safetensors → GGUF (FP16)
python convert_hf_to_gguf.py ~/models/local/Qwen3-Reranker-0.6B \
  --outtype f16 --outfile Qwen3-Reranker-0.6B-F16.gguf

# Quantize variants, keeping token embeddings and the output tensor at FP16
EMB_OPT="--token-embedding-type F16 --leave-output-tensor"
for QT in Q4_K_M Q5_K_M Q6_K Q8_0; do
  llama-quantize $EMB_OPT Qwen3-Reranker-0.6B-F16.gguf Qwen3-Reranker-0.6B-${QT}.gguf $QT
done
```
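To confirm that the token-embedding and output tensors stayed at FP16 after quantization, the tensor list can be inspected with `gguf-dump` from the `gguf` Python package. A minimal sketch; `token_embd.weight` is the usual Qwen3 GGUF tensor name and is an assumption here:

```bash
# Install the GGUF tooling, then list tensor dtypes for one quant;
# token_embd.weight (and any output tensor) should report F16
pip install gguf
gguf-dump Qwen3-Reranker-0.6B-Q4_K_M.gguf | grep -Ei 'token_embd|output'
```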
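## Building the required branch

The fix branch builds like mainline *llama.cpp*. A minimal sketch, assuming the branch follows the standard CMake layout:

```bash
# Fetch the fix branch and build the llama.cpp binaries
git clone --branch xsn/qwen3_embd_rerank https://github.com/ngxson/llama.cpp
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
```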
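## Running the reranker

Mainline `llama-server` exposes reranking through the `--reranking` flag and a `/v1/rerank` endpoint; the sketch below assumes the fix branch keeps that interface. The port, query, and documents are illustrative:

```bash
# Serve the quantized reranker (assumes the branch supports --reranking)
./build/bin/llama-server -m Qwen3-Reranker-0.6B-Q4_K_M.gguf --reranking --port 8080

# Score candidate documents against a query; the response contains
# a relevance score per document
curl http://localhost:8080/v1/rerank -H "Content-Type: application/json" -d '{
  "query": "What is the capital of France?",
  "documents": [
    "Paris is the capital and largest city of France.",
    "Reranking models score query-document pairs."
  ]
}'
```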