ik_llama.cpp imatrix Quantizations of Qwen/Qwen3-14B

This quant collection REQUIRES the ik_llama.cpp fork to support advanced non-linear SotA quants. Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!

NOTE: ik_llama.cpp can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants.

These quants provide best in class quality for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community here and on r/LocalLLaMA for tips and tricks helping each other run all the fun new models!

Excited to share and learn together. Thanks!

Quant Collection

So far these are my best recipes, offering great quality at good memory footprint breakpoints.

Qwen3-14B-IQ4_KS

8.454 GiB (4.917 BPW)

  • type f32: 161 tensors - norms etc.
  • type iq6_k: 2 tensors - token_embd/output
  • type iq4_ks: 80 tensors - ffn_(gate|up)
  • type iq5_ks: 200 tensors - ffn_down and all attn_*
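
The size, BPW, and tensor counts above can be cross-checked against each other with a little arithmetic. This is only a rough sketch: it assumes the reported size is in GiB (2^30 bytes), ignores GGUF metadata overhead, and infers the block count from the listing (two ffn_(gate|up) tensors per block). The per-block tensor layout is taken from the header comments of the recipe script.

```bash
#!/usr/bin/env bash
# Figures copied from the listing above.
size_gib=8.454
bpw=4.917
f32=161; iq6_k=2; iq4_ks=80; iq5_ks=200

# Implied weight count in billions: total bits / bits per weight.
weights_b=$(awk -v s="$size_gib" -v b="$bpw" \
  'BEGIN { printf "%.2f", s * 1024 * 1024 * 1024 * 8 / b / 1e9 }')

# Assumption: each block holds one ffn_gate and one ffn_up tensor.
layers=$(( iq4_ks / 2 ))

# Per block: ffn_down + attn_q/k/v/output land in iq5_ks,
# and attn_norm, ffn_norm, attn_q_norm, attn_k_norm stay f32 (plus one output_norm).
expect_iq5=$(( layers * 5 ))
expect_f32=$(( layers * 4 + 1 ))
total=$(( f32 + iq6_k + iq4_ks + iq5_ks ))

echo "implied weights: ~${weights_b} billion"
echo "blocks: ${layers}, iq5_ks expected: ${expect_iq5}, f32 expected: ${expect_f32}"
echo "total tensors: ${total}"
```

The implied weight count comes out near 14.77 billion, which is in line with a 14B-class model, and the expected per-type counts agree with the listing.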

This quant is designed to take advantage of faster iq4_ks and new iq5_ks quants.

This quant is designed for full GPU offload of 32k context (unquantized f16 kv-cache) in < 16GB VRAM (nvidia-smi reports ~13856MiB VRAM usage). Shrinking the attn tensors improves token generation performance over full Q8_0 as shown in llama-sweep-bench speed benchmarking.
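
As a back-of-the-envelope check on the VRAM figure, the f16 kv-cache size can be estimated from the model shape. This sketch assumes 40 blocks (inferred from the tensor counts above) and a K/V width of 1024 per token per block (the second dimension of attn_k and attn_v in the recipe comments). It does not model compute buffers or runtime overhead, and it does not assume every tensor actually resides in VRAM, so treat the result as a ballpark only.

```bash
#!/usr/bin/env bash
layers=40          # assumption: inferred from 80 ffn_(gate|up) tensors
kv_dim=1024        # from attn_k / attn_v shape {5120, 1024}
ctx=32768          # 32k context
bytes_per_elem=2   # unquantized f16

# K and V each store kv_dim elements per token per block.
kv_bytes=$(( 2 * layers * kv_dim * ctx * bytes_per_elem ))
kv_mib=$(( kv_bytes / 1024 / 1024 ))

# Model file size in MiB, truncated to an integer.
weights_mib=$(awk 'BEGIN { printf "%d", 8.454 * 1024 }')

echo "kv-cache:     ${kv_mib} MiB"
echo "weights + kv: $(( weights_mib + kv_mib )) MiB"
```

The kv-cache works out to 5120 MiB, and weights plus kv-cache land in the same neighborhood as the ~13856MiB reported by nvidia-smi.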

Quantization

👈 Secret Recipe
#!/usr/bin/env bash

# token_embd.weight,         torch.bfloat16 --> BF16, shape = {5120, 151936}
#
# blk.28.ffn_down.weight,    torch.bfloat16 --> BF16, shape = {17408, 5120}
# blk.28.ffn_gate.weight,    torch.bfloat16 --> BF16, shape = {5120, 17408}
# blk.28.ffn_up.weight,      torch.bfloat16 --> BF16, shape = {5120, 17408}
#
# blk.28.attn_output.weight, torch.bfloat16 --> BF16, shape = {5120, 5120}
# blk.28.attn_q.weight,      torch.bfloat16 --> BF16, shape = {5120, 5120}
# blk.28.attn_k.weight,      torch.bfloat16 --> BF16, shape = {5120, 1024}
# blk.28.attn_v.weight,      torch.bfloat16 --> BF16, shape = {5120, 1024}
#
# blk.28.attn_norm.weight,   torch.bfloat16 --> F32, shape = {5120}
# blk.28.ffn_norm.weight,    torch.bfloat16 --> F32, shape = {5120}
# blk.28.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
# blk.28.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
#
# output_norm.weight,        torch.bfloat16 --> F32, shape = {5120}
# output.weight,             torch.bfloat16 --> BF16, shape = {5120, 151936}

custom="
# Attention
blk\.[0-9]\.attn_.*\.weight=iq5_ks
blk\.[1-3][0-9]\.attn_.*\.weight=iq5_ks

# FFN
blk\.[0-9]\.ffn_down\.weight=iq5_ks
blk\.[1-3][0-9]\.ffn_down\.weight=iq5_ks

blk\.[0-9]\.ffn_(gate|up)\.weight=iq4_ks
blk\.[1-3][0-9]\.ffn_(gate|up)\.weight=iq4_ks

# Token embedding/output
token_embd\.weight=iq6_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --imatrix /mnt/astrodata/llm/models/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16-imatrix.dat \
    --custom-q "$custom" \
    /mnt/astrodata/llm/models/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf \
    /mnt/astrodata/llm/models/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-IQ4_KS.gguf \
    IQ4_KS \
    16
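
To see what the grep and sed step in the recipe hands to --custom-q, here is a small sketch that runs the same pipeline on a two-rule sample and then tests the patterns against a few tensor names. The `sample` list and the `match_rules` helper are hypothetical names made up for illustration, and `grep -E` is only a stand-in for whatever regex matching llama-quantize performs internally, so this is a sanity check rather than a faithful emulation. GNU sed is needed for `-z`.

```bash
#!/usr/bin/env bash
# Hypothetical two-rule sample in the same format as the recipe.
sample="
# Attention
blk\.[1-3][0-9]\.attn_.*\.weight=iq5_ks

# FFN
blk\.[1-3][0-9]\.ffn_(gate|up)\.weight=iq4_ks
"

# Same pipeline as the recipe: drop comment lines, then turn runs of
# newlines into single commas and trim the leading/trailing comma.
flat=$(
  printf '%s\n' "$sample" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
echo "$flat"

# Print the quant type of every rule whose pattern matches a tensor name.
match_rules() {
  name=$1
  printf '%s\n' "$flat" | tr ',' '\n' | while IFS= read -r rule; do
    # Text before the last '=' is the pattern, text after it is the type.
    if printf '%s\n' "$name" | grep -Eq -- "${rule%=*}"; then
      printf '%s\n' "${rule##*=}"
    fi
  done
}

match_rules blk.28.ffn_up.weight
match_rules blk.28.attn_k.weight
match_rules blk.28.ffn_down.weight
```

The flattened string is a single comma-separated line of `pattern=type` pairs. In this sample the ffn_down name matches nothing because no rule for it was included, which is the kind of gap this check is meant to catch before a long quantization run.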

Methodology

Full methodology and some benchmarks are available in this Quant Cookers Basic Guide

llama-sweep-bench chart
