Llama 3 8B - Pruned 30% + Mixed-Precision Quantization (GGUF)

Model Description

This model is a pruned and mixed-precision quantized version of Llama 3 8B:

  • Base Model: meta-llama/Meta-Llama-3-8B
  • Pruning: 30% Taylor pruning (from naveedashfaq/llama-3-8b-pruned-30-percent-taylor)
  • Quantization: Custom mixed-precision using llama.cpp --tensor-type

Quantization Details

Metric                  Value
Average bitwidth        8.25 bpw
File size               6.10 GB
Size reduction vs F16   48.4% (8.25 bpw is 51.6% of 16 bpw)

Mixed-Precision Strategy

Different tensors are quantized at different precisions based on sensitivity:

Tensor                      Quantization  Reason
token_embd, output          Q8_0          Critical for quality
attn_v (layers 0-2, 30-31)  Q8_0          Most sensitive layers
attn_v (other layers)       Q6_K          High sensitivity
attn_output                 Q8_0          Important for attention
attn_q, attn_k              Q5_K          Medium sensitivity
ffn_up                      Q5_K          Medium sensitivity
ffn_gate                    Q4_K          Robust to quantization
ffn_down                    F16           Fallback: pruned row sizes are not multiples of 256, so K-quants cannot apply
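
For reference, a strategy like this can be expressed through llama.cpp's quantization overrides. The sketch below is illustrative only: the file names and the Q5_K_M base type are placeholders, and the exact --tensor-type pattern syntax (including per-layer overrides such as the Q8_0 attn_v layers) may vary across llama.cpp versions.

./llama-quantize --imatrix imatrix.dat \
  --token-embedding-type q8_0 \
  --output-tensor-type q8_0 \
  --tensor-type attn_v=q6_k \
  --tensor-type attn_output=q8_0 \
  --tensor-type attn_q=q5_k \
  --tensor-type attn_k=q5_k \
  --tensor-type ffn_up=q5_k \
  --tensor-type ffn_gate=q4_k \
  --tensor-type ffn_down=f16 \
  pruned-f16.gguf mixed_precision_REAL.gguf Q5_K_M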

Quantization Type Distribution

Q5_K:  96 tensors (31.34% params) - 5.50 bpw
Q8_0:  39 tensors (25.37% params) - 8.50 bpw
F16:   32 tensors (20.75% params) - 16.00 bpw
Q4_K:  32 tensors (20.75% params) - 4.50 bpw
Q6_K:  27 tensors (1.79% params)  - 6.56 bpw
F32:   66 tensors (norms)         - 32.00 bpw
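
As a sanity check, the 8.25 bpw average follows directly from weighting each type's bits-per-weight by its parameter share; the F32 norm tensors are omitted because they hold a negligible fraction of parameters. A minimal sketch:

# Reproduce the average bitwidth from the distribution above.
# F32 norms are excluded: their parameter share is negligible.
shares = {  # type: (fraction of params, bits per weight)
    "Q5_K": (0.3134, 5.50),
    "Q8_0": (0.2537, 8.50),
    "F16":  (0.2075, 16.00),
    "Q4_K": (0.2075, 4.50),
    "Q6_K": (0.0179, 6.56),
}
avg_bpw = sum(frac * bpw for frac, bpw in shares.values())
print(f"{avg_bpw:.2f} bpw")  # -> 8.25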

Usage

# With llama.cpp
./llama-cli -m llama-3-8b-pruned-mixed-precision.gguf -p "Hello, I am"

# With llama-cpp-python
from llama_cpp import Llama

llm = Llama(model_path="llama-3-8b-pruned-mixed-precision.gguf")
out = llm("Hello, I am", max_tokens=32)
print(out["choices"][0]["text"])

Files

  • mixed_precision_REAL.gguf - The quantized model (6.10 GB)
  • mixed_precision_config.json - Sensitivity analysis config
  • quantization_summary.json - Quantization summary
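
To verify which type each tensor actually received, the shipped GGUF can be inspected with the gguf Python package; a minimal sketch, assuming a recent gguf release with the GGUFReader API:

# pip install gguf
from gguf import GGUFReader

reader = GGUFReader("mixed_precision_REAL.gguf")
for t in reader.tensors:
    # Print each tensor's name and its quantization type
    print(t.name, t.tensor_type.name)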

Method

  1. Started with 30% Taylor-pruned Llama 3 8B
  2. Converted to F16 GGUF
  3. Extracted importance matrix (imatrix) for calibration
  4. Applied sensitivity-based mixed-precision using --tensor-type flag
  5. Verified the per-tensor quantization types in the resulting GGUF (the full pipeline is sketched below)
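
Assuming standard llama.cpp tooling, the end-to-end pipeline looks roughly like the sketch below; the pruned-model path, calibration corpus, and output names are placeholders, not taken from this repository.

# 1-2. Convert the pruned HF checkpoint to an F16 GGUF
python convert_hf_to_gguf.py ./llama-3-8b-pruned-30-percent-taylor \
  --outtype f16 --outfile pruned-f16.gguf

# 3. Compute an importance matrix on a calibration corpus
./llama-imatrix -m pruned-f16.gguf -f calibration.txt -o imatrix.dat

# 4. Quantize with per-tensor overrides (see "Mixed-Precision Strategy")
./llama-quantize --imatrix imatrix.dat --tensor-type attn_v=q6_k \
  pruned-f16.gguf mixed_precision_REAL.gguf Q5_K_M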

Citation

@misc{llama3-pruned-mixed-precision,
  author = {Muhammad Ahmad},
  title = {Llama 3 8B Pruned Mixed-Precision GGUF},
  year = {2025},
  publisher = {HuggingFace},
}