# Llama 3 8B - Pruned 30% + Mixed-Precision Quantization (GGUF)
## Model Description

This model is a pruned and mixed-precision quantized version of Llama 3 8B:

- Base Model: meta-llama/Meta-Llama-3-8B
- Pruning: 30% Taylor pruning (from naveedashfaq/llama-3-8b-pruned-30-percent-taylor)
- Quantization: Custom mixed precision applied with llama.cpp's `--tensor-type` overrides
## Quantization Details

| Metric | Value |
|---|---|
| Average Bitwidth | 8.25 bpw |
| File Size | 6.10 GB |
| Compression vs F16 | 48.4% |
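These figures are mutually consistent: 8.25 bpw is 8.25 / 16 ≈ 51.6% of F16's 16 bpw, i.e. a 1 − 8.25/16 = 48.4% reduction in weight storage.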
## Mixed-Precision Strategy

Different tensors are quantized at different precisions based on sensitivity (a sketch of the corresponding `llama-quantize` invocation follows the table):
| Tensor Type | Quantization | Reason |
|---|---|---|
| token_embd, output | Q8_0 | Critical for quality |
| attn_v (layers 0-2, 30-31) | Q8_0 | Most sensitive layers |
| attn_v (other layers) | Q6_K | High sensitivity |
| attn_output | Q8_0 | Important for attention |
| attn_q, attn_k | Q5_K | Medium sensitivity |
| ffn_up | Q5_K | Medium sensitivity |
| ffn_gate | Q4_K | Robust to quantization |
| ffn_down | F16 | Fallback: pruning left dimensions not divisible by 256, the K-quant block size |
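As a rough illustration, the mapping above can be expressed as `llama-quantize` overrides. This is a minimal sketch, not the exact command used: the override patterns (in particular the regex for the layer-specific attn_v rule) and the precedence between overlapping patterns are assumptions, and the pattern syntax may vary across llama.cpp versions.

```bash
# Sketch: reproduce the per-tensor mapping with llama.cpp's llama-quantize.
# The patterns below are illustrative assumptions; the attn_v regex is meant
# to cover layers 0-2 and 30-31. ffn_down tensors with non-256-divisible
# dimensions fall back to F16, per the strategy table above.
./llama-quantize --imatrix imatrix.dat \
  --tensor-type token_embd=q8_0 \
  --tensor-type output=q8_0 \
  --tensor-type 'blk\.([0-2]|3[01])\.attn_v=q8_0' \
  --tensor-type attn_v=q6_k \
  --tensor-type attn_output=q8_0 \
  --tensor-type attn_q=q5_k \
  --tensor-type attn_k=q5_k \
  --tensor-type ffn_up=q5_k \
  --tensor-type ffn_gate=q4_k \
  llama-3-8b-pruned-f16.gguf llama-3-8b-pruned-mixed-precision.gguf Q5_K_M
```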
## Quantization Type Distribution

| Type | Tensors | % of Params | Bitwidth |
|---|---|---|---|
| Q5_K | 96 | 31.34% | 5.50 bpw |
| Q8_0 | 39 | 25.37% | 8.50 bpw |
| F16 | 32 | 20.75% | 16.00 bpw |
| Q4_K | 32 | 20.75% | 4.50 bpw |
| Q6_K | 27 | 1.79% | 6.56 bpw |
| F32 | 66 (norms) | — | 32.00 bpw |
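As a sanity check, the parameter-weighted average of these bitwidths, 0.3134 × 5.50 + 0.2537 × 8.50 + 0.2075 × 16.00 + 0.2075 × 4.50 + 0.0179 × 6.56 ≈ 8.25 bpw, reproduces the average bitwidth reported above (the F32 norm tensors contribute a negligible parameter share).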
## Usage

```bash
# With llama.cpp
./llama-cli -m llama-3-8b-pruned-mixed-precision.gguf -p "Hello, I am"
```

```python
# With llama-cpp-python
from llama_cpp import Llama

llm = Llama(model_path="llama-3-8b-pruned-mixed-precision.gguf")
# Generate a short completion from the loaded model
out = llm("Hello, I am", max_tokens=32)
print(out["choices"][0]["text"])
```
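The GGUF file can also be served over HTTP with llama.cpp's bundled server; a minimal sketch, where the host and port are illustrative defaults:

```bash
# Serve the model with llama.cpp's OpenAI-compatible HTTP server
./llama-server -m llama-3-8b-pruned-mixed-precision.gguf --host 127.0.0.1 --port 8080
```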
## Files

- `mixed_precision_REAL.gguf` - The quantized model (6.10 GB)
- `mixed_precision_config.json` - Sensitivity analysis config
- `quantization_summary.json` - Quantization summary
## Method

- Started with the 30% Taylor-pruned Llama 3 8B
- Converted it to an F16 GGUF
- Extracted an importance matrix (imatrix) for calibration
- Applied sensitivity-based mixed-precision quantization using the `--tensor-type` flag
- Validated different quantization types per tensor (see the pipeline sketch below)
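A rough end-to-end sketch of this pipeline with llama.cpp's tools follows. File names and the calibration text are placeholders, and the final step is abbreviated (see the fuller `--tensor-type` sketch in the Mixed-Precision Strategy section):

```bash
# 1. Convert the pruned HF checkpoint to an F16 GGUF
python convert_hf_to_gguf.py ./llama-3-8b-pruned-30-percent-taylor \
  --outfile llama-3-8b-pruned-f16.gguf --outtype f16

# 2. Extract an importance matrix from calibration text
./llama-imatrix -m llama-3-8b-pruned-f16.gguf -f calibration.txt -o imatrix.dat

# 3. Quantize with per-tensor overrides (abbreviated)
./llama-quantize --imatrix imatrix.dat \
  --tensor-type attn_v=q6_k --tensor-type ffn_gate=q4_k \
  llama-3-8b-pruned-f16.gguf mixed_precision_REAL.gguf Q5_K_M
```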
## Citation

```bibtex
@misc{llama3-pruned-mixed-precision,
  author    = {Muhammad Ahmad},
  title     = {Llama 3 8B Pruned Mixed-Precision GGUF},
  year      = {2025},
  publisher = {HuggingFace},
}
```