This repository contains state-of-the-art ternary-packed versions of Falcon3-1B-Instruct-1.58bit in GGUF format, optimized for efficient on-device inference using the Vec-LUT method.
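Like other lookup-table (LUT) based low-bit kernels, the rough idea behind Vec-LUT is to replace the multiply-adds of a ternary matrix-vector product with table lookups indexed by packed weight bytes. The scalar sketch below illustrates that general idea only; it is not vlut.cpp's actual kernel, data layout, or vectorization, and the 2-bit encoding it uses is hypothetical.

```cpp
// Illustrative only: a scalar lookup-table (LUT) dot product for ternary weights.
// Hypothetical encoding: 2-bit codes {0, 1, 2} -> weights {-1, 0, +1}; code 3 never
// appears in valid packed bytes. Four codes are packed per byte.
#include <cstdint>
#include <cstdio>
#include <vector>

static inline float decode_trit(uint8_t code) {
    return float(int(code) - 1);   // 0 -> -1, 1 -> 0, 2 -> +1
}

// Dot product of one packed ternary weight row with a float activation vector x.
float lut_dot(const std::vector<uint8_t>& packed, const std::vector<float>& x) {
    float acc = 0.0f;
    for (size_t b = 0; b < packed.size(); ++b) {
        // Precompute partial sums for this group of 4 activations:
        // table[byte] = sum_j decode(code_j(byte)) * x[4*b + j].
        float table[256];
        for (int v = 0; v < 256; ++v) {
            float s = 0.0f;
            for (int j = 0; j < 4; ++j) {
                s += decode_trit((v >> (2 * j)) & 0x3) * x[4 * b + j];
            }
            table[v] = s;
        }
        // One lookup replaces four multiply-adds. A real kernel builds the tables once
        // per activation vector and reuses them across all weight rows, and performs
        // the lookups with SIMD table-lookup instructions rather than this scalar loop.
        acc += table[packed[b]];
    }
    return acc;
}

int main() {
    // 8 activations and 8 ternary weights {+1, -1, 0, +1, 0, 0, -1, +1} packed into 2 bytes.
    std::vector<float> x = {0.5f, -1.0f, 2.0f, 0.25f, 1.0f, 3.0f, -0.5f, 4.0f};
    std::vector<uint8_t> packed = {
        uint8_t(2 | (0 << 2) | (1 << 4) | (2 << 6)),  // +1, -1,  0, +1
        uint8_t(1 | (1 << 2) | (0 << 4) | (2 << 6)),  //  0,  0, -1, +1
    };
    std::printf("dot = %.2f\n", lut_dot(packed, x));  // 0.5 + 1.0 + 0.25 + 0.5 + 4.0 = 6.25
    return 0;
}
```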
Models are named as ggml-model-{PACKING}_{TILE}.gguf:
| File Name | Packing (BPW) | Tile Size | Comment |
|---|---|---|---|
| ggml-model-I1_V.gguf | I1_V (1.60) | 1 | |
| ggml-model-I1_V_2.gguf | I1_V (1.60) | 2 | Recommended |
| ggml-model-I2_V.gguf | I2_V (2.00) | 1 | |
| ggml-model-I2_V_4.gguf | I2_V (2.00) | 4 | Recommended |
| ggml-model-I2_V_8.gguf | I2_V (2.00) | 8 | |
I1_V achieves lower memory usage but may not always outperform I2_V in speed. We recommend I1_V_2 or I2_V_4 as a starting point. For a detailed analysis of the tiling parameters, see Evaluation.md and the paper.
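For intuition on the BPW figures in the table: storing each ternary weight in 2 bits gives 8/4 = 2.00 bits per weight, and a common way to reach 1.60 bits per weight is base-3 packing of 5 ternary weights per byte (3^5 = 243 ≤ 256, so 8/5 = 1.60). The sketch below illustrates these two generic packings; it is not the actual I1_V/I2_V layout used by vlut.cpp, whose block structure and tile parameter (the _2/_4/_8 suffix) are documented in Evaluation.md and the paper.

```cpp
// Illustrative only: where the 2.00 and 1.60 bits-per-weight (BPW) figures come from.
// These are two generic ways to pack ternary weights {-1, 0, +1} into bytes; the real
// I1_V / I2_V formats may order and group weights differently.
#include <cstdint>
#include <cstdio>

// 2 bits per weight, 4 weights per byte -> 8 / 4 = 2.00 BPW (I2_V-style density).
uint8_t pack4_2bit(const int8_t w[4]) {          // each w[i] in {-1, 0, +1}
    uint8_t b = 0;
    for (int i = 0; i < 4; ++i) {
        b |= uint8_t(w[i] + 1) << (2 * i);       // map {-1, 0, +1} -> {0, 1, 2}
    }
    return b;
}

// Base-3 packing, 5 weights per byte (3^5 = 243 <= 256) -> 8 / 5 = 1.60 BPW
// (I1_V-style density).
uint8_t pack5_base3(const int8_t w[5]) {
    uint8_t b = 0;
    for (int i = 4; i >= 0; --i) {
        b = uint8_t(b * 3 + uint8_t(w[i] + 1));
    }
    return b;
}

int main() {
    const int8_t w[5] = {+1, -1, 0, 0, +1};
    std::printf("2-bit packed (first 4 weights): 0x%02X -> 2.00 BPW\n", pack4_2bit(w));
    std::printf("base-3 packed (all 5 weights):  0x%02X -> 1.60 BPW\n", pack5_base3(w));
    return 0;
}
```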
Install vlut.cpp (these models require vlut.cpp, not vanilla llama.cpp):
git clone https://github.com/Cipherxzc/vlut.cpp.git
cd vlut.cpp
cmake -B build && cmake --build build --config Release -j4
# Download the recommended variant, e.g., I2_V_4
hf download <repo_id> \
ggml-model-I2_V_4.gguf --local-dir ./models
# Run parallel inference: 32 parallel sequences (-np 32), 16 new tokens each (-n 16), single thread (-t 1)
./build/bin/llama-batched \
-m ./models/ggml-model-I2_V_4.gguf \
-p "I believe the meaning of life is" \
-np 32 -n 16 -t 1 --temp 0.5 --repeat-penalty 1.5
# Benchmark prompt-processing speed: 128-token prompt (-p 128), no generation pass (-n 0), single thread (-t 1)
./build/bin/llama-bench \
-m ./models/ggml-model-I2_V_4.gguf \
-t 1 -p 128 -n 0
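If a run fails to load the model, one quick sanity check is that the downloaded file starts with the 4-byte GGUF magic. The snippet below is a generic check, not part of vlut.cpp; the default path matches the download example above.

```cpp
// Generic GGUF sanity check: the file must start with the 4-byte magic "GGUF".
#include <cstdio>
#include <cstring>

int main(int argc, char** argv) {
    const char* path = argc > 1 ? argv[1] : "./models/ggml-model-I2_V_4.gguf";
    FILE* f = std::fopen(path, "rb");
    if (!f) { std::perror("fopen"); return 1; }
    char magic[4] = {0};
    size_t n = std::fread(magic, 1, sizeof(magic), f);
    std::fclose(f);
    if (n == sizeof(magic) && std::memcmp(magic, "GGUF", 4) == 0) {
        std::printf("%s: GGUF magic found\n", path);
        return 0;
    }
    std::printf("%s: missing GGUF magic (truncated or wrong file?)\n", path);
    return 1;
}
```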
For comprehensive usage instructions, refer to the vlut.cpp Quick Start Guide.
If you use these models, please cite our paper:
@article{li2025veclut,
  title={Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices},
  author={Li, Xiangyu and Yin, Chengyu and Wang, Weijun and Wei, Jianyu and Cao, Ting and Liu, Yunxin},
  journal={arXiv preprint arXiv:2512.06443},
  year={2025},
  url={https://arxiv.org/abs/2512.06443}
}
And the original Falcon3 work:
@misc{Falcon3,
  title = {The Falcon 3 family of Open Models},
  author = {TII Team},
  month = {December},
  year = {2024}
}
Base model: tiiuae/Falcon3-1B-Base