Falcon3-1B-Instruct-1.58bit-vlut-gguf

This repository contains state-of-the-art ternary-packed versions of Falcon3-1B-Instruct-1.58bit in GGUF format, optimized for efficient on-device inference using the Vec-LUT method.
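
The core idea behind table-lookup inference is that, because every weight is ternary, the dot product of a small group of weights with the matching activations can be read out of a precomputed table instead of being multiplied out. The NumPy sketch below is a toy illustration of that general idea only; the group size G, the table layout, and the lut_dot helper are illustrative assumptions, not the actual vectorized kernels used by vlut.cpp.

# Toy illustration of LUT-based ternary matmul (not the actual vlut.cpp kernels).
import itertools
import numpy as np

G = 4  # weights per lookup group (illustrative choice)
# All 3^G possible ternary weight patterns, one row of {-1, 0, +1} per pattern.
PATTERNS = np.array(list(itertools.product((-1, 0, 1), repeat=G)), dtype=np.int8)

def lut_dot(w_idx: np.ndarray, x: np.ndarray) -> float:
    """Dot product of a ternary weight row (stored as per-group pattern indices)
    with activations x, computed by table lookup instead of multiplication."""
    x_groups = x.reshape(-1, G)                      # (n/G, G) activation groups
    tables = x_groups @ PATTERNS.T.astype(x.dtype)   # (n/G, 3^G) partial sums for every pattern
    return float(tables[np.arange(len(w_idx)), w_idx].sum())

# Sanity check against a dense dot product.
rng = np.random.default_rng(0)
w_idx = rng.integers(0, 3**G, size=8)                # 8 groups -> 32 ternary weights
x = rng.standard_normal(8 * G).astype(np.float32)
w_dense = PATTERNS[w_idx].reshape(-1).astype(np.float32)
assert np.isclose(lut_dot(w_idx, x), w_dense @ x, atol=1e-4)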

Key Features

  • 🎯 SOTA Compression: Achieves as low as 1.60 bits per weight (BPW) through lossless sub-2-bit ternary packing (a rough packing sketch follows this list).
  • ⚡ SOTA Performance: Delivers a 4.2x throughput speedup in parallel inference scenarios via vector table lookup (LUT).
  • 🔌 Drop-in Ready: Seamless integration with vlut.cpp for immediate deployment on edge devices.
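
As a rough illustration of how sub-2-bit packing can reach 1.60 BPW (an assumption about the layout, not necessarily the exact scheme vlut.cpp uses): a ternary weight carries log2(3) ≈ 1.585 bits of information, and five ternary values fit into a single byte because 3^5 = 243 ≤ 256, which gives 8 / 5 = 1.60 bits per weight.

# Illustrative only: one way to reach 1.60 BPW by packing 5 ternary weights per byte.
import math

ternary_entropy = math.log2(3)   # ~1.585 bits of information per ternary weight
assert 3 ** 5 <= 256             # 5 ternary digits fit in one byte (243 <= 256)
bpw = 8 / 5                      # 1.60 bits per weight after packing
print(ternary_entropy, bpw)      # 1.5849..., 1.6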

Available Model Variants

Models are named ggml-model-{PACKING}_{TILE}.gguf, where the {TILE} suffix is omitted for tile size 1:

| File Name | Packing (BPW) | Tile Size | Comment |
|---|---|---|---|
| ggml-model-I1_V.gguf | I1_V (1.60) | 1 | |
| ggml-model-I1_V_2.gguf | I1_V (1.60) | 2 | Recommended |
| ggml-model-I2_V.gguf | I2_V (2.00) | 1 | |
| ggml-model-I2_V_4.gguf | I2_V (2.00) | 4 | Recommended |
| ggml-model-I2_V_8.gguf | I2_V (2.00) | 8 | |

Selection Guide

  • BPW vs. Speed: I1_V achieves lower memory usage but may not always outperform I2_V in speed.
  • Tiling Trade-off: Tiled variants (tile size > 1) deliver higher throughput but require larger cache capacity.
  • Starting Point: I1_V_2 and I2_V_4 are good defaults; a rough memory estimate follows this list.
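
For a quick sense of the memory trade-off, weight memory scales roughly as parameter count × BPW / 8. The helper below is a back-of-the-envelope sketch: the parameter count is an assumed round figure, and real GGUF files also contain non-ternary tensors (embeddings, norms) plus metadata, so the actual file sizes will be larger.

# Back-of-the-envelope weight-memory estimate; treat the result as a lower bound.
def approx_weight_mib(n_params: float, bpw: float) -> float:
    return n_params * bpw / 8 / 1024**2

n = 1.7e9  # rough parameter count of Falcon3-1B-Instruct (assumption)
for name, bpw in [("I1_V", 1.60), ("I2_V", 2.00)]:
    print(f"{name}: ~{approx_weight_mib(n, bpw):.0f} MiB of packed weights")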

For detailed tiling parameter analysis, see Evaluation.md and the paper.

Usage

Prerequisites

Clone and build vlut.cpp (these models require vlut.cpp, not vanilla llama.cpp):

git clone https://github.com/Cipherxzc/vlut.cpp.git
cd vlut.cpp
cmake -B build && cmake --build build --config Release -j4

Download & Run

# Download the recommended variant, e.g., I2_V_4
hf download <repo_id> \
  ggml-model-I2_V_4.gguf --local-dir ./models

# Run parallel inference
./build/bin/llama-batched \
  -m ./models/ggml-model-I2_V_4.gguf \
  -p "I believe the meaning of life is" \
  -np 32 -n 16 -t 1 --temp 0.5 --repeat-penalty 1.5

# Benchmark performance
./build/bin/llama-bench \
  -m ./models/ggml-model-I2_V_4.gguf \
  -t 1 -p 128 -n 0
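
If you prefer to script the download, the huggingface_hub Python API works as well (the repo_id is left as a placeholder, matching the shell command above). Note that the Vec-LUT speedups target batched workloads, which is what the -np flag above controls (number of parallel sequences).

# Alternative to the CLI download above, using the huggingface_hub Python API.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="<repo_id>",                  # same placeholder as in the shell command
    filename="ggml-model-I2_V_4.gguf",
    local_dir="./models",
)
print(path)  # local path to the downloaded GGUF file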

For comprehensive usage instructions, refer to the vlut.cpp Quick Start Guide.

Citation

If you use these models, please cite our paper:

@article{li2025veclut,
  title={Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices},
  author={Li, Xiangyu and Yin, Chengyu and Wang, Weijun and Wei, Jianyu and Cao, Ting and Liu, Yunxin},
  journal={arXiv preprint arXiv:2512.06443},
  year={2025},
  url={https://arxiv.org/abs/2512.06443}
}

And the original Falcon3 work:

@misc{Falcon3,
    title = {The Falcon 3 family of Open Models},
    author = {TII Team},
    month = {December},
    year = {2024}
}