Kimi-Linear-48B-Instruct-GGUF

Kimi Linear: An Expressive, Efficient Attention Architecture

I am currently looking for open positions! 🤗 If you find this model useful or are looking for a talented AI/LLM Engineer, please reach out to me on LinkedIn: Aaryan Kapoor.

Experimental Build Required 🚧

This model uses the Kimi Delta Attention (KDA) architecture, which is not yet supported in the main branch of llama.cpp.

To run this GGUF, you must compile llama.cpp from PR #18381. Attempting to run this on a standard build will result in errors.

Some test prompts :)

Description

This repository contains experimental GGUF format model files for Moonshot AI's Kimi Linear 48B.

Kimi Linear is a hybrid linear attention architecture designed to outperform traditional full attention in long-context and scaling regimes. It combines Kimi Delta Attention (KDA) with full attention (MLA) layers in a 3:1 KDA-to-MLA ratio, reducing KV cache memory usage and boosting decoding throughput by up to 6x on long sequences.

Performance & Architecture

This model is currently quantized to Q2_K (and other levels) so that the architecture's correctness can be tested on consumer hardware. Despite the aggressive quantization, initial tests show the logic and reasoning capabilities remain intact.

Key specifications:

  • Architecture: Hybrid Linear Attention (MoE + MLA + KDA)
  • Context length: 1M tokens (supported by the architecture)
  • Parameters: 48B total / 3B activated
  • Throughput: ~6.3x faster TPOT (time per output token) than MLA at 1M context
  • MMLU-Pro: 51.0 (4k context)
  • RULER: 84.3 (128k context, Pareto-optimal)

How to Run (llama.cpp)

Prerequisite: You must clone and build the specific PR branch:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/18381/head:kimi-linear-support
git checkout kimi-linear-support
cmake -B build
cmake --build build --config Release -j

Note: the CMake build places the binaries in build/bin/ (e.g. build/bin/llama-cli and build/bin/llama-server). Either adjust the paths in the commands below or copy the binaries into the repository root; the old Makefile build (make -j) is no longer supported by llama.cpp.
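
To sanity-check the build, you can run a short one-off generation directly against the GGUF (a minimal sketch; -no-cnv, available in recent llama.cpp builds, disables interactive chat mode so the command exits after generating):

./build/bin/llama-cli -m Kimi-Linear-48B-Instruct.Q2_K.gguf -p "Hello" -n 16 -no-cnv

If the checkout does not include KDA support, the model will fail at load time.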

1. CLI Inference (Interactive Chat)

# -n: max tokens to generate; -c: context window (the architecture supports up to 1M)
# --temp / --top-p: recommended sampling settings; -ngl 99: offload all layers to the GPU
./llama-cli -m Kimi-Linear-48B-Instruct.Q2_K.gguf \
  -n 2048 \
  -c 8192 \
  --temp 0.8 \
  --top-p 0.9 \
  -ngl 99 \
  -p "<|im_start|>user\nHello, who are you?<|im_end|>\n<|im_start|>assistant\n" \
  -cnv

Note: The current GGUF implementation successfully mitigates previous "state collapse" issues found in early development.

2. Server Mode (API)

Running a persistent server is recommended for a model of this size, since it avoids reloading the weights for every session.

./llama-server -m Kimi-Linear-48B-Instruct.Q2_K.gguf \
  --port 8080 \
  -ngl 99 \
  -c 8192 \
  --alias kimi
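
The server exposes an OpenAI-compatible HTTP API (plus a built-in web UI at http://localhost:8080). A minimal sketch of a chat request against the /v1/chat/completions endpoint; the model name "kimi" matches the --alias above, and the sampling values mirror the recommended defaults below:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kimi",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "temperature": 0.8,
    "top_p": 0.9,
    "max_tokens": 512
  }'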

Hardware Requirements

  • Full GPU Offloading (-ngl 99):
    • Q4_K_M: Requires ~28GB VRAM (e.g., A100 or A6000) or unified memory (e.g., Mac Studio M2/M3 Max).
    • Q2_K: Requires ~16-18GB VRAM (Fits on RTX 3090 / 4090).
  • Split Offloading:
    • If you have less VRAM (e.g., 12GB), use -ngl with a lower number (e.g., -ngl 20) to split layers between GPU and CPU RAM.
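
A minimal sketch of a partial-offload run, assuming roughly 12GB of VRAM and the same Q2_K file; tune -ngl (and, if needed, -c, which also affects memory use) until the model fits:

./llama-cli -m Kimi-Linear-48B-Instruct.Q2_K.gguf \
  -ngl 20 \
  -c 4096 \
  --temp 0.8 \
  --top-p 0.9 \
  -cnv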

Default Settings

  • temperature: 0.8
  • top-p: 0.9
  • repeat-penalty: 1.05 (Optional, if repetition occurs)

CLI Example

./llama-cli -m Kimi-Linear-48B-Instruct.Q2_K.gguf \
  -c 8192 \
  --temp 0.8 \
  --top-p 0.9 \
  -p "<|im_start|>user\nWrite a Python script to calculate Fibonacci numbers.<|im_end|>\n<|im_start|>assistant\n" \
  -cnv