Kimi-Linear-48B-Instruct-GGUF

Kimi Linear: An Expressive, Efficient Attention Architecture

I am currently looking for open positions! 🤗 If you find this model useful or are looking for a talented AI/LLM Engineer, please reach out to me on LinkedIn: Aaryan Kapoor.

Experimental Build Required 🚧

This model uses the Kimi Delta Attention (KDA) architecture, which is not yet supported in the main branch of llama.cpp.

To run this GGUF, you must compile llama.cpp from PR #18381. Attempting to run this on a standard build will result in errors.

Some test prompts :)

Description

This repository contains experimental GGUF format model files for Moonshot AI's Kimi Linear 48B.

Kimi Linear is a hybrid linear attention architecture designed to outperform traditional full attention in long-context and scaling regimes. It combines Kimi Delta Attention (KDA) with full attention (MLA) layers in a 3:1 KDA-to-MLA ratio, reducing KV cache memory usage and boosting decoding throughput by up to 6x on long sequences.

Performance & Architecture

This model is currently quantized to Q2_K (and other levels) so that the architecture's correctness can be tested on consumer hardware. Despite the aggressive quantization, initial tests show the logic and reasoning capabilities remain intact.

Key specifications:

  • Architecture: Hybrid Linear Attention (MoE + MLA + KDA)
  • Context length: 1M tokens (supported by the architecture)
  • Parameters: 48B total / 3B activated
  • Throughput: ~6.3x faster TPOT (time per output token) than MLA at 1M context
  • MMLU-Pro: 51.0 (4k context)
  • RULER: 84.3 (128k context, Pareto-optimal)

How to Run (llama.cpp)

Prerequisite: You must clone and build the specific PR branch:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/18381/head:kimi-linear-support
git checkout kimi-linear-support
cmake -B build
cmake --build build --config Release -j

Note: the CMake build places the binaries in build/bin/ (e.g. build/bin/llama-cli and build/bin/llama-server). Either adjust the paths in the commands below or copy the binaries into the repository root; the old Makefile build (make -j) is no longer supported by llama.cpp.
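
To sanity-check the build, you can run a short one-off generation directly against the GGUF (a minimal sketch; -no-cnv, available in recent llama.cpp builds, disables interactive chat mode so the command exits after generating):

./build/bin/llama-cli -m Kimi-Linear-48B-Instruct.Q2_K.gguf -p "Hello" -n 16 -no-cnv

If the checkout does not include KDA support, the model will fail at load time.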

1. CLI Inference (Interactive Chat)

# -n: max tokens to generate; -c: context window (the architecture supports up to 1M)
# --temp / --top-p: recommended sampling settings; -ngl 99: offload all layers to the GPU
./llama-cli -m Kimi-Linear-48B-Instruct.Q2_K.gguf \
  -n 2048 \
  -c 8192 \
  --temp 0.8 \
  --top-p 0.9 \
  -ngl 99 \
  -p "<|im_start|>user\nHello, who are you?<|im_end|>\n<|im_start|>assistant\n" \
  -cnv

Note: The current GGUF implementation successfully mitigates previous "state collapse" issues found in early development.

2. Server Mode (API)

Running a persistent server is recommended for a model of this size, since it avoids reloading the weights for every session.

./llama-server -m Kimi-Linear-48B-Instruct.Q2_K.gguf \
  --port 8080 \
  -ngl 99 \
  -c 8192 \
  --alias kimi
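
The server exposes an OpenAI-compatible HTTP API (plus a built-in web UI at http://localhost:8080). A minimal sketch of a chat request against the /v1/chat/completions endpoint; the model name "kimi" matches the --alias above, and the sampling values mirror the recommended defaults below:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kimi",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "temperature": 0.8,
    "top_p": 0.9,
    "max_tokens": 512
  }'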

Hardware Requirements

  • Full GPU Offloading (-ngl 99):
    • Q4_K_M: Requires ~28GB VRAM (e.g., A100 or A6000) or unified memory (e.g., Mac Studio M2/M3 Max).
    • Q2_K: Requires ~16-18GB VRAM (Fits on RTX 3090 / 4090).
  • Split Offloading:
    • If you have less VRAM (e.g., 12GB), use -ngl with a lower number (e.g., -ngl 20) to split layers between GPU and CPU RAM.
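
A minimal sketch of a partial-offload run, assuming roughly 12GB of VRAM and the same Q2_K file; tune -ngl (and, if needed, -c, which also affects memory use) until the model fits:

./llama-cli -m Kimi-Linear-48B-Instruct.Q2_K.gguf \
  -ngl 20 \
  -c 4096 \
  --temp 0.8 \
  --top-p 0.9 \
  -cnv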

Default Settings

  • temperature: 0.8
  • top-p: 0.9
  • repeat-penalty: 1.05 (Optional, if repetition occurs)

CLI Example

./llama-cli -m Kimi-Linear-48B-Instruct.Q2_K.gguf \
  -c 8192 \
  --temp 0.8 \
  --top-p 0.9 \
  -p "<|im_start|>user\nWrite a Python script to calculate Fibonacci numbers.<|im_end|>\n<|im_start|>assistant\n" \
  -cnv