# Kimi-Linear-48B-Instruct-GGUF

**Kimi Linear: An Expressive, Efficient Attention Architecture**
I am currently looking for open positions! If you find this model useful or are looking for a talented AI/LLM Engineer, please reach out to me on LinkedIn: Aaryan Kapoor.
> **Experimental Build Required:** This model utilizes the Kimi Delta Attention (KDA) architecture, which is not yet supported in the main branch of `llama.cpp`. To run this GGUF, you must compile `llama.cpp` from PR #18381. Attempting to run it on a standard build will result in errors.
## Description
This repository contains experimental GGUF format model files for Moonshot AI's Kimi Linear 48B.
Kimi Linear is a hybrid linear attention architecture designed to outperform traditional full attention methods in long-context and scaling regimes. It uses Kimi Delta Attention (KDA) and a hybrid architecture (3:1 KDA-to-MLA ratio) to reduce memory usage and boost throughput by up to 6x on long sequences.
## Performance & Architecture

This model is currently quantized to Q2_K (and other levels) to fit on consumer hardware while the architecture's correctness is being tested. Despite the aggressive quantization, initial tests show that logic and reasoning capabilities remain intact.
| Feature | Kimi Linear Specification |
|---|---|
| Architecture | Hybrid Linear Attention (MoE + MLA + KDA) |
| Context Length | 1M Tokens (Supported by architecture) |
| Params | 48B Total / 3B Activated |
| Throughput | ~6.3x faster TPOT (time per output token) than MLA at 1M context |
| MMLU-Pro | 51.0 (4k context) |
| RULER | 84.3 (128k context, Pareto-optimal) |
## How to Run (llama.cpp)
Prerequisite: you must clone and build the specific PR branch. Current `llama.cpp` builds with CMake (the legacy Makefile build is deprecated), and the binaries land in `build/bin/`:

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/18381/head:kimi-linear-support
git checkout kimi-linear-support
cmake -B build
cmake --build build --config Release -j
```
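If you have not downloaded the weights yet, `huggingface-cli` can fetch a single quant file. A minimal sketch, assuming the filename matches the one used in the commands below (verify it against the repository's file list first):

```bash
# Sketch: download only the Q2_K quant from this repo.
# The exact filename is an assumption; check the repo page if the download 404s.
pip install -U "huggingface_hub[cli]"
huggingface-cli download AaryanK/Kimi-Linear-48B-A3B-Instruct-GGUF \
  Kimi-Linear-48B-Instruct.Q2_K.gguf \
  --local-dir .
```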
### 1. CLI Inference (Interactive Chat)
```bash
# -n: generation limit; -c: context window (the model supports up to 1M);
# --temp 0.8 / --top-p 0.9: recommended sampling; -ngl 99: offload all layers to the GPU.
./build/bin/llama-cli -m Kimi-Linear-48B-Instruct.Q2_K.gguf \
  -n 2048 \
  -c 8192 \
  --temp 0.8 \
  --top-p 0.9 \
  -ngl 99 \
  -p "<|im_start|>user\nHello, who are you?<|im_end|>\n<|im_start|>assistant\n" \
  -cnv
```
Note: The current GGUF implementation successfully mitigates previous "state collapse" issues found in early development.
### 2. Server Mode (API)

Running a persistent server is recommended for a model of this size, since it avoids repeated load times.
```bash
./build/bin/llama-server -m Kimi-Linear-48B-Instruct.Q2_K.gguf \
  --port 8080 \
  -ngl 99 \
  -c 8192 \
  --alias kimi
```
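`llama-server` exposes an OpenAI-compatible chat endpoint, so once the server is up you can query it with `curl` (or any OpenAI client). A minimal sketch, assuming the command above with its default host and port:

```bash
# Query the OpenAI-compatible endpoint exposed by llama-server.
# "kimi" matches the --alias set above; adjust host/port if you changed them.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "kimi",
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "temperature": 0.8,
        "top_p": 0.9,
        "max_tokens": 512
      }'
```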
## Hardware Requirements

- **Full GPU offloading (`-ngl 99`):**
  - Q4_K_M: requires ~28 GB VRAM (e.g., A100, A6000, or Mac Studio M2/M3 Max).
  - Q2_K: requires ~16-18 GB VRAM (fits on an RTX 3090 / 4090).
- **Split offloading:**
  - If you have less VRAM (e.g., 12 GB), use `-ngl` with a lower number (e.g., `-ngl 20`) to split layers between GPU and CPU RAM; see the sketch below.
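For the partial-offload case, a sketch of what the invocation might look like on a ~12 GB card. The right `-ngl` value depends on your GPU, quant level, and context size, so treat the number below as a starting point:

```bash
# Sketch: offload roughly 20 layers to the GPU and keep the rest in CPU RAM.
# Lower -ngl if you hit out-of-memory errors; raise it if VRAM headroom remains.
./build/bin/llama-cli -m Kimi-Linear-48B-Instruct.Q2_K.gguf \
  -c 4096 \
  -ngl 20 \
  -cnv
```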
## Default Settings

- temperature: `0.8`
- top-p: `0.9`
- repeat-penalty: `1.05` (optional, if repetition occurs)
## CLI Example

```bash
./build/bin/llama-cli -m Kimi-Linear-48B-Instruct.Q2_K.gguf \
  -c 8192 \
  --temp 0.8 \
  --top-p 0.9 \
  -p "<|im_start|>user\nWrite a Python script to calculate Fibonacci numbers.<|im_end|>\n<|im_start|>assistant\n" \
  -cnv
```
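The same defaults can also be sent through `llama-server`'s native `/completion` endpoint, which accepts the sampling fields directly, including `repeat_penalty`. A minimal sketch, assuming the server from section 2 is running on port 8080:

```bash
# Sketch: native llama-server completion request using the defaults above,
# including the optional repeat_penalty.
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "<|im_start|>user\nWrite a Python script to calculate Fibonacci numbers.<|im_end|>\n<|im_start|>assistant\n",
        "n_predict": 512,
        "temperature": 0.8,
        "top_p": 0.9,
        "repeat_penalty": 1.05
      }'
```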
Base model: [moonshotai/Kimi-Linear-48B-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct)