---
library_name: transformers
license_link: https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507/blob/main/LICENSE
pipeline_tag: text-generation
tags:
- AWQ
- 量化修复
- vLLM
base_model:
- Kwaipilot/KAT-V1-40B
base_model_relation: quantized
---
# KAT-V1-40B-AWQ
Base model: [Kwaipilot/KAT-V1-40B](https://huggingface.co/Kwaipilot/KAT-V1-40B)

### 【vLLM Single Node with 4 GPUs Startup Command】
```
CONTEXT_LENGTH=32768

vllm serve \
    QuantTrio/KAT-V1-40B-AWQ \
    --served-model-name KAT-V1-40B-AWQ \
    --swap-space 16 \
    --max-num-seqs 512 \
    --max-model-len $CONTEXT_LENGTH \
    --max-seq-len-to-capture $CONTEXT_LENGTH \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 4 \
    --trust-remote-code \
    --disable-log-requests \
    --host 0.0.0.0 \
    --port 8000
```

### 【Dependencies】

```
vllm==0.10.0
```

### 【Model Update Date】
```
2025-07-31
1. fast commit
```

### 【Model Files】

| File Size | Last Updated |
|--------|--------------|
| `22GB` | `2025-07-31` |

### 【Model Download】

```python
from huggingface_hub import snapshot_download
snapshot_download('QuantTrio/KAT-V1-40B-AWQ', cache_dir="your_local_path")
```

### 【Overview】
# News

- Kwaipilot-AutoThink ranks first among all open-source models on [LiveCodeBench Pro](https://livecodebenchpro.com/), a challenging benchmark explicitly designed to prevent data leakage, and even surpasses strong proprietary systems such as Seed and o3-mini.

***

# Introduction

**KAT (Kwaipilot-AutoThink)** is an open-source large-language model that mitigates *over-thinking* by learning **when** to produce explicit chain-of-thought and **when** to answer directly.

![image/png](/static-proxy?url=https%3A%2F%2Fcdn-uploads.huggingface.co%2Fproduction%2Fuploads%2F61ee40a269351366e29972ad%2FzdnsvBmv6hWIC2Qxxy1fD.png)

Its development follows a concise two-stage training pipeline:

| Stage | Core Idea | Key Techniques | Outcome |
|-------|-----------|----------------|---------|
| **1. Pre-training** | Inject knowledge while separating “reasoning” from “direct answering”. | *Dual-regime data*<br>• **Think-off** queries labeled via a custom tagging system.<br>• **Think-on** queries generated by a multi-agent solver.<br>*Knowledge Distillation + Multi-Token Prediction* for fine-grained utility. | Base model attains strong factual and reasoning skills without full-scale pre-training costs. |
| **2. Post-training** | Make reasoning optional and efficient. | *Cold-start AutoThink*: majority vote sets the initial thinking mode.<br>*Step-SRPO*: intermediate supervision rewards correct mode selection and answer accuracy under that mode. | Model triggers CoT only when beneficial, reducing token use and speeding inference. |

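The cold-start step in the table says only that a majority vote sets the initial thinking mode. The sketch below illustrates that idea under stated assumptions: the vote sources (several annotator models or sampled labels), the function name `initial_think_mode`, and the tie-breaking rule (fall back to `think_on`) are all hypothetical and are not taken from the KAT training code.

```python
from collections import Counter
from typing import Sequence

# Label names follow the card's on/off terminology; everything else is illustrative.
VALID_MODES = ("think_on", "think_off")


def initial_think_mode(votes: Sequence[str], default: str = "think_on") -> str:
    """Return the majority label among the votes.

    Unknown labels are ignored. Ties (including an empty vote list) fall back
    to `default`, which is an assumption made for this sketch.
    """
    counts = Counter(v for v in votes if v in VALID_MODES)
    on, off = counts["think_on"], counts["think_off"]
    if on == off:
        return default
    return "think_on" if on > off else "think_off"


if __name__ == "__main__":
    # Three hypothetical annotators label one query.
    print(initial_think_mode(["think_off", "think_off", "think_on"]))
```

Defaulting ties to `think_on` favors answer quality over token savings; the opposite default is equally plausible if inference cost matters more.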
![image/png](/static-proxy?url=https%3A%2F%2Fcdn-uploads.huggingface.co%2Fproduction%2Fuploads%2F61ee40a269351366e29972ad%2FcwFAEh7Rl3f4FU46z8gBZ.png)

***

# Data Format

KAT produces responses in a **structured template** that makes the reasoning path explicit and machine-parsable. Two modes are supported:

![image/jpeg](/static-proxy?url=https%3A%2F%2Fcdn-uploads.huggingface.co%2Fproduction%2Fuploads%2F61ee40a269351366e29972ad%2FH8iAvQMMT02nyvlYnI5q1.jpeg)

## Special Tokens

| Token | Description |
|-------|-------------|
| `<judge>` | Analyzes the input to decide whether explicit reasoning is needed. |
| `<think_on>` / `<think_off>` | Indicates whether reasoning is **activated** (“on”) or **skipped** (“off”). |
| `<think>` | Marks the start of the chain-of-thought segment when `think_on` is chosen. |
| `<answer>` | Marks the start of the final user-facing answer. |

***

# 🔧 Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Kwaipilot/KAT-V1-40B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=65536,
    do_sample=True,  # enable sampling so that temperature and top_p take effect
    temperature=0.6,
    top_p=0.95,
)
# keep only the newly generated tokens (drop the prompt tokens)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
print("prompt:\n", prompt)
print("content:\n", content)
"""
prompt:
Give me a short introduction to large language model.
content:
The user's request is to provide a concise factual introduction to large language models, which involves retrieving and summarizing basic information. This task is straightforward as it only requires recalling and presenting well-known details without deeper analysis. No complex reasoning is needed here—just a simple explanation will suffice.

A **Large Language Model (LLM)** is an advanced AI system trained on vast amounts of text data to understand, generate, and process human-like language. Here’s a concise introduction:

### Key Points:
1. **Training**: Trained on diverse text sources (books, websites, etc.) using deep learning.
2. **Capabilities**:
   - Answer questions, generate text, summarize content, translate languages.
   - Understand context, sentiment, and nuances in language.
3. **Architecture**: Often based on **transformer models** (e.g., BERT, GPT, LLaMA).
4. **Scale**: Billions of parameters, requiring massive computational resources.
5. **Applications**: Chatbots, content creation, coding assistance, research, and more.

### Examples:
- **OpenAI’s GPT-4**: Powers ChatGPT.
- **Google’s Gemini**: Used in Bard.
- **Meta’s LLaMA**: Open-source alternative.

### Challenges:
- **Bias**: Can reflect biases in training data.
- **Accuracy**: May hallucinate "facts" not grounded in reality.
- **Ethics**: Raises concerns about misinformation and job displacement.

LLMs represent a leap forward in natural language processing, enabling machines to interact with humans in increasingly sophisticated ways.
🌐🤖
"""
```

***

# Future Releases

Looking ahead, we will publish a companion paper that fully documents the **AutoThink training framework**, covering:

* Cold-start initialization procedures
* Reinforcement-learning (Step-SRPO) strategies
* Data curation and reward design details

At the same time, we will open-source:

* **Training resources**: the curated dual-regime datasets and RL codebase
* **Model suite**: checkpoints at 1.5B, 7B, and 13B parameters, all trained with AutoThink gating
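
Returning to the structured template described under Data Format: because the output is meant to be machine-parsable, a small parser can split a raw completion into its parts. The following is a minimal sketch, not an official utility. It assumes the tags are spelled `<judge>`, `<think_on>`, `<think_off>`, `<think>`, and `<answer>`, and that sections may be closed by `</judge>`, `</think>`, and `</answer>`; the class `KATResponse` and the function `parse_kat_output` are hypothetical names introduced here.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed tag spellings; adjust if the tokenizer uses different ones.
_TAGS = "judge|think_on|think_off|think|answer"
# A section ends at any known opening or closing tag, or at end of text.
_STOP = rf"(?:</?(?:{_TAGS})>|\Z)"


@dataclass
class KATResponse:
    judge: Optional[str]       # the model's analysis of whether to reason
    think_on: Optional[bool]   # True / False, or None if no mode tag was found
    reasoning: Optional[str]   # chain-of-thought text, present only in think-on mode
    answer: Optional[str]      # final user-facing answer


def _section(text: str, tag: str) -> Optional[str]:
    """Return the stripped text that follows <tag>, up to the next known tag."""
    match = re.search(rf"<{tag}>(.*?){_STOP}", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_kat_output(text: str) -> KATResponse:
    """Split a raw completion into judge, mode, reasoning, and answer parts."""
    if "<think_on>" in text:
        mode: Optional[bool] = True
    elif "<think_off>" in text:
        mode = False
    else:
        mode = None
    return KATResponse(
        judge=_section(text, "judge"),
        think_on=mode,
        reasoning=_section(text, "think"),
        answer=_section(text, "answer"),
    )


if __name__ == "__main__":
    raw = (
        "<judge>\nSimple factual request.\n</judge>\n\n"
        "<think_off>\n<answer>\nAn LLM is a model.\n</answer>"
    )
    print(parse_kat_output(raw))
```

If the tags are registered as special tokens in the tokenizer, decoding with `skip_special_tokens=True` may strip them before the parser sees them; decoding with `skip_special_tokens=False` would then be needed, at the cost of also keeping other special tokens such as end-of-sequence markers.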