---
license: apache-2.0
base_model:
- BasedBase/Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2-Fp32
tags:
- 4Bit
- MLX
- MXFP4
library_name: mlx
---
# BasedBase-Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2 - MLX MXFP4 Quantization

A massive and gentlemanly thank you to the original author **[BasedBase](https://huggingface.co/BasedBase)** for creating this incredible model. This is an MXFP4-quantized version of the original [Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2-Fp32](https://huggingface.co/BasedBase/Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2-Fp32) model, optimized for Apple Silicon with MLX.

All of my additions and modifications are detailed below. The original, highly-detailed model card from `BasedBase` can be found further down this page.

---

## My Contributions & Modifications

### MLX Quantization

This version of the model has been quantized to **MXFP4 precision** using the MLX framework, making it incredibly efficient to run on Apple Silicon devices.

- **Framework:** MLX
- **Quantization:** MXFP4
- **Performance:** Blazing fast! In my limited testing on an M4 Pro Mac, you can expect **70-90 tokens per second** (see the loading sketch below if you want to run it with `mlx-lm` directly).
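If you'd rather skip LM Studio entirely, here's a minimal sketch of loading the model with the `mlx-lm` package. The repo path below is a placeholder; substitute this repository's actual Hugging Face ID.

```python
# pip install mlx-lm
from mlx_lm import load, generate

# placeholder path: substitute this repo's actual Hugging Face ID
model, tokenizer = load("your-username/Qwen3-Coder-30B-A3B-MXFP4-MLX")

messages = [{"role": "user", "content": "Write a binary search in Python."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
```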

### LM Studio Configuration & A Little Hackery...

To get this model purring perfectly with tool-calling in LM Studio, a little creative problem-solving was required.

> I'm not a big Qwen guy, so I re-used a prompt template I knew worked with my last Gemma 3 MLX quant and I adapted it. Hey, if it works, it works! 😉

This workaround involved modifying the `.jinja` prompt template to ensure native tool-calling compatibility. Because of this, a few extra steps are needed for optimal performance:

- **Additional Stop Strings:** Custom stop strings are necessary to prevent the model from generating unwanted text.
- **Reinforcing System Prompt:** A specific system prompt helps guide the model's behavior.

To make your life easier, I've included an **LM Studio preset** (`.preset.json` file) in this repository. This preset includes the correct stop strings and a well-tuned sampling/generation configuration. Just load it up, and you're good to go!
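For anyone wiring this up outside the preset, the mechanic behind the extra stop strings is simple: watch the output for any stop string and clip generation there. A minimal sketch below; the strings shown are the standard ChatML turn-end markers and are illustrative only, since the authoritative list lives in the bundled `.preset.json`.

```python
# illustrative only: the authoritative stop strings are in the bundled .preset.json
STOP_STRINGS = ["<|im_end|>", "</tool_call>"]

def truncate_at_stop(text: str, stop_strings=STOP_STRINGS) -> str:
    """Clip generated text at the earliest occurrence of any stop string."""
    cut = len(text)
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

print(truncate_at_stop("def add(a, b):\n    return a + b\n<|im_end|>junk"))
```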

---

## Original Model Card from BasedBase

*(The following is the original information provided by the model's creator.)*

### Model Description

This model is a distilled version of **`Qwen/Qwen3-Coder-30B-A3B-Instruct`** designed to achieve coding and reasoning capabilities approaching those of a much larger teacher model.

It is the result of applying a LoRA built via an SVD distillation pipeline and then merging those weights into the base model. The core of this process was to transfer the nuanced knowledge of a **62-layer, 160-expert teacher model** into the more efficient **48-layer, 128-expert architecture** of the `Qwen3-Coder-30B-A3B` student model.

The primary goal was to significantly enhance performance on **complex coding tasks**, where the specialized knowledge of Mixture-of-Experts (MoE) layers is critical.

### The Distillation Methodology

This model was not trained in a conventional sense. Instead, it was created with a layer-by-layer, SVD-based distillation pipeline designed to ensure maximum precision and knowledge transfer.

#### Core Components

*   **Teacher Model:** `Qwen/Qwen3-Coder-480B-A35B-Instruct`.
*   **Student Model:** `Qwen/Qwen3-Coder-30B-A3B-Instruct`.
*   **LoRA Rank:** A high rank of **`r=2048`** was used for all modules to capture a very high degree of information from the teacher (a minimal merge sketch follows this list).
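For reference, merging a LoRA of this rank back into a base weight is plain algebra. A sketch below; the scaling factor `alpha` is an assumption, since the card doesn't state it.

```python
import numpy as np

def merge_lora(w_base, lora_A, lora_B, r=2048, alpha=2048.0):
    """Standard LoRA merge: W = W0 + (alpha / r) * B @ A.

    lora_A: (r, in_features), lora_B: (out_features, r).
    alpha is assumed equal to r here (i.e. scale 1.0); the card doesn't say.
    """
    return w_base + (alpha / r) * (lora_B @ lora_A)
```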

#### The Distillation Pipeline

For each corresponding layer in the student and teacher, the following pipeline was executed (a compressed sketch of the whole chain follows the list):

1.  **Spherical Linear Interpolation (SLERP):** For layers that fall between two teacher layers, SLERP was used to create a smooth, geometrically sound interpolation of the teacher's weights. This avoids the pitfalls of simple linear averaging.

2.  **Singular Value Decomposition (SVD) Projection:** The core of the distillation. The (potentially blended) teacher layer's weight matrix was decomposed into its fundamental components (`U`, `S`, `V`). The **top 2048** most important components were selected and then reconstructed to fit the student layer's smaller dimensions. This high-rank projection ensures maximum fidelity.

3.  **Procrustes Analysis:** After projection, the newly created "synthetic" tensor was optimally rotated in high-dimensional space to perfectly align with the student's original pre-trained tensor. This minimizes the "distance" between them before calculating the difference.

4.  **DARE (Drop and Rescale):** The difference tensor (`Distilled - Aligned Student`) was then purified using DARE. This process drops a significant percentage of the lowest-magnitude values (noise) and rescales the remaining important differences, creating a clean signal for the final LoRA.
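To make the four steps concrete, here is a compressed numpy sketch of the chain under stated assumptions: the interpolation weight `t`, the DARE drop rate, and the naive crop-to-student-shape projection are all guesses, and the real pipeline surely handles shapes and numerics far more carefully.

```python
import numpy as np

def slerp(w_a, w_b, t=0.5):
    """Step 1: spherical interpolation between two teacher weight tensors."""
    a, b = w_a.ravel(), w_b.ravel()
    omega = np.arccos(np.clip(
        np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)), -1.0, 1.0))
    if omega < 1e-6:  # nearly parallel: fall back to linear interpolation
        out = (1 - t) * a + t * b
    else:
        out = (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)
    return out.reshape(w_a.shape)

def svd_project(teacher_w, student_shape, rank=2048):
    """Step 2: truncated SVD of the teacher weight, rebuilt at the student's size.
    (Naive crop to the student's dimensions; the real projection is surely smarter.)"""
    U, S, Vt = np.linalg.svd(teacher_w, full_matrices=False)
    r = min(rank, S.shape[0])
    rows, cols = student_shape
    return (U[:rows, :r] * S[:r]) @ Vt[:r, :cols]

def procrustes_align(synthetic, student_w):
    """Step 3: orthogonal Procrustes -- rotate `synthetic` to best match `student_w`."""
    U, _, Vt = np.linalg.svd(synthetic.T @ student_w)
    return synthetic @ (U @ Vt)

def dare(delta, drop_rate=0.9):
    """Step 4: zero the lowest-magnitude fraction of the delta, rescale the rest."""
    k = int(delta.size * drop_rate)
    thresh = np.partition(np.abs(delta).ravel(), k)[k] if k < delta.size else np.inf
    mask = np.abs(delta) >= thresh
    return (delta * mask) / (1.0 - drop_rate)

# per layer: blend -> project -> align -> diff -> purify (the delta feeds the LoRA);
# teacher_a, teacher_b, student_w are hypothetical per-layer weight matrices
teacher_a, teacher_b = np.random.randn(2, 512, 256)
student_w = np.random.randn(256, 128)
blended = slerp(teacher_a, teacher_b, t=0.5)
delta = dare(procrustes_align(svd_project(blended, student_w.shape), student_w) - student_w)
```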

#### Mixture-of-Experts (MoE) Distillation

The standout feature of this process is the full distillation of the MoE layers, which are critical for complex reasoning.

*   **Expert Fingerprinting & Clustering:** To map the 160 teacher experts onto the 128 student experts, each teacher expert was "fingerprinted." **K-Means clustering** then grouped the 160 fingerprints into 128 distinct clusters (a clustering sketch follows this list).
*   **Expert-to-Expert Distillation:** Each of the student's 128 experts was then distilled from a weighted blend of the teacher experts assigned to its cluster. This ensures the specialized knowledge (e.g., recursion, API usage, security patterns) is transferred.
*   **Router Gate Distillation:** The main MoE router gate, which decides which expert to use for a given token, was also distilled to preserve the teacher's intelligent routing logic.
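A rough sketch of the expert mapping, assuming a simple fingerprint (the top singular values of each expert's weight; the card doesn't specify the actual signature) and a hypothetical distance-based blend weighting:

```python
import numpy as np
from sklearn.cluster import KMeans

def fingerprint(expert_w, k=32):
    """Hypothetical fingerprint: the top-k singular values of an expert's weight."""
    return np.linalg.svd(expert_w, compute_uv=False)[:k]

# teacher_experts: hypothetical stand-in for the 160 per-expert weight matrices
teacher_experts = [np.random.randn(256, 128) for _ in range(160)]
fps = np.stack([fingerprint(w) for w in teacher_experts])  # (160, 32)

km = KMeans(n_clusters=128, n_init=10, random_state=0).fit(fps)

# each student expert becomes a distance-weighted blend of its cluster's teachers
student_experts = []
for c in range(128):
    members = np.where(km.labels_ == c)[0]
    d = np.linalg.norm(fps[members] - km.cluster_centers_[c], axis=1)
    w = np.exp(-d)
    w /= w.sum()  # closer to the centroid => heavier weight in the blend
    student_experts.append(sum(wi * teacher_experts[i] for wi, i in zip(w, members)))
```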

### Intended Use

This model is intended for **code generation**. It should outperform the base model on tasks that require understanding complex logic, algorithms, and software architecture.

*   **Primary Use:** Code generation, refactoring, code explanation (though, as an instruct-tuned model, it may not be ideal for explaining things), and debugging.
*   **Out of Scope:** This is not a general-purpose conversational chatbot. While it can follow instructions, its knowledge is specialized for programming tasks.