# qwen35-9b-opus46-mix-i1-GGUF
This repository contains GGUF exports of a Qwen 3.5 9B finetune based on `unsloth/Qwen3.5-9B`, prepared for local inference with `llama.cpp` and compatible runtimes such as LM Studio, Jan, and Open WebUI backends.
The naming is intentional:

- `qwen35-9b` = base model family
- `opus46` = primary training source was `nohurry/Opus-4.6-Reasoning-3000x-filtered`
- `mix` = extra training data was blended in beyond the primary source
- `i1` = imatrix was used during GGUF quantization
## Model Summary

- Base model: `unsloth/Qwen3.5-9B`
- Format: GGUF
- Intended runtimes: `llama.cpp` and compatible local UIs
- Quantizations in this repo: `Q4_K_M`, `Q8_0`
- Main goal: stronger local reasoning behavior and more structured assistant outputs than a plain stock GGUF export
## Training / Build Notes
This release came from a full train/export pipeline rather than a direct one-step conversion.
Workflow:

- LoRA SFT with `ms-swift`
- Base model: `unsloth/Qwen3.5-9B`
- Primary dataset: `nohurry/Opus-4.6-Reasoning-3000x-filtered`
- Extra mixed datasets: `Salesforce/xlam-function-calling-60k`, `OpenAssistant/oasst2`
- Export path: LoRA merge, GGUF conversion via `llama.cpp`, imatrix-assisted quantization for the imatrix-tagged output set
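The export path above can be sketched with standard `llama.cpp` tooling. This is an illustrative reconstruction, not the exact commands used for this release; the file names and the calibration text file are placeholders.

```sh
# 1. Convert the merged HF checkpoint to a high-precision GGUF (script ships with llama.cpp).
python convert_hf_to_gguf.py ./merged-model --outfile model-f16.gguf --outtype f16

# 2. Compute an importance matrix from calibration text (file name is a placeholder).
./llama-imatrix -m model-f16.gguf -f calibration.txt -o imatrix.dat -ngl 99

# 3. Quantize using the imatrix; this is what the "i1" tag refers to.
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-Q4_K_M.gguf Q4_K_M
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-Q8_0.gguf Q8_0
```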
This is still a Qwen 3.5 model family export. The added training data changes behavior, not the underlying GGUF architecture family or tokenizer family.
## What To Expect
Relative to a plain base-model GGUF, this model was aimed more toward:
- stronger reasoning-style responses
- better structured assistant behavior
- better tool / function-call shaped outputs
- stronger instruction following
That does not guarantee it will outperform stock Qwen3.5-9B for every prompt. Treat this as a targeted finetune, not a universal replacement.
## Files
This repository contains:
- `*Q4_K_M.gguf`
- `*Q8_0.gguf`
Q4_K_M is the smaller and more practical default for many local setups.
Q8_0 is larger, but may preserve more quality if you have the RAM / VRAM budget.
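To fetch a single quant rather than the whole repository, the Hugging Face CLI can filter by file pattern. The repo id below is a placeholder; substitute this repository's actual id.

```sh
# Requires: pip install -U "huggingface_hub[cli]"
# <user>/qwen35-9b-opus46-mix-i1-GGUF is a placeholder repo id.
huggingface-cli download <user>/qwen35-9b-opus46-mix-i1-GGUF \
  --include "*Q4_K_M.gguf" \
  --local-dir ./models
```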
## Benchmark Snapshot
I ran both speed and an initial quality check on a local RTX 4090 machine.
Hardware / runtime:

- GPU: NVIDIA GeForce RTX 4090
- CPU: AMD Ryzen 9 7900X
- `llama.cpp` build commit: `6729d49`
- GPU offload: `-ngl 99`
- `llama-bench` batch settings: `n_batch=2048`, `n_ubatch=512`
| Quant | Prompt tok/s (512 prompt, 0 gen) | Prompt tok/s (1024 prompt, 0 gen) | Gen tok/s (0 prompt, 128 gen) |
|---|---|---|---|
| Q4_K_M | 9838.32 | 9748.59 | 137.57 |
| Q8_0 | 9974.62 | 9954.67 | 92.40 |
Interpretation:

- `Q4_K_M` is much faster for generation on this hardware.
- `Q8_0` is slower on decode, but may still be worth trying if you want the larger quant.
- These numbers were produced from the actual released GGUF files, not a different local build.
Initial quality snapshot:

- Task: `gsm8k`
- Setup: `lm-eval-harness` using `local-completions` against `llama-server`
- Quant: `Q4_K_M`
- Tokenizer reference: `Qwen/Qwen3-8B`
- Server context: 8192
- Concurrency: 4
- Results: `flexible-extract exact_match = 0.8415`, `strict-match exact_match = 0.8400`
This is a real full-run GSM8K result, not a limited smoke test. I have not posted a broader multi-task quality table yet.
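A setup like the one described can be reproduced roughly as follows. The model name passed to `lm_eval` is a placeholder, and the exact `model_args` accepted by `local-completions` may vary by lm-eval-harness version.

```sh
# Terminal 1: serve the quant with an OpenAI-compatible completions endpoint.
./llama-server -m ./qwen35-9b-opus46-mix-i1-Q4_K_M.gguf -c 8192 -ngl 99 --port 8080

# Terminal 2: run GSM8K against it (requires: pip install lm-eval).
lm_eval --model local-completions \
  --model_args base_url=http://127.0.0.1:8080/v1/completions,model=placeholder,tokenizer=Qwen/Qwen3-8B,num_concurrent=4 \
  --tasks gsm8k
```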
## Prompting
This is still a Qwen-family chat model, so standard chat prompting is appropriate.
Simple example:

```
You are a helpful assistant.

User: Explain what GGUF is and when to use Q4_K_M versus Q8_0.
Assistant:
```
If your runtime supports a native Qwen chat template, prefer that.
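For reference, Qwen-family chat models typically ship a ChatML-style template in the GGUF metadata. When the native template is applied, the example above is rendered roughly like this (illustrative; verify against this export's embedded template, e.g. via `llama-cli` in conversation mode):

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Explain what GGUF is and when to use Q4_K_M versus Q8_0.<|im_end|>
<|im_start|>assistant
```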
## llama.cpp Example

```sh
./llama-cli \
  -m ./qwen35-9b-opus46-mix-i1-Q4_K_M.gguf \
  -c 4096 \
  -ngl 99 \
  -p "You are a helpful assistant.\n\nUser: Explain what imatrix means in a GGUF release.\nAssistant:"
```
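If you prefer an API workflow, `llama-server` exposes an OpenAI-compatible endpoint and applies the GGUF's embedded chat template for chat completions. A minimal sketch, assuming the same file name:

```sh
# Serve the model locally.
./llama-server -m ./qwen35-9b-opus46-mix-i1-Q4_K_M.gguf -c 4096 -ngl 99 --port 8080

# Query it from another terminal.
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Explain what imatrix means in a GGUF release."}]}'
```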
## Notes
- GGUF files are for inference, not further fine-tuning.
- Quantization level, context size, prompt format, and runtime all affect quality.
- The `i1` tag indicates imatrix usage in the quantization flow. It is a quantization/build detail, not a separate base model family.
## Limitations
- Quantized models can lose quality relative to higher-precision checkpoints.
- This model can still hallucinate facts, tools, or arguments.
- Function-calling style does not guarantee correctness of produced calls.
- Verify important outputs before using them in real workflows.
## License
This release follows the license and usage constraints of the base model and the training sources used in the workflow. Review the upstream model and datasets before downstream use:
- `unsloth/Qwen3.5-9B`
- `nohurry/Opus-4.6-Reasoning-3000x-filtered`
- `Salesforce/xlam-function-calling-60k`
- `OpenAssistant/oasst2`
## Credits

- Base model: Qwen / Unsloth
- Training stack: `ms-swift`
- GGUF conversion and quantization: `llama.cpp`