# qwen35-9b-opus46-mix-i1-GGUF
This repository contains GGUF exports of a Qwen 3.5 9B finetune based on `unsloth/Qwen3.5-9B`, prepared for local inference with `llama.cpp` and compatible runtimes such as LM Studio, Jan, and Open WebUI backends.
The naming is intentional:

- `qwen35-9b` = base model family
- `opus46` = primary training source was `nohurry/Opus-4.6-Reasoning-3000x-filtered`
- `mix` = extra training data was blended in beyond the primary source
- `i1` = imatrix was used during GGUF quantization
## Model Summary

- Base model: `unsloth/Qwen3.5-9B`
- Format: GGUF
- Intended runtimes: `llama.cpp` and compatible local UIs
- Quantizations in this repo: `Q4_K_M`, `Q8_0`
- Main goal: stronger local reasoning behavior and more structured assistant outputs than a plain stock GGUF export
## Training / Build Notes
This release came from a full train/export pipeline rather than a direct one-step conversion.
Workflow:

- LoRA SFT with `ms-swift`
- Base model: `unsloth/Qwen3.5-9B`
- Primary dataset: `nohurry/Opus-4.6-Reasoning-3000x-filtered`
- Extra mixed datasets: `Salesforce/xlam-function-calling-60k`, `OpenAssistant/oasst2`
- Export path: LoRA merge, GGUF conversion via `llama.cpp`, imatrix-assisted quantization for the imatrix-tagged output set
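The export path above can be sketched with standard `llama.cpp` tooling. This is an illustrative reconstruction, not the exact commands used for this release; the file names and the calibration text file are placeholders.

```sh
# 1. Convert the merged HF checkpoint to a high-precision GGUF (script ships with llama.cpp).
python convert_hf_to_gguf.py ./merged-model --outfile model-f16.gguf --outtype f16

# 2. Compute an importance matrix from calibration text (file name is a placeholder).
./llama-imatrix -m model-f16.gguf -f calibration.txt -o imatrix.dat -ngl 99

# 3. Quantize using the imatrix; this is what the "i1" tag refers to.
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-Q4_K_M.gguf Q4_K_M
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-Q8_0.gguf Q8_0
```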
This is still a Qwen 3.5 model family export. The added training data changes behavior, not the underlying GGUF architecture family or tokenizer family.
## What To Expect
Relative to a plain base-model GGUF, this model was aimed more toward:
- stronger reasoning-style responses
- better structured assistant behavior
- better tool / function-call shaped outputs
- stronger instruction following
That does not guarantee it will outperform stock Qwen3.5-9B for every prompt. Treat this as a targeted finetune, not a universal replacement.
## Files
This repository contains:
- `*Q4_K_M.gguf`
- `*Q8_0.gguf`
Q4_K_M is the smaller and more practical default for many local setups.
Q8_0 is larger, but may preserve more quality if you have the RAM / VRAM budget.
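To fetch a single quant rather than the whole repository, the Hugging Face CLI can filter by file pattern. The repo id below is a placeholder; substitute this repository's actual id.

```sh
# Requires: pip install -U "huggingface_hub[cli]"
# <user>/qwen35-9b-opus46-mix-i1-GGUF is a placeholder repo id.
huggingface-cli download <user>/qwen35-9b-opus46-mix-i1-GGUF \
  --include "*Q4_K_M.gguf" \
  --local-dir ./models
```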
## Benchmark Snapshot
I ran both speed and an initial quality check on a local RTX 4090 machine.
Hardware / runtime:

- GPU: NVIDIA GeForce RTX 4090
- CPU: AMD Ryzen 9 7900X
- `llama.cpp` build commit: `6729d49`
- GPU offload: `-ngl 99`
- `llama-bench` batch settings: `n_batch=2048`, `n_ubatch=512`
| Quant | Prompt tok/s (512 prompt, 0 gen) | Prompt tok/s (1024 prompt, 0 gen) | Gen tok/s (0 prompt, 128 gen) |
|---|---|---|---|
| Q4_K_M | 9838.32 | 9748.59 | 137.57 |
| Q8_0 | 9974.62 | 9954.67 | 92.40 |
Interpretation:

- `Q4_K_M` is much faster for generation on this hardware.
- `Q8_0` is slower on decode, but may still be worth trying if you want the larger quant.
- These numbers were produced from the actual released GGUF files, not a different local build.
Initial quality snapshot:

- Task: `gsm8k`
- Setup: `lm-eval-harness` using `local-completions` against `llama-server`
- Quant: `Q4_K_M`
- Tokenizer reference: `Qwen/Qwen3-8B`
- Server context: 8192
- Concurrency: 4
- Results: `flexible-extract exact_match = 0.8415`, `strict-match exact_match = 0.8400`
This is a real full-run GSM8K result, not a limited smoke test. I have not posted a broader multi-task quality table yet.
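A setup like the one described can be reproduced roughly as follows. The model name passed to `lm_eval` is a placeholder, and the exact `model_args` accepted by `local-completions` may vary by lm-eval-harness version.

```sh
# Terminal 1: serve the quant with an OpenAI-compatible completions endpoint.
./llama-server -m ./qwen35-9b-opus46-mix-i1-Q4_K_M.gguf -c 8192 -ngl 99 --port 8080

# Terminal 2: run GSM8K against it (requires: pip install lm-eval).
lm_eval --model local-completions \
  --model_args base_url=http://127.0.0.1:8080/v1/completions,model=placeholder,tokenizer=Qwen/Qwen3-8B,num_concurrent=4 \
  --tasks gsm8k
```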
## Prompting
This is still a Qwen-family chat model, so standard chat prompting is appropriate.
Simple example:

```
You are a helpful assistant.

User: Explain what GGUF is and when to use Q4_K_M versus Q8_0.
Assistant:
```
If your runtime supports a native Qwen chat template, prefer that.
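For reference, Qwen-family chat models typically ship a ChatML-style template in the GGUF metadata. When the native template is applied, the example above is rendered roughly like this (illustrative; verify against this export's embedded template, e.g. via `llama-cli` in conversation mode):

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Explain what GGUF is and when to use Q4_K_M versus Q8_0.<|im_end|>
<|im_start|>assistant
```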
## llama.cpp Example

```sh
./llama-cli \
  -m ./qwen35-9b-opus46-mix-i1-Q4_K_M.gguf \
  -c 4096 \
  -ngl 99 \
  -p "You are a helpful assistant.\n\nUser: Explain what imatrix means in a GGUF release.\nAssistant:"
```
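If you prefer an API workflow, `llama-server` exposes an OpenAI-compatible endpoint and applies the GGUF's embedded chat template for chat completions. A minimal sketch, assuming the same file name:

```sh
# Serve the model locally.
./llama-server -m ./qwen35-9b-opus46-mix-i1-Q4_K_M.gguf -c 4096 -ngl 99 --port 8080

# Query it from another terminal.
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Explain what imatrix means in a GGUF release."}]}'
```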
## Notes
- GGUF files are for inference, not further fine-tuning.
- Quantization level, context size, prompt format, and runtime all affect quality.
- The `i1` tag indicates imatrix usage in the quantization flow. It is a quantization/build detail, not a separate base model family.
## Limitations
- Quantized models can lose quality relative to higher-precision checkpoints.
- This model can still hallucinate facts, tools, or arguments.
- Function-calling style does not guarantee correctness of produced calls.
- Verify important outputs before using them in real workflows.
## License
This release follows the license and usage constraints of the base model and the training sources used in the workflow. Review the upstream model and datasets before downstream use:
- `unsloth/Qwen3.5-9B`
- `nohurry/Opus-4.6-Reasoning-3000x-filtered`
- `Salesforce/xlam-function-calling-60k`
- `OpenAssistant/oasst2`
## Credits

- Base model: Qwen / Unsloth
- Training stack: `ms-swift`
- GGUF conversion and quantization: `llama.cpp`