qwen35-9b-opus46-mix-i1-GGUF

This repository contains GGUF exports of a Qwen 3.5 9B finetune based on unsloth/Qwen3.5-9B, prepared for local inference with llama.cpp and compatible runtimes such as LM Studio, Jan, and Open WebUI backends.

The naming is intentional:

  • qwen35-9b = base model family
  • opus46 = primary training source was nohurry/Opus-4.6-Reasoning-3000x-filtered
  • mix = extra training data was blended in beyond the primary source
  • i1 = imatrix was used during GGUF quantization

Model Summary

  • Base model: unsloth/Qwen3.5-9B
  • Format: GGUF
  • Intended runtimes: llama.cpp and compatible local UIs
  • Quantizations in this repo:
    • Q4_K_M
    • Q8_0
  • Main goal: stronger local reasoning behavior and more structured assistant outputs than a plain stock GGUF export

Training / Build Notes

This release came from a full train/export pipeline rather than a direct one-step conversion.

Workflow:

  • LoRA SFT with ms-swift
  • Base model: unsloth/Qwen3.5-9B
  • Primary dataset: nohurry/Opus-4.6-Reasoning-3000x-filtered
  • Extra mixed datasets:
    • Salesforce/xlam-function-calling-60k
    • OpenAssistant/oasst2
  • Export path: LoRA merge, GGUF conversion via llama.cpp, imatrix-assisted quantization for the imatrix-tagged output set
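The export path above can be sketched with llama.cpp's standard tooling. The file names, the merged-model directory, and the calibration file are hypothetical placeholders, not the exact commands used for this release:

```shell
# Convert the merged (LoRA-applied) HF checkpoint to an f16 GGUF.
# ./merged-model/ and all output names below are placeholders.
python convert_hf_to_gguf.py ./merged-model/ \
  --outfile qwen35-9b-opus46-mix-f16.gguf --outtype f16

# Collect an importance matrix over a calibration text file.
./llama-imatrix -m qwen35-9b-opus46-mix-f16.gguf \
  -f calibration.txt -o imatrix.dat -ngl 99

# Quantize with imatrix guidance (the "i1" part of the name).
./llama-quantize --imatrix imatrix.dat \
  qwen35-9b-opus46-mix-f16.gguf \
  qwen35-9b-opus46-mix-i1-Q4_K_M.gguf Q4_K_M
```

The same llama-quantize step with Q8_0 as the target type produces the second quant in this repo.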

This is still a Qwen 3.5 model family export. The added training data changes behavior, not the underlying GGUF architecture family or tokenizer family.

What To Expect

Relative to a plain base-model GGUF, this finetune targets:

  • stronger reasoning-style responses
  • better structured assistant behavior
  • better tool / function-call shaped outputs
  • stronger instruction following

That does not guarantee it will outperform stock Qwen3.5-9B for every prompt. Treat this as a targeted finetune, not a universal replacement.

Files

This repository contains:

  • *Q4_K_M.gguf
  • *Q8_0.gguf

Q4_K_M is the smaller and more practical default for many local setups.

Q8_0 is larger, but may preserve more quality if you have the RAM / VRAM budget.
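To fetch only the quant you need rather than the whole repository, something like the following works with the Hugging Face CLI (the repo id matches this card; the include pattern is one way to filter):

```shell
# Download only the Q4_K_M file into the current directory.
huggingface-cli download slyfox1186/qwen35-9b-opus46-mix-i1-GGUF \
  --include "*Q4_K_M.gguf" --local-dir .
```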

Benchmark Snapshot

I ran speed benchmarks and an initial quality check on a local RTX 4090 machine.

Hardware / runtime:

  • GPU: NVIDIA GeForce RTX 4090
  • CPU: AMD Ryzen 9 7900X
  • llama.cpp build commit: 6729d49
  • GPU offload: -ngl 99
  • llama-bench batch settings: n_batch=2048, n_ubatch=512
| Quant  | Prompt tok/s (512 prompt, 0 gen) | Prompt tok/s (1024 prompt, 0 gen) | Gen tok/s (0 prompt, 128 gen) |
|--------|----------------------------------|-----------------------------------|-------------------------------|
| Q4_K_M | 9838.32                          | 9748.59                           | 137.57                        |
| Q8_0   | 9974.62                          | 9954.67                           | 92.40                         |

Interpretation:

  • Q4_K_M is much faster for generation on this hardware.
  • Q8_0 is slower on decode, but may still be worth trying if you want the higher-precision quant.
  • These numbers were produced from the actual released GGUF files, not a different local build.
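The speed numbers should be reproducible with llama-bench along these lines (the model path is a placeholder; the batch, offload, prompt, and generation settings match those listed above):

```shell
# pp512/pp1024 prompt processing and tg128 generation, as in the table.
./llama-bench -m ./qwen35-9b-opus46-mix-i1-Q4_K_M.gguf \
  -ngl 99 -b 2048 -ub 512 -p 512,1024 -n 128
```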

Initial quality snapshot:

  • Task: gsm8k
  • Setup: lm-eval-harness using local-completions against llama-server
  • Quant: Q4_K_M
  • Tokenizer reference: Qwen/Qwen3-8B
  • Server context: 8192
  • Concurrency: 4
  • Result:
    • flexible-extract exact_match = 0.8415
    • strict-match exact_match = 0.8400

This is a real full-run GSM8K result, not a limited smoke test. I have not posted a broader multi-task quality table yet.
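The GSM8K setup can be reproduced roughly as follows. The port and the model name passed to the harness are assumptions; the flags reflect lm-eval-harness's local-completions backend as described above:

```shell
# Serve the Q4_K_M quant, then point lm-eval at the
# OpenAI-compatible completions endpoint.
./llama-server -m ./qwen35-9b-opus46-mix-i1-Q4_K_M.gguf \
  -c 8192 -ngl 99 --port 8080

lm_eval --model local-completions \
  --model_args model=qwen35-9b,base_url=http://127.0.0.1:8080/v1/completions,num_concurrent=4,tokenizer=Qwen/Qwen3-8B \
  --tasks gsm8k
```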

Prompting

This is still a Qwen-family chat model, so standard chat prompting is appropriate.

Simple example:

You are a helpful assistant.

User: Explain what GGUF is and when to use Q4_K_M versus Q8_0.
Assistant:

If your runtime supports a native Qwen chat template, prefer that.
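One way to get the native template is to serve the GGUF with llama-server and use its OpenAI-compatible chat endpoint, which applies the chat template embedded in the file (port and model path below are placeholders):

```shell
./llama-server -m ./qwen35-9b-opus46-mix-i1-Q4_K_M.gguf \
  -c 4096 -ngl 99 --port 8080

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain what GGUF is and when to use Q4_K_M versus Q8_0."}
    ]
  }'
```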

llama.cpp Example

./llama-cli \
  -m ./qwen35-9b-opus46-mix-i1-Q4_K_M.gguf \
  -c 4096 \
  -ngl 99 \
  -p "You are a helpful assistant.\n\nUser: Explain what imatrix means in a GGUF release.\nAssistant:"

Notes

  • GGUF files are for inference, not further fine-tuning.
  • Quantization level, context size, prompt format, and runtime all affect quality.
  • The i1 tag indicates imatrix usage in the quantization flow. It is a quantization/build detail, not a separate base model family.

Limitations

  • Quantized models can lose quality relative to higher-precision checkpoints.
  • This model can still hallucinate facts, tools, or arguments.
  • Function-calling style does not guarantee correctness of produced calls.
  • Verify important outputs before using them in real workflows.

License

This release follows the license and usage constraints of the base model and the training sources used in the workflow. Review the upstream model and datasets before downstream use:

  • unsloth/Qwen3.5-9B
  • nohurry/Opus-4.6-Reasoning-3000x-filtered
  • Salesforce/xlam-function-calling-60k
  • OpenAssistant/oasst2

Credits

  • Base model: Qwen / Unsloth
  • Training stack: ms-swift
  • GGUF conversion and quantization: llama.cpp