---
license: apache-2.0
library_name: transformers
language:
  - en
  - fr
  - zh
  - de
tags:
  - programming
  - code generation
  - code
  - codeqwen
  - moe
  - coding
  - coder
  - qwen2
  - chat
  - qwen
  - qwen-coder
  - mixture of experts
  - 4 experts
  - 2 active experts
  - 40k context
  - qwen3
  - finetune
  - qwen3_moe
  - creative
  - all use cases
  - roleplay
  - merge
base_model:
  - Qwen/Qwen3-0.6B
  - suayptalha/Qwen3-0.6B-Code-Expert
  - suayptalha/Qwen3-0.6B-Math-Expert
  - suayptalha/Qwen3-0.6B-Medical-Expert
pipeline_tag: text-generation
---

Qwen3-MOE-4x0.6B-2.4B-Writing-Thunder-V1.2

This repo contains the full-precision source, in safetensors format, which can be used to generate GGUF, GPTQ, EXL2, AWQ, HQQ and other quant formats. The source can also be used directly.
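If you want to use the source directly, here is a minimal sketch using transformers (the repo id below is assumed from this page's title; adjust as needed):

```python
# Minimal sketch: load the full-precision safetensors source directly.
# NOTE: the repo id is assumed from this page's title - adjust if it differs.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "DavidAU/Qwen3-MOE-4x0.6B-2.4B-Writing-Thunder-V1.2"  # assumed

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype="auto", device_map="auto"
)

prompt = "Write the opening line of a horror story."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```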

This is a general-purpose MOE (Mixture of Experts) of four 0.6B models (2.4B combined) in a Mixtral-type Qwen 3 structure, compressed to 1.54B parameters.

Special thanks to all the model makers (see model tree) and of course TEAM Qwen!

This model exceeds 200 t/s with 2 experts activated (the default) on a mid-level GPU, and 2x-3x that on a high-end card; CPU-only performance will be 50+ t/s.

This model excels at high-speed creative tasks, but can also be used for general tasks. For coding/complex tasks, Q8 or full precision is strongly suggested.

A full example generation, at Q4_K_S, appears at the bottom of this page.

This has all the power of Qwen 3 and a MOE model in one.

This version has the NATIVE context of 40k.

This is a full thinking model.

The MOE structure reduces the size of the "thinking" block (in tokens).

I have included an optional system prompt to invoke "thinking" in this model, if you want to activate it for all use cases.

Usually the model will activate thinking on its own.

SETTINGS:

For coding/programming, set the number of experts to:

  • 2 for general work.
  • 3 for moderate work.
  • 4 for complex work, long projects, and complex coding.
  • Suggest a minimum context window of 4k to 8k.
  • For longer context and/or multi-turn chats, increase experts by 1-2 to help with longer-context / multi-turn understanding (see the sketch after this list).
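As a minimal sketch of raising the active expert count when running the safetensors source with transformers (the "num_experts_per_tok" field and repo id are assumptions; GGUF-based apps expose this setting differently, see the guide linked under "Help, Adjustments, Samplers, Parameters and More" below):

```python
# Minimal sketch: raise the number of active experts in transformers.
# ASSUMPTION: the model uses the standard Qwen3-MoE config field
# "num_experts_per_tok"; the repo id is assumed from this page's title.
from transformers import AutoConfig, AutoModelForCausalLM

repo_id = "DavidAU/Qwen3-MOE-4x0.6B-2.4B-Writing-Thunder-V1.2"  # assumed

config = AutoConfig.from_pretrained(repo_id)
config.num_experts_per_tok = 3  # default is 2; try 4 for complex, long projects

model = AutoModelForCausalLM.from_pretrained(
    repo_id, config=config, torch_dtype="auto", device_map="auto"
)
```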

Recommended settings - general:

  • Rep pen 1.05 to 1.1; rep pen of 1 will also work well (you may need to raise it for lower quants / fewer activated experts).
  • Temp .3 to .6 (+/- .2).
  • Top-k of 20, 40 or 100.
  • Top-p of .95 / min-p of .05.
  • Suggest a minimum context window of 4k to 8k.
  • System prompt (optional) to focus the model better (see the sketch after this list).
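A minimal sketch applying these settings via transformers generate(), reusing model, tokenizer and inputs from the loading sketch above (min_p requires a recent transformers release):

```python
# Minimal sketch: the recommended general settings as generate() arguments.
output = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.5,          # .3 to .6 (+/- .2)
    top_k=40,                 # 20, 40 or 100
    top_p=0.95,
    min_p=0.05,               # needs a recent transformers version
    repetition_penalty=1.05,  # 1.05 to 1.1; 1.0 also works at higher quants
    max_new_tokens=512,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```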

OPTIONAL SYSTEM PROMPT - INVOKE "Thinking":

Enable deep thinking subroutine. You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside ###ponder### ###/ponder### tags, and then provide your solution or response to the problem.

Use this to INVOKE the "thinking" block(s) in the model. These blocks will generally be much shorter than the 1000s of tokens produced by most "thinking" models.

If you use this prompt, you may need to raise "rep pen" to 1.08 to 1.1 to prevent "loops" in the "thought block(s)", especially at lower quants.

If you change "ponder" to a different word/phrase, this will also affect the model's "thinking".
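A minimal sketch of passing this optional system prompt through the chat template (reusing tokenizer and model from the loading sketch above; the user message is just an example):

```python
# Minimal sketch: invoke "thinking" via the optional system prompt.
thinking_prompt = (
    "Enable deep thinking subroutine. You are a deep thinking AI, you may use "
    "extremely long chains of thought to deeply consider the problem and "
    "deliberate with yourself via systematic reasoning processes to help come "
    "to a correct solution prior to answering. You should enclose your thoughts "
    "and internal monologue inside ###ponder### ###/ponder### tags, and then "
    "provide your solution or response to the problem."
)

messages = [
    {"role": "system", "content": thinking_prompt},
    {"role": "user", "content": "Plan a short horror scene set on the 21st floor."},
]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Raise rep pen slightly to avoid loops in the thought block(s).
output = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.5,
    repetition_penalty=1.08,
    max_new_tokens=1024,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```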


QUANTS


PENDING


Help, Adjustments, Samplers, Parameters and More


CHANGE THE NUMBER OF ACTIVE EXPERTS:

See this document:

https://huggingface.co/DavidAU/How-To-Set-and-Manage-MOE-Mix-of-Experts-Model-Activation-of-Experts

Settings: CHAT / ROLEPLAY and/or SMOOTHER operation of this model:

In "KoboldCpp" or "oobabooga/text-generation-webui" or "Silly Tavern" ;

Set the "Smoothing_factor" to 1.5

: in KoboldCpp -> Settings->Samplers->Advanced-> "Smooth_F"

: in text-generation-webui -> parameters -> lower right.

: In Silly Tavern this is called: "Smoothing"

NOTE: For "text-generation-webui"

-> if using GGUFs you need to use "llama_HF" (which involves downloading some config files from the SOURCE version of this model)

Source versions (and config files) of my models are here:

https://huggingface.co/collections/DavidAU/d-au-source-files-for-gguf-exl2-awq-gptq-hqq-etc-etc-66b55cb8ba25f914cbf210be

OTHER OPTIONS:

  • Increase rep pen to 1.1 to 1.15 (you don't need to do this if you use "smoothing_factor")

  • If the interface/program you are using to run AI MODELS supports "Quadratic Sampling" ("smoothing") just make the adjustment as noted.

Highest Quality Settings / Optimal Operation Guide / Parameters and Samplers

This is a "Class 1" model:

For all settings used for this model (including specifics for its "class"), example generations, and an advanced settings guide that often addresses model issues and covers methods to improve performance for all use cases (including chat and roleplay), please see:

[ https://huggingface.co/DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters ]



Example Generation:

Q4_K_S; temp .7, rep pen 1.05, top-k 40, top-p .95, min-p .05; LM Studio.

This is a mid-range quant; expect better performance at Q8 / 16-bit full precision.

Imatrix versions (quants) will also exceed this quality.


Start a 1000 word scene (vivid, graphic horror - include blood, guts and gore - in first person), POV character Diana, with: The skyscraper sways, as I watch the window in front of me on the 21st floor explode...