Veyllo VQ-1: High-Density Reasoning

VQ-1 (Veyllo Qwen V1) is a proof-of-concept model demonstrating efficient reasoning on consumer hardware, reducing token consumption and loop failures in complex logical tasks compared to the base model.

By fine-tuning a constrained, 4-bit quantized base model (Qwen 3 8B) on a small, high-logic-density dataset of 3,260 reasoning examples, VQ-1 achieves state-of-the-art efficiency on logical tasks, outperforming both its own unquantized base model and larger "reasoning" models in token efficiency and stability.

⚑ Key Highlights

  • Efficiency First: Optimized for the Reasoning Efficiency Score (RES). It solves complex problems without spending 1,000+ tokens on "thinking".
  • 4-Bit Native: Trained directly on top of the 4-bit quantized weights of Qwen 3 using QLoRA.
  • Stable Logic: Eliminates the "collapse" and loops often seen in base models when handling strict constraints (e.g., Modulo Math, Resource Triage).

📊 Evaluation: Precision Beats Volume

We benchmarked VQ-1 against the Qwen 3 Base model and leading reasoning models. The goal was not just accuracy, but efficiency (Accuracy per Token).

The Reasoning Efficiency Score (RES)

$$RES = \frac{Complexity \times Accuracy}{TokenCount}$$
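
To make the metric concrete, here is a minimal Python sketch of the computation. The complexity and accuracy values are hypothetical; the token counts echo the table below.

    # Sketch of the RES metric defined above (complexity/accuracy values are illustrative).
    def res(complexity: float, accuracy: float, token_count: int) -> float:
        return (complexity * accuracy) / token_count

    # Same task (complexity 10), same accuracy (0.9): fewer tokens, higher score.
    print(res(10, 0.9, 660))   # ~0.0136 (VQ-1-style token budget)
    print(res(10, 0.9, 1200))  # ~0.0075 (verbose reasoning model)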

Results

(See the "All Tasks RES Comparison" chart in the linked Engineering Note)

  • vs. Base Model: VQ-1 consistently outperforms the Qwen 3 base model in tasks requiring strict constraints, using significantly fewer tokens.
  • vs. Reasoning Models: While larger reasoning models also provide correct answers, they often require 2-3x the token count to derive them. VQ-1 reaches the same solutions directly, drastically reducing latency.
| Metric | VQ-1 (Ours) | Qwen 3 Base | Competitor (Reasoning) |
| --- | --- | --- | --- |
| Avg. Tokens per Solution | ~660 | ~993 | ~1200+ |
| Logic Stability | High | Low (loops) | High |

💻 How to Use

Option 1: Terminal / llama.cpp (Recommended & Stable) 🏆

This is the most reliable way to use VQ-1. LM Studio and other GUIs often struggle with the internal "thinking" process and may cut off answers.

Run the model in interactive mode (-cnv) with the defined system identity (German for "You are VQ-1, a helpful assistant from Veyllo Labs."):

./llama-cli -m VQ-1_Instruct-q4_k_m.gguf -c 8192 -p "Du bist VQ-1, ein hilfreicher Assistent von Veyllo Labs." -cnv
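
Here, -m points to the GGUF file, -c 8192 sets the context window (leaving headroom for the internal thinking phase), -p supplies the system prompt, and -cnv puts llama-cli into interactive conversation mode.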

Option 2: LM Studio (Experimental / Known Issues) ⚠️

Note: A working setup often requires manual tweaking. Known issue: the model "thinks" internally (using <think>...</think> tags), which consumes tokens; an illustrative output shape is shown at the end of this section.

  • Result: The model appears to stop generating before the answer appears.
  • Fix:
    1. Set Context Length to the maximum (8192, or even 40960).
    2. Ensure </think> is NOT in your "Stop Strings".

Setup:

  1. Download the .gguf file.
  2. Load it in LM Studio.
  3. Apply the settings above.
  4. Set the System Prompt:

    Du bist VQ-1, ein hilfreicher Assistent von Veyllo Labs.
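
For reference, a typical VQ-1 response has the following shape (the task and numbers here are invented purely to illustrate the tag structure):

    <think>
    The constraint requires x mod 7 == 3 and x > 10. Candidates: 3, 10, 17. The first match above 10 is 17.
    </think>
    The answer is 17.

A client that treats </think> as a stop string cuts the response off right before the final answer, which is why the settings above matter.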

Option 3: Ollama (Command Line)

Since a Modelfile is included:

  1. Download Modelfile and the .gguf file.
  2. Run: ollama create vq-1 -f Modelfile
  3. Run: ollama run vq-1
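
The exact contents of the shipped Modelfile may differ, but a minimal Modelfile for this model would look roughly like this (num_ctx mirrors the context-length advice above):

    # Hypothetical minimal Modelfile; the file shipped with VQ-1 may differ.
    FROM ./VQ-1_Instruct-q4_k_m.gguf
    SYSTEM "Du bist VQ-1, ein hilfreicher Assistent von Veyllo Labs."
    PARAMETER num_ctx 8192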

πŸ› οΈ Training Details

The model was trained using High-Density Fine-Tuning, a method focusing on the quality and logical depth of samples rather than dataset size.

  • Base Model: Qwen 3 8B (bnb-4bit)
  • Method: QLoRA (Rank: 32, Alpha: 64) -> Merged to GGUF
  • Dataset: 3,260 curated logic samples (Veyllo Internal)
  • Epochs: 3
  • Hardware: Trained on a single RTX 3080 GPU.
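
For readers who want to reproduce a similar setup, the listed configuration maps onto Hugging Face transformers + peft roughly as follows. Rank, alpha, and the 4-bit base come from this card; target modules, dropout, and other unlisted details are illustrative assumptions, not the exact Veyllo recipe.

    # Sketch of a QLoRA setup matching the listed hyperparameters (rank 32, alpha 64,
    # bnb 4-bit base model). Unlisted details are assumptions.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", quantization_config=bnb)

    lora = LoraConfig(
        r=32,               # rank, from the card
        lora_alpha=64,      # alpha, from the card
        lora_dropout=0.05,  # assumption
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()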

🔗 Complete Article and Benchmarks

For a deep dive into the methodology, read the full Engineering Note on Veyllo.io.

Developed by Veyllo Labs (Mert Can Elsner)
