Veyllo VQ-1: High-Density Reasoning
VQ-1 (Veyllo Qwen V1) is a proof-of-concept model demonstrating efficient reasoning on consumer hardware: compared to its base model, it reduces both token consumption and loop failures in complex logical tasks.
By fine-tuning a constrained, 4-bit quantized base model (Qwen 3 8B) on a small, logic-dense dataset of 3,260 reasoning examples, VQ-1 achieves state-of-the-art efficiency in logical tasks. It outperforms its own unquantized base model and larger "reasoning" models in token efficiency and stability.
⚡ Key Highlights
- Efficiency First: Optimized for the Reasoning Efficiency Score (RES). It solves complex problems without spending 1,000+ tokens on "thinking".
- 4-Bit Native: Trained directly on top of the 4-bit quantized weights of Qwen 3 using QLoRA.
- Stable Logic: Eliminates the "collapse" and loops often seen in base models when handling strict constraints (e.g., Modulo Math, Resource Triage).
📊 Evaluation: Precision Beats Volume
We benchmarked VQ-1 against the Qwen 3 Base model and leading reasoning models. The goal was not just accuracy, but efficiency (Accuracy per Token).
The Reasoning Efficiency Score (RES)
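The card's framing is "Accuracy per Token" (see above). The exact formula is not reproduced here, but a minimal formalization under that assumption is:

$$\mathrm{RES} = \frac{\text{Accuracy}}{\text{Avg. Tokens per Solution}}$$

Under this reading, a model that matches a competitor's accuracy at half the token count scores twice the RES.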
Results
(See the "All Tasks RES Comparison" chart in the linked Engineering Note)
- vs. Base Model: VQ-1 consistently outperforms the Qwen 3 base model in tasks requiring strict constraints, using significantly fewer tokens.
- vs. Reasoning Models: While larger reasoning models also produce correct answers, they often need 2-3x the token count to derive them. VQ-1 reaches the same solutions far sooner, drastically reducing latency.
| Metric | VQ-1 (Ours) | Qwen 3 Base | Competitor (Reasoning) |
|---|---|---|---|
| Avg. Tokens per Solution | ~660 | ~993 | ~1200+ |
| Logic Stability | High | Low (Loops) | High |
💻 How to Use
Option 1: Terminal / llama.cpp (Recommended & Stable)
This is the most reliable method to use VQ-1. LM Studio and other GUIs often struggle with the internal "Thinking" process, causing them to cut off answers.
Run the model in interactive mode (-cnv) with the defined system identity (German for "You are VQ-1, a helpful assistant from Veyllo Labs."):

```bash
./llama-cli -m VQ-1_Instruct-q4_k_m.gguf -c 8192 -p "Du bist VQ-1, ein hilfreicher Assistent von Veyllo Labs." -cnv
```
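If you would rather talk to VQ-1 over HTTP, the same llama.cpp build ships llama-server, which exposes an OpenAI-compatible endpoint. A minimal sketch, assuming the same GGUF file (the port and the example question are illustrative):

```bash
# Start the server with the same model and context size as above
./llama-server -m VQ-1_Instruct-q4_k_m.gguf -c 8192 --port 8080

# Query the OpenAI-compatible chat endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "Du bist VQ-1, ein hilfreicher Assistent von Veyllo Labs."},
      {"role": "user", "content": "What is 2^10 mod 7?"}
    ]
  }'
```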
Option 2: LM Studio (Experimental / Known Issues) ⚠️
Note: a working setup often requires manual tweaking.
Known Issue: The model "thinks" internally (using `<think>...</think>` tags), which consumes tokens.
- Result: The model appears to stop generating before the answer appears.
- Fix:
  - Set Context Length to max (`8192` or even `40960`).
  - Ensure `</think>` is NOT in your "Stop Strings".

Setup:

- Download the `.gguf` file.
- Load it in LM Studio.
- Apply the settings above.
- System Prompt: `Du bist VQ-1, ein hilfreicher Assistent von Veyllo Labs.`
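If the chat UI keeps truncating answers, LM Studio's local OpenAI-compatible server is another way to reach the model. A minimal sketch, assuming the server is enabled on its default port (the model name depends on how LM Studio registers the loaded GGUF, and the question is illustrative):

```bash
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vq-1",
    "messages": [
      {"role": "system", "content": "Du bist VQ-1, ein hilfreicher Assistent von Veyllo Labs."},
      {"role": "user", "content": "Is 2047 prime?"}
    ],
    "max_tokens": 4096
  }'
```

Keep `max_tokens` generous so the `<think>...</think>` block does not exhaust the budget before the final answer appears.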
Option 3: Ollama (Command Line)
Since a Modelfile is included:
- Download the `Modelfile` and the `.gguf` file.
- Run: `ollama create vq-1 -f Modelfile`
- Run: `ollama run vq-1`
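For reference, a Modelfile for this kind of setup typically looks like the sketch below. This illustrates the format only and is not the contents of the included file (the filename and parameters are assumptions):

```
FROM ./VQ-1_Instruct-q4_k_m.gguf
SYSTEM "Du bist VQ-1, ein hilfreicher Assistent von Veyllo Labs."
PARAMETER num_ctx 8192
```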
🛠️ Training Details
The model was trained using High-Density Fine-Tuning, a method focusing on the quality and logical depth of samples rather than dataset size.
- Base Model: Qwen 3 8B (bnb-4bit)
- Method: QLoRA (Rank: 32, Alpha: 64) -> Merged to GGUF
- Dataset: 3,260 curated logic samples (Veyllo Internal)
- Epochs: 3
- Hardware: Trained on a single RTX 3080 GPU.
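The "Merged to GGUF" step above can be reproduced with llama.cpp's standard tooling. A rough sketch, assuming the QLoRA adapters were first merged into a Hugging Face checkpoint at `./vq1-merged` (all paths are hypothetical):

```bash
# Convert the merged HF checkpoint to a full-precision GGUF
python convert_hf_to_gguf.py ./vq1-merged --outfile VQ-1_Instruct-f16.gguf

# Re-quantize to Q4_K_M to match the released file
./llama-quantize VQ-1_Instruct-f16.gguf VQ-1_Instruct-q4_k_m.gguf Q4_K_M
```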
📖 Complete Article and Benchmarks
For a deep dive into the methodology, read the full Engineering Note on Veyllo.io.

Developed by Veyllo Labs (Mert Can Elsner)