Shannon Control Unit (SCU) — Cruise Control for LLM Training


Model Weights: Llama 3.2 Community License | Code: AGPL-3.0 for research/academia — commercial licenses available (GitHub)

Like cruise control maintains your speed regardless of hills, SCU maintains optimal regularization regardless of data complexity.

Set your target information ratio $S^*$, and our PI controller automatically adjusts $\lambda$ to maintain it throughout training. No manual hyperparameter tuning required.

Validated Results:

| Model | Metric | Baseline | SCU | Improvement |
|---|---|---|---|---|
| Llama-3.2-1B | BPT | 3.920 | 3.676 | -6.2% |
| Llama-3.2-1B | Perplexity | 15.14 | 12.78 | -15.6% |
| Llama-3.2-3B 🎯 | BPT | 1.830 | 1.635 | -10.6% |
| Llama-3.2-3B 🎯 | Perplexity | 3.56 | 3.11 | -12.6% |

Status: Validated at 1B/3B scales | Seeking partners for 7B+ external validation

View validation artifacts | Evaluation protocol

Available Models

| Model | Location | Training | Final BPT | Improvement |
|---|---|---|---|---|
| Llama-3.2-1B + SCU | hunterbown/shannon-control-unit | PI Control (S*=1%) | 3.676 | -6.2% |
| Llama-3.2-3B + SCU | subfolder="3b-scu" | PI Control (S*=3%) | 1.635 | -10.6% |

Note: Both are LoRA adapters. Load base models from Meta first, then apply our SCU adapters.

Validation Results

Data Files

Planned Comparisons and Baselines

To rigorously validate the SCU approach, we plan to compare against the following baselines, reporting means and 95% CIs across multiple seeds with fixed token budgets:

  • Optimal fixed regularization (grid/Bayesian search for the best constant λ)
  • Scheduled regularization (tuned λ decay: linear, cosine)
  • Adaptive KL control (controller targeting a fixed KL from the base model)
  • Hyperparameter sensitivity (S*, Kp, Ki, σ) and step‑time overhead (<1–2%)
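As a concrete example of the scheduled-regularization baseline in the list above, a cosine λ decay could look like the following sketch; the initial and final values are placeholders, not values from our experiments:

```python
import math

def cosine_lambda(step, total_steps, lam_init=1.0, lam_final=0.0):
    """Cosine decay of the regularization strength from lam_init to lam_final."""
    progress = min(step / max(total_steps, 1), 1.0)
    return lam_final + 0.5 * (lam_init - lam_final) * (1.0 + math.cos(math.pi * progress))
```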

Control Telemetry

Lambda Evolution

Adaptive λ(t): Real-time regularization strength adjustments in response to S-ratio deviations


How SCU Training Works

S-ratio Tracking

Real control dynamics: S(t) oscillates around the target (1.0% ± 0.2 pp), showing active PI-control adjustments. This is actual telemetry from training, not a simulation.
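The plots above are built from per-step records of DataBPT, ParamBPT, the resulting S-ratio, and λ. A minimal logger along these lines could produce them; the CSV format and field order are illustrative, not the project's actual telemetry code:

```python
import csv

def log_step(path, step, data_bpt, param_bpt, lam):
    """Append one telemetry row: step, DataBPT, ParamBPT, S-ratio, lambda."""
    s = param_bpt / (data_bpt + param_bpt)  # the controlled S-ratio
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([step, data_bpt, param_bpt, s, lam])
```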

Ablation Study: Adaptive vs Fixed λ

Ablation Summary

Result: PI control achieves 1.8% better BPT than the best fixed-λ, indicating that adaptive regularization outperforms even a well-tuned constant.


Quick start (adapters)

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# For 1B model (validated with 6.2% BPT improvement)
base_id = "meta-llama/Llama-3.2-1B"  # accept terms on HF first
base = AutoModelForCausalLM.from_pretrained(
    base_id,
    device_map="auto",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
)
tok = AutoTokenizer.from_pretrained(base_id)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
base.config.pad_token_id = tok.pad_token_id

# Load the validated 1B adapter (main directory or 1b-scu/)
model = PeftModel.from_pretrained(base, "hunterbown/shannon-control-unit")  

# For the 3B model, reload the base from meta-llama/Llama-3.2-3B the same way,
# then load the adapter from the "3b-scu" subfolder:
# base_id = "meta-llama/Llama-3.2-3B"
# base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
# tok = AutoTokenizer.from_pretrained(base_id)
# model = PeftModel.from_pretrained(base, "hunterbown/shannon-control-unit", subfolder="3b-scu")
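Once the adapter is loaded, the model behaves like any other causal LM. A quick sanity check (the prompt and generation settings below are arbitrary, not part of the evaluation protocol):

```python
prompt = "Information theory tells us that"
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=50, do_sample=False,
                         pad_token_id=tok.pad_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
```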

Demo notebook: Open in Colab


How It Works (Cruise Control Analogy)

Just like cruise control in your car:

  • You set the target: Choose your information ratio $S^*$
  • SCU maintains it automatically: PI controller adjusts $\lambda$ in real-time
  • No manual intervention: Works across data distribution shifts and training dynamics

Technical Details:

  • Control variable: $S=\frac{\text{ParamBPT}}{\text{DataBPT}+\text{ParamBPT}}$
  • Control law: $\lambda \leftarrow \lambda \cdot \exp(-(K_p \cdot \text{error} + K_i \cdot I))$
  • Result: Automatic regularization without hyperparameter sweeps
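For concreteness, here is a minimal sketch of that control loop. The class name, gains, λ bounds, and the setpoint-minus-measurement error convention are illustrative assumptions, not the exact configuration used for the released adapters:

```python
import math

class ShannonControlUnit:
    """PI controller that adjusts lambda so the information ratio S tracks the target S*."""

    def __init__(self, s_target=0.01, kp=1.0, ki=0.1,
                 lam_init=1.0, lam_min=1e-4, lam_max=1e4):
        # Gains and bounds are illustrative; tune for your own setup.
        self.s_target = s_target
        self.kp, self.ki = kp, ki
        self.lam = lam_init
        self.lam_min, self.lam_max = lam_min, lam_max
        self.integral = 0.0  # accumulated error term I

    def update(self, data_bpt, param_bpt):
        # Control variable: S = ParamBPT / (DataBPT + ParamBPT)
        s = param_bpt / (data_bpt + param_bpt)
        # Standard setpoint-minus-measurement error (sign convention assumed here).
        error = self.s_target - s
        self.integral += error
        # Control law: lambda <- lambda * exp(-(Kp*error + Ki*I))
        self.lam *= math.exp(-(self.kp * error + self.ki * self.integral))
        # Clamp to keep the regularizer in a sane range (illustrative bounds).
        self.lam = min(max(self.lam, self.lam_min), self.lam_max)
        return self.lam
```

Called once per measurement interval with the current DataBPT and ParamBPT, the returned λ then scales the parameter-complexity term in the training loss.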

Key Research Question: Optimal $S^*$ scaling laws are still being discovered. We found ~1.0% works for 1B models and ~2.88% for 3B models in our setup. We are investigating whether there is a simple “natural operating point” for $S^*$ that depends on model size ($M$), training tokens ($T$), and data domain ($D$): a compact relation $S^* \approx f(M, T, D)$. This is an open question; contributions welcome.

Get Involved (7B+ welcome)


Licensing & IP

  • Model weights: Meta Llama 3.2 Community License (inherited from base model)
  • SCU training code: AGPL-3.0 (research/academia). Commercial licenses available (GitHub repository)

Limitations and Threats to Validity

Current results are for LoRA finetunes of Llama‑3.2 1B/3B on a ~512k‑token WikiText‑103 subset. We have not yet shown results for full‑parameter training or 70B+. SCU must be compared against optimally tuned fixed‑λ and strong schedules, as well as adaptive KL targeting, with multi‑seed reporting and downstream checks (e.g., MMLU/GSM8K) to ensure utility is not reduced.

Positioning: SCU is a training‑time mechanism that adjusts λ to maintain a target information ratio S*; it is distinct from inference‑time uncertainty/refinement loops that modify generation without changing the model’s weights.

  • IP status: U.S. patent pending (provisional filed September 2025)

Repro tips: block size 1024, batch 1, grad-accum 4, gradient checkpointing on, use_cache=False.
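Those tips map roughly onto a standard transformers Trainer setup. A sketch, where the learning rate and output path are placeholders not specified by the tips above:

```python
from transformers import TrainingArguments

model.config.use_cache = False        # use_cache=False, required with gradient checkpointing
model.gradient_checkpointing_enable()

args = TrainingArguments(
    output_dir="scu-finetune",         # placeholder path
    per_device_train_batch_size=1,     # batch 1
    gradient_accumulation_steps=4,     # grad-accum 4
    gradient_checkpointing=True,
    learning_rate=2e-4,                # placeholder; not from the tips
    logging_steps=10,
)
# Block size 1024 applies at tokenization/packing time, e.g.
# tok(text, truncation=True, max_length=1024)
```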
