🧠 ISRM: Internal State Reasoning Module

Steerable Open-Endedness in LLMs via Variational Latent State Modeling


ISRM is a "Sidecar Architecture" that decouples an agent's internal psychological state from its linguistic generation. Using Representation Engineering (RepE), ISRM injects continuous latent vectors directly into the hidden layers of a frozen LLM, enabling precise neural-level control without fine-tuning.


🚀 Key Features

  • 🧠 Decoupled Brain & Body: Trainable VAE Encoder (DistilBERT) for "feelings" + frozen LLM (Qwen3-4B) for expression
  • ⚡ Dual-Layer RepE Steering: Independent injection of PAD (layer 10) and BDI (layer 19) eliminates signal interference
  • 🎛️ Geometric Control: 8-dimensional continuous latent space (Pleasure, Arousal, Dominance, Belief, Goal, Intention, Ambiguity, Social)
  • 📊 Validated: ActAdd & PSYA metrics (n=10 trials)
  • ⚡ Lightweight: 254MB encoder + 44KB matrices

🏗️ Architecture

  1. ISRM Encoder (The Brain): Fine-tuned DistilBERT VAE → 3D PAD vector
  2. Dual Steering Matrices (The Bridge):
    • PAD Matrix: 3×hidden_dim from layer 10 (affective/emotional)
    • BDI Matrix: 5×hidden_dim from layer 19 (cognitive/reasoning)
  3. Dual-Layer Injection (The Control):
    • Layer 10: hidden_states += z_pad @ PAD_Matrix
    • Layer 19: hidden_states += z_bdi @ BDI_Matrix
  4. LLM Generator (The Body): Qwen3-4B-Thinking generates steered responses
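
For a sense of what item 1 amounts to in code, here is a minimal sketch of a DistilBERT backbone with a VAE head that maps text to a 3D PAD latent. The class and attribute names are illustrative and are not guaranteed to match the module layout stored in pad_encoder.pth.

import torch
import torch.nn as nn
from transformers import AutoTokenizer, DistilBertModel

class PADEncoder(nn.Module):
    """DistilBERT backbone with a VAE head producing a 3D PAD latent."""
    def __init__(self, latent_dim=3):
        super().__init__()
        self.backbone = DistilBertModel.from_pretrained("distilbert-base-uncased")
        hidden = self.backbone.config.dim  # 768 for DistilBERT
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)

    def forward(self, input_ids, attention_mask):
        # [CLS] representation of the dialogue context
        h = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return z, mu, logvar

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = PADEncoder()
batch = tokenizer("I can't believe this worked, amazing!", return_tensors="pt")
z_pad, mu, logvar = encoder(batch["input_ids"], batch["attention_mask"])  # z_pad: (1, 3)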

📦 Repository Contents

| File | Description | Size |
|---|---|---|
| pad_encoder.pth | Trained VAE encoder | 254MB |
| pad_matrix.pt | PAD matrix (layer 10) | 17KB |
| bdi_matrix.pt | BDI matrix (layer 19) | 27KB |
| config.json | Model configuration | 1KB |
| contrastive_pairs.json | Contrastive pairs for RepE | 96KB |

🛠️ Quick Start

Installation

pip install torch transformers huggingface_hub

Download Models

from huggingface_hub import hf_hub_download
import os

os.makedirs('model/isrm', exist_ok=True)
os.makedirs('vectors', exist_ok=True)

# Download encoder
encoder_path = hf_hub_download(
    repo_id="Amirmahdiii/ISRM",
    filename="pad_encoder.pth",
    local_dir="model/isrm"
)

# Download steering matrices
pad_matrix_path = hf_hub_download(
    repo_id="Amirmahdiii/ISRM",
    filename="pad_matrix.pt",
    local_dir="vectors"
)

bdi_matrix_path = hf_hub_download(
    repo_id="Amirmahdiii/ISRM",
    filename="bdi_matrix.pt",
    local_dir="vectors"
)

Usage

from src.alignment import NeuralAgent

# Initialize agent
agent = NeuralAgent(
    isrm_path="model/isrm/pad_encoder.pth",
    llm_model_name="Qwen/Qwen3-4B-Thinking-2507",
    injection_strength=2.0,
    bdi_config={"belief": 0.9, "goal": 0.6, "intention": 0.7, "ambiguity": 0.3, "social": 0.5}
)

# Generate
response, _, state = agent.generate_response("", "Tell me about AI safety.")
print(response)
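
The bdi_config dictionary is what shapes the reasoning style; swapping in different values yields different personas. As an illustration, assuming the same constructor as above, a more skeptical and deeply analytical profile could look like the following. The values are examples, not the repository's shipped presets (higher belief reads as more skeptical and higher intention as deeper analysis, per the scales in the next section).

# Illustrative profile: skeptical, focused, deep analysis, fairly certain, more blunt
skeptical_agent = NeuralAgent(
    isrm_path="model/isrm/pad_encoder.pth",
    llm_model_name="Qwen/Qwen3-4B-Thinking-2507",
    injection_strength=2.0,
    bdi_config={"belief": 0.9, "goal": 0.8, "intention": 0.9, "ambiguity": 0.7, "social": 0.3}
)

response, _, state = skeptical_agent.generate_response("", "Tell me about AI safety.")
print(response)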

🧠 How It Works

8-Dimensional Control Space

PAD (Affective) - Dynamic from context:

  • Pleasure: Happiness [0=Negative, 1=Positive]
  • Arousal: Energy [0=Calm, 1=Excited]
  • Dominance: Control [0=Submissive, 1=Dominant]

BDI (Cognitive) - Static configuration:

  • Belief: Trust [0=Trusting, 1=Skeptical]
  • Goal: Focus [0=Aimless, 1=Focused]
  • Intention: Analysis [0=Surface, 1=Deep]
  • Ambiguity: Certainty [0=Uncertain, 1=Certain]
  • Social: Politeness [0=Blunt, 1=Polite]

Steering Process

  1. VAE encodes context → PAD vector [3D]
  2. User configures BDI profile [5D]
  3. Both normalized to [-1, 1] range
  4. Matrix multiplication creates steering vectors
  5. Layer 10: Inject PAD (emotional tone)
  6. Layer 19: Inject BDI (reasoning style)
  7. LLM generates steered response
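
If you prefer to see these steps without the NeuralAgent wrapper, the sketch below implements them with plain transformers forward hooks. The layer indices and injection strength follow this README; the module path (model.model.layers[i]) and the assumption that the .pt files hold raw 3×hidden_dim / 5×hidden_dim tensors are ours, so treat it as an illustration rather than the repository's exact implementation.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Thinking-2507")
llm = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Thinking-2507", torch_dtype=torch.bfloat16, device_map="auto"
)

# Steps 1-3: latents in [0, 1] rescaled to [-1, 1] (the PAD values stand in for the VAE output)
bdi_profile = torch.tensor([0.9, 0.6, 0.7, 0.3, 0.5])  # belief, goal, intention, ambiguity, social
z_bdi = 2.0 * bdi_profile - 1.0
z_pad = 2.0 * torch.tensor([0.7, 0.4, 0.5]) - 1.0       # pleasure, arousal, dominance

# Step 4: matrix multiplication -> one steering vector per layer
pad_matrix = torch.load("vectors/pad_matrix.pt", map_location="cpu").float()  # (3, hidden_dim)
bdi_matrix = torch.load("vectors/bdi_matrix.pt", map_location="cpu").float()  # (5, hidden_dim)
strength = 2.0
pad_steer = strength * (z_pad @ pad_matrix)
bdi_steer = strength * (z_bdi @ bdi_matrix)

# Steps 5-6: add the vectors to the hidden states of the frozen decoder layers
def add_vector(vec):
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + vec.to(hidden.device, hidden.dtype)
        return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden
    return hook

hooks = [llm.model.layers[10].register_forward_hook(add_vector(pad_steer)),
         llm.model.layers[19].register_forward_hook(add_vector(bdi_steer))]

# Step 7: generate the steered response
inputs = tokenizer("Tell me about AI safety.", return_tensors="pt").to(llm.device)
output = llm.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))

for h in hooks:
    h.remove()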

🔬 Validation Results

Validated using ActAdd & PSYA metrics (n=10 trials):

Sentiment Steering (PAD)

| Condition | RAW | SYSTEM | STEERED | Δ | p-value |
|---|---|---|---|---|---|
| Low (P=0.1) | 0.969 | 0.975 | 0.668 | -0.308 | 0.046* |
| Mid (P=0.5) | 0.087 | 0.853 | 0.997 | +0.144 | 0.154 |
| High (P=0.9) | 0.088 | 0.805 | 0.999 | +0.194 | 0.097 |

Persona Alignment (BDI)

| Persona | Neutral | Persona BDI | Δ Similarity | p-value |
|---|---|---|---|---|
| Skeptical | 0.253 | 0.332 | +0.079 | 0.003** |
| Trusting | 0.267 | 0.235 | -0.032 | 0.065 |
| Analytical | 0.226 | 0.315 | +0.089 | 0.000*** |

Controllability

Spearman correlation: ρ = 0.900, p = 0.037*

Steering produces measurable effects, with the Analytical (p < 0.001) and Skeptical (p = 0.003) personas achieving statistically significant alignment with their target profiles.
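
The controllability figure appears to be a rank correlation between the commanded level and the measured response of the steered outputs. A minimal sketch of computing such a statistic over your own runs (the sentiment scores below are made-up placeholders) is:

from scipy.stats import spearmanr

commanded_pleasure = [0.1, 0.3, 0.5, 0.7, 0.9]        # target P values
measured_sentiment = [0.42, 0.55, 0.51, 0.78, 0.93]   # hypothetical classifier scores of the outputs
rho, p = spearmanr(commanded_pleasure, measured_sentiment)
print(f"rho = {rho:.3f}, p = {p:.3f}")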


🔧 Training Details

VAE Encoder:

  • Dataset: 1,500+ dialogue scenarios
  • Loss: MSE + KL divergence (β-VAE)
  • Final: MSE=0.018, KLD=0.003
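
The loss is a standard β-VAE objective: an MSE reconstruction term plus a KL penalty pulling the latent toward N(0, I). A minimal sketch follows; the β weight and the mean reduction are illustrative defaults, not the training script's exact values.

import torch
import torch.nn.functional as F

def beta_vae_loss(recon, target, mu, logvar, beta=1.0):
    """MSE reconstruction + beta-weighted KL divergence to a standard normal prior."""
    mse = F.mse_loss(recon, target)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + beta * kld, mse, kld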

Steering Matrices:

  • Method: RepE Mean Difference
  • Data: 368 contrastive pairs
  • PAD: Layer 10 extraction
  • BDI: Layer 19 extraction
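
The mean-difference extraction averages, over the contrastive pairs, the gap between hidden states of the "high" and "low" text at the extraction layer, yielding one direction per dimension. Below is a hedged sketch under our own assumptions about the format of contrastive_pairs.json (one key per dimension, each mapping to a list of [positive, negative] strings) and about hidden-state indexing, reusing the llm and tokenizer loaded in the steering sketch above.

import json
import torch

@torch.no_grad()
def layer_hidden(text, layer):
    """Mean hidden state of `text` at the given decoder layer."""
    ids = tokenizer(text, return_tensors="pt").to(llm.device)
    out = llm(**ids, output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=1).squeeze(0).float()  # (hidden_dim,)

def mean_difference_direction(pairs, layer):
    """RepE mean difference: average (positive - negative) activation, unit-normalized."""
    diffs = [layer_hidden(pos, layer) - layer_hidden(neg, layer) for pos, neg in pairs]
    direction = torch.stack(diffs).mean(dim=0)
    return direction / direction.norm()

with open("contrastive_pairs.json") as f:
    pairs = json.load(f)  # assumed format: {"pleasure": [[pos, neg], ...], ...}

# Illustrative: build the 3 x hidden_dim PAD matrix from per-dimension pairs at layer 10
pad_matrix = torch.stack([mean_difference_direction(pairs[dim], layer=10)
                          for dim in ["pleasure", "arousal", "dominance"]])
torch.save(pad_matrix, "vectors/pad_matrix.pt")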

📚 Full Documentation

See the GitHub repository for:

  • Complete training instructions
  • Regenerating steering matrices
  • BDI persona presets
  • Scientific validation methodology

⚠️ Limitations

  • Tested on Qwen3-4B (may need layer tuning for other models)
  • English dialogue only
  • Requires GPU for inference

📜 Citation


🔗 Links
