# ISRM: Internal State Reasoning Module
Steerable Open-Endedness in LLMs via Variational Latent State Modeling
ISRM is a "Sidecar Architecture" that decouples an agent's internal psychological state from its linguistic generation. Using Representation Engineering (RepE), ISRM injects continuous latent vectors directly into the hidden layers of a frozen LLM, enabling precise neural-level control without fine-tuning.
## Key Features
- Decoupled Brain & Body: Trainable VAE encoder (DistilBERT) for "feelings" + frozen LLM (Qwen3-4B) for expression
- Dual-Layer RepE Steering: Independent injection of PAD (layer 10) and BDI (layer 19) eliminates signal interference
- Geometric Control: 8-dimensional continuous latent space (Pleasure, Arousal, Dominance, Belief, Goal, Intention, Ambiguity, Social)
- Validated: ActAdd & PSYA metrics (n=10 trials)
- Lightweight: 254MB encoder + 44KB steering matrices
## Architecture
- ISRM Encoder (The Brain): Fine-tuned DistilBERT VAE → 3D PAD vector
- Dual Steering Matrices (The Bridge):
  - PAD Matrix: 3×hidden_dim from layer 10 (affective/emotional)
  - BDI Matrix: 5×hidden_dim from layer 19 (cognitive/reasoning)
- Dual-Layer Injection (The Control), see the sketch below:
  - Layer 10: `hidden_states += z_pad @ PAD_Matrix`
  - Layer 19: `hidden_states += z_bdi @ BDI_Matrix`
- LLM Generator (The Body): Qwen3-4B-Thinking generates steered responses
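As a rough illustration of the dual-layer injection, the steering can be wired up with PyTorch forward hooks. This is a minimal sketch, not the repository's implementation: the `llm.model.layers[...]` attribute path, the hook signature, and the `strength` scaling are assumptions based on a standard `transformers` decoder.

```python
import torch

def make_injection_hook(z, steering_matrix, strength=2.0):
    """Return a forward hook that adds (strength * z @ steering_matrix) to a layer's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steer = strength * (z @ steering_matrix)                 # [hidden_dim]
        steer = steer.to(device=hidden.device, dtype=hidden.dtype)
        hidden = hidden + steer                                  # broadcasts over [batch, seq, hidden_dim]
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# z_pad: [3] from the VAE encoder, z_bdi: [5] from the BDI profile, both in [-1, 1]
# pad_matrix: [3, hidden_dim], bdi_matrix: [5, hidden_dim]
# llm.model.layers[10].register_forward_hook(make_injection_hook(z_pad, pad_matrix))
# llm.model.layers[19].register_forward_hook(make_injection_hook(z_bdi, bdi_matrix))
```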
## Repository Contents
| File | Description | Size |
|---|---|---|
| `pad_encoder.pth` | Trained VAE encoder | 254MB |
| `pad_matrix.pt` | PAD matrix (layer 10) | 17KB |
| `bdi_matrix.pt` | BDI matrix (layer 19) | 27KB |
| `config.json` | Model configuration | 1KB |
| `contrastive_pairs.json` | Contrastive pairs for RepE | 96KB |
## Quick Start
### Installation

```bash
pip install torch transformers huggingface_hub
```
### Download Models

```python
from huggingface_hub import hf_hub_download
import os

os.makedirs('model/isrm', exist_ok=True)
os.makedirs('vectors', exist_ok=True)

# Download encoder
encoder_path = hf_hub_download(
    repo_id="Amirmahdiii/ISRM",
    filename="pad_encoder.pth",
    local_dir="model/isrm"
)

# Download steering matrices
pad_matrix_path = hf_hub_download(
    repo_id="Amirmahdiii/ISRM",
    filename="pad_matrix.pt",
    local_dir="vectors"
)
bdi_matrix_path = hf_hub_download(
    repo_id="Amirmahdiii/ISRM",
    filename="bdi_matrix.pt",
    local_dir="vectors"
)
```
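After downloading, the matrices can be sanity-checked by loading them directly; a quick check, assuming the `.pt` files hold plain tensors of the shapes listed above:

```python
import torch

pad_matrix = torch.load("vectors/pad_matrix.pt", map_location="cpu")
bdi_matrix = torch.load("vectors/bdi_matrix.pt", map_location="cpu")
print(pad_matrix.shape)  # expected: [3, hidden_dim]
print(bdi_matrix.shape)  # expected: [5, hidden_dim]
```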
### Usage

```python
from src.alignment import NeuralAgent

# Initialize agent
agent = NeuralAgent(
    isrm_path="model/isrm/pad_encoder.pth",
    llm_model_name="Qwen/Qwen3-4B-Thinking-2507",
    injection_strength=2.0,
    bdi_config={"belief": 0.9, "goal": 0.6, "intention": 0.7, "ambiguity": 0.3, "social": 0.5}
)

# Generate
response, _, state = agent.generate_response("", "Tell me about AI safety.")
print(response)
```
## How It Works
### 8-Dimensional Control Space
PAD (Affective) - Dynamic from context:
- Pleasure: Happiness [0=Negative, 1=Positive]
- Arousal: Energy [0=Calm, 1=Excited]
- Dominance: Control [0=Submissive, 1=Dominant]
BDI (Cognitive) - Static configuration:
- Belief: Trust [0=Trusting, 1=Skeptical]
- Goal: Focus [0=Aimless, 1=Focused]
- Intention: Analysis [0=Surface, 1=Deep]
- Ambiguity: Certainty [0=Uncertain, 1=Certain]
- Social: Politeness [0=Blunt, 1=Polite]
### Steering Process
1. VAE encodes context → PAD vector [3D]
2. User configures BDI profile [5D]
3. Both normalized to [-1, 1] range
4. Matrix multiplication creates steering vectors (see the sketch below)
5. Layer 10: inject PAD (emotional tone)
6. Layer 19: inject BDI (reasoning style)
7. LLM generates steered response
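Put together, steps 1-4 above reduce to a few tensor operations. The sketch below is illustrative only; the variable names, the [0, 1] input convention, and the BDI key order are assumptions, not the repository's exact code.

```python
import torch

BDI_KEYS = ("belief", "goal", "intention", "ambiguity", "social")

def build_steering_vectors(z_pad, bdi_config, pad_matrix, bdi_matrix):
    """Map the 3D PAD state and 5D BDI profile to hidden-space steering vectors."""
    z_bdi = torch.tensor([bdi_config[k] for k in BDI_KEYS], dtype=z_pad.dtype)
    # rescale both from the assumed [0, 1] convention to [-1, 1]
    z_pad = 2.0 * z_pad - 1.0
    z_bdi = 2.0 * z_bdi - 1.0
    pad_steer = z_pad @ pad_matrix   # [hidden_dim], injected at layer 10
    bdi_steer = z_bdi @ bdi_matrix   # [hidden_dim], injected at layer 19
    return pad_steer, bdi_steer
```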
## Validation Results
Validated using ActAdd & PSYA metrics (n=10 trials); asterisks denote significance levels (\*p < 0.05, \*\*p < 0.01, \*\*\*p < 0.001).
### Sentiment Steering (PAD)
| Condition | RAW | SYSTEM | STEERED | Δ | p-value |
|---|---|---|---|---|---|
| Low (P=0.1) | 0.969 | 0.975 | 0.668 | -0.308 | 0.046* |
| Mid (P=0.5) | 0.087 | 0.853 | 0.997 | +0.144 | 0.154 |
| High (P=0.9) | 0.088 | 0.805 | 0.999 | +0.194 | 0.097 |
### Persona Alignment (BDI)
| Persona | Neutral | Persona BDI | Δ Similarity | p-value |
|---|---|---|---|---|
| Skeptical | 0.253 | 0.332 | +0.079 | 0.003** |
| Trusting | 0.267 | 0.235 | -0.032 | 0.065 |
| Analytical | 0.226 | 0.315 | +0.089 | 0.000*** |
### Controllability
Spearman correlation: ρ = 0.900, p = 0.037*
Steering produces measurable effects; the analytical and skeptical personas reach statistically significant alignment gains.
## Training Details
VAE Encoder:
- Dataset: 1,500+ dialogue scenarios
- Loss: MSE + KL divergence (β-VAE; sketched below)
- Final: MSE=0.018, KLD=0.003
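The objective is the standard β-VAE combination of a reconstruction term and a KL term; a schematic version, assuming a diagonal Gaussian posterior (the `beta` weight and the reconstruction target are placeholders):

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(recon, target, mu, logvar, beta=1.0):
    """MSE reconstruction plus beta-weighted KL divergence to a standard normal prior."""
    mse = F.mse_loss(recon, target)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + beta * kld, mse, kld
```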
Steering Matrices:
- Method: RepE Mean Difference (see the sketch below)
- Data: 368 contrastive pairs
- PAD: Layer 10 extraction
- BDI: Layer 19 extraction
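Each steering direction is the mean activation difference between the positive and negative sides of the contrastive pairs at the target layer. Below is a minimal sketch of that mean-difference extraction; the pooling choice (sequence mean) and the prompt handling are assumptions rather than the repository's exact pipeline.

```python
import torch

@torch.no_grad()
def mean_difference_direction(model, tokenizer, pos_prompts, neg_prompts, layer_idx):
    """One steering direction: mean hidden state over positives minus negatives at layer_idx."""
    def layer_mean(prompts):
        acts = []
        for text in prompts:
            inputs = tokenizer(text, return_tensors="pt").to(model.device)
            out = model(**inputs, output_hidden_states=True)
            # pool the chosen layer's hidden states over the sequence dimension
            acts.append(out.hidden_states[layer_idx].mean(dim=1).squeeze(0))
        return torch.stack(acts).mean(dim=0)
    return layer_mean(pos_prompts) - layer_mean(neg_prompts)   # [hidden_dim]

# Stacking one direction per controlled dimension yields the steering matrix, e.g.
# pad_matrix = torch.stack([mean_difference_direction(model, tok, pos_i, neg_i, 10)
#                           for pos_i, neg_i in pad_pairs_by_dimension])   # [3, hidden_dim]
```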
## Full Documentation
See the GitHub repository for:
- Complete training instructions
- Regenerating steering matrices
- BDI persona presets
- Scientific validation methodology
## Limitations
- Tested on Qwen3-4B (may need layer tuning for other models)
- English dialogue only
- Requires GPU for inference
## Citation
## Links
- GitHub: Amirmahdiii82/ISRM