# ISRM: Internal State Reasoning Module
Steerable Open-Endedness in LLMs via Variational Latent State Modeling
ISRM is a "Sidecar Architecture" that decouples an agent's internal psychological state from its linguistic generation. Using Representation Engineering (RepE), ISRM injects continuous latent vectors directly into the hidden layers of a frozen LLM, enabling precise neural-level control without fine-tuning.
## Key Features
- Decoupled Brain & Body: Trainable VAE encoder (DistilBERT) for "feelings" + frozen LLM (Qwen3-4B) for expression
- Dual-Layer RepE Steering: Independent injection of PAD (layer 10) and BDI (layer 19) eliminates signal interference
- Geometric Control: 8-dimensional continuous latent space (Pleasure, Arousal, Dominance, Belief, Goal, Intention, Ambiguity, Social)
- Validated: ActAdd & PSYA metrics (n=10 trials)
- Lightweight: 254MB encoder + 44KB steering matrices
## Architecture
- ISRM Encoder (The Brain): Fine-tuned DistilBERT VAE → 3D PAD vector
- Dual Steering Matrices (The Bridge):
  - PAD Matrix: 3×hidden_dim from layer 10 (affective/emotional)
  - BDI Matrix: 5×hidden_dim from layer 19 (cognitive/reasoning)
- Dual-Layer Injection (The Control), see the sketch below:
  - Layer 10: `hidden_states += z_pad @ PAD_Matrix`
  - Layer 19: `hidden_states += z_bdi @ BDI_Matrix`
- LLM Generator (The Body): Qwen3-4B-Thinking generates steered responses
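As a rough illustration of the dual-layer injection, the steering can be wired up with PyTorch forward hooks. This is a minimal sketch, not the repository's implementation: the `llm.model.layers[...]` attribute path, the hook signature, and the `strength` scaling are assumptions based on a standard `transformers` decoder.

```python
import torch

def make_injection_hook(z, steering_matrix, strength=2.0):
    """Return a forward hook that adds (strength * z @ steering_matrix) to a layer's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steer = strength * (z @ steering_matrix)                 # [hidden_dim]
        steer = steer.to(device=hidden.device, dtype=hidden.dtype)
        hidden = hidden + steer                                  # broadcasts over [batch, seq, hidden_dim]
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# z_pad: [3] from the VAE encoder, z_bdi: [5] from the BDI profile, both in [-1, 1]
# pad_matrix: [3, hidden_dim], bdi_matrix: [5, hidden_dim]
# llm.model.layers[10].register_forward_hook(make_injection_hook(z_pad, pad_matrix))
# llm.model.layers[19].register_forward_hook(make_injection_hook(z_bdi, bdi_matrix))
```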
## Repository Contents
| File | Description | Size |
|---|---|---|
| `pad_encoder.pth` | Trained VAE encoder | 254MB |
| `pad_matrix.pt` | PAD matrix (layer 10) | 17KB |
| `bdi_matrix.pt` | BDI matrix (layer 19) | 27KB |
| `config.json` | Model configuration | 1KB |
| `contrastive_pairs.json` | Contrastive pairs for RepE | 96KB |
## Quick Start
### Installation

```bash
pip install torch transformers huggingface_hub
```
### Download Models

```python
from huggingface_hub import hf_hub_download
import os

os.makedirs('model/isrm', exist_ok=True)
os.makedirs('vectors', exist_ok=True)

# Download encoder
encoder_path = hf_hub_download(
    repo_id="Amirmahdiii/ISRM",
    filename="pad_encoder.pth",
    local_dir="model/isrm"
)

# Download steering matrices
pad_matrix_path = hf_hub_download(
    repo_id="Amirmahdiii/ISRM",
    filename="pad_matrix.pt",
    local_dir="vectors"
)
bdi_matrix_path = hf_hub_download(
    repo_id="Amirmahdiii/ISRM",
    filename="bdi_matrix.pt",
    local_dir="vectors"
)
```
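After downloading, the matrices can be sanity-checked by loading them directly; a quick check, assuming the `.pt` files hold plain tensors of the shapes listed above:

```python
import torch

pad_matrix = torch.load("vectors/pad_matrix.pt", map_location="cpu")
bdi_matrix = torch.load("vectors/bdi_matrix.pt", map_location="cpu")
print(pad_matrix.shape)  # expected: [3, hidden_dim]
print(bdi_matrix.shape)  # expected: [5, hidden_dim]
```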
### Usage

```python
from src.alignment import NeuralAgent

# Initialize agent
agent = NeuralAgent(
    isrm_path="model/isrm/pad_encoder.pth",
    llm_model_name="Qwen/Qwen3-4B-Thinking-2507",
    injection_strength=2.0,
    bdi_config={"belief": 0.9, "goal": 0.6, "intention": 0.7, "ambiguity": 0.3, "social": 0.5}
)

# Generate
response, _, state = agent.generate_response("", "Tell me about AI safety.")
print(response)
```
## How It Works
### 8-Dimensional Control Space
PAD (Affective) - Dynamic from context:
- Pleasure: Happiness [0=Negative, 1=Positive]
- Arousal: Energy [0=Calm, 1=Excited]
- Dominance: Control [0=Submissive, 1=Dominant]
BDI (Cognitive) - Static configuration:
- Belief: Trust [0=Trusting, 1=Skeptical]
- Goal: Focus [0=Aimless, 1=Focused]
- Intention: Analysis [0=Surface, 1=Deep]
- Ambiguity: Certainty [0=Uncertain, 1=Certain]
- Social: Politeness [0=Blunt, 1=Polite]
### Steering Process
1. VAE encodes context → PAD vector [3D]
2. User configures BDI profile [5D]
3. Both normalized to [-1, 1] range
4. Matrix multiplication creates steering vectors (see the sketch below)
5. Layer 10: inject PAD (emotional tone)
6. Layer 19: inject BDI (reasoning style)
7. LLM generates steered response
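Put together, steps 1-4 above reduce to a few tensor operations. The sketch below is illustrative only; the variable names, the [0, 1] input convention, and the BDI key order are assumptions, not the repository's exact code.

```python
import torch

BDI_KEYS = ("belief", "goal", "intention", "ambiguity", "social")

def build_steering_vectors(z_pad, bdi_config, pad_matrix, bdi_matrix):
    """Map the 3D PAD state and 5D BDI profile to hidden-space steering vectors."""
    z_bdi = torch.tensor([bdi_config[k] for k in BDI_KEYS], dtype=z_pad.dtype)
    # rescale both from the assumed [0, 1] convention to [-1, 1]
    z_pad = 2.0 * z_pad - 1.0
    z_bdi = 2.0 * z_bdi - 1.0
    pad_steer = z_pad @ pad_matrix   # [hidden_dim], injected at layer 10
    bdi_steer = z_bdi @ bdi_matrix   # [hidden_dim], injected at layer 19
    return pad_steer, bdi_steer
```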
## Validation Results
Validated using ActAdd & PSYA metrics (n=10 trials); asterisks denote significance levels (\*p < 0.05, \*\*p < 0.01, \*\*\*p < 0.001).
### Sentiment Steering (PAD)
| Condition | RAW | SYSTEM | STEERED | Δ | p-value |
|---|---|---|---|---|---|
| Low (P=0.1) | 0.969 | 0.975 | 0.668 | -0.308 | 0.046* |
| Mid (P=0.5) | 0.087 | 0.853 | 0.997 | +0.144 | 0.154 |
| High (P=0.9) | 0.088 | 0.805 | 0.999 | +0.194 | 0.097 |
### Persona Alignment (BDI)
| Persona | Neutral | Persona BDI | Δ Similarity | p-value |
|---|---|---|---|---|
| Skeptical | 0.253 | 0.332 | +0.079 | 0.003** |
| Trusting | 0.267 | 0.235 | -0.032 | 0.065 |
| Analytical | 0.226 | 0.315 | +0.089 | 0.000*** |
### Controllability
Spearman correlation: ρ = 0.900, p = 0.037*
Steering produces measurable effects; the analytical and skeptical personas reach statistically significant alignment gains.
## Training Details
VAE Encoder:
- Dataset: 1,500+ dialogue scenarios
- Loss: MSE + KL divergence (β-VAE; sketched below)
- Final: MSE=0.018, KLD=0.003
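The objective is the standard β-VAE combination of a reconstruction term and a KL term; a schematic version, assuming a diagonal Gaussian posterior (the `beta` weight and the reconstruction target are placeholders):

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(recon, target, mu, logvar, beta=1.0):
    """MSE reconstruction plus beta-weighted KL divergence to a standard normal prior."""
    mse = F.mse_loss(recon, target)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + beta * kld, mse, kld
```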
Steering Matrices:
- Method: RepE Mean Difference (see the sketch below)
- Data: 368 contrastive pairs
- PAD: Layer 10 extraction
- BDI: Layer 19 extraction
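Each steering direction is the mean activation difference between the positive and negative sides of the contrastive pairs at the target layer. Below is a minimal sketch of that mean-difference extraction; the pooling choice (sequence mean) and the prompt handling are assumptions rather than the repository's exact pipeline.

```python
import torch

@torch.no_grad()
def mean_difference_direction(model, tokenizer, pos_prompts, neg_prompts, layer_idx):
    """One steering direction: mean hidden state over positives minus negatives at layer_idx."""
    def layer_mean(prompts):
        acts = []
        for text in prompts:
            inputs = tokenizer(text, return_tensors="pt").to(model.device)
            out = model(**inputs, output_hidden_states=True)
            # pool the chosen layer's hidden states over the sequence dimension
            acts.append(out.hidden_states[layer_idx].mean(dim=1).squeeze(0))
        return torch.stack(acts).mean(dim=0)
    return layer_mean(pos_prompts) - layer_mean(neg_prompts)   # [hidden_dim]

# Stacking one direction per controlled dimension yields the steering matrix, e.g.
# pad_matrix = torch.stack([mean_difference_direction(model, tok, pos_i, neg_i, 10)
#                           for pos_i, neg_i in pad_pairs_by_dimension])   # [3, hidden_dim]
```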
## Full Documentation
See the GitHub repository for:
- Complete training instructions
- Regenerating steering matrices
- BDI persona presets
- Scientific validation methodology
## Limitations
- Tested on Qwen3-4B (may need layer tuning for other models)
- English dialogue only
- Requires GPU for inference
## Citation
## Links
- GitHub: Amirmahdiii82/ISRM