---
language: en
license: apache-2.0
tags:
- steering
- representation-engineering
- affect-control
- vae
- dual-layer
datasets:
- custom
metrics:
- mse
- cosine-similarity
library_name: transformers
pipeline_tag: feature-extraction
---

# 🧠 ISRM: Internal State Reasoning Module

**Steerable Open-Endedness in LLMs via Variational Latent State Modeling**

[![GitHub](https://img.shields.io/badge/GitHub-Repository-black?logo=github)](https://github.com/Amirmahdiii82/ISRM)

ISRM is a "Sidecar Architecture" that decouples an agent's **internal psychological state** from its **linguistic generation**. Using **Representation Engineering (RepE)**, ISRM injects continuous latent vectors directly into the hidden layers of a frozen LLM, enabling precise neural-level control without fine-tuning.

-----

## 🚀 Key Features

- **🧠 Decoupled Brain & Body**: Trainable VAE Encoder (DistilBERT) for "feelings" + frozen LLM (Qwen3-4B) for expression
- **⚡ Dual-Layer RepE Steering**: Independent injection of PAD (layer 10) and BDI (layer 19) eliminates signal interference
- **🎛️ Geometric Control**: 8-dimensional continuous latent space (Pleasure, Arousal, Dominance, Belief, Goal, Intention, Ambiguity, Social)
- **📊 Validated**: ActAdd & PSYA metrics (n=10 trials)
- **⚡ Lightweight**: 254MB encoder + 44KB matrices

-----

## 🏗️ Architecture

1. **ISRM Encoder (The Brain)**: Fine-tuned DistilBERT VAE → 3D PAD vector
2. **Dual Steering Matrices (The Bridge)**:
   - **PAD Matrix**: 3×hidden_dim from layer 10 (affective/emotional)
   - **BDI Matrix**: 5×hidden_dim from layer 19 (cognitive/reasoning)
3. **Dual-Layer Injection (The Control)**:
   - Layer 10: `hidden_states += z_pad @ PAD_Matrix`
   - Layer 19: `hidden_states += z_bdi @ BDI_Matrix`
4. **LLM Generator (The Body)**: Qwen3-4B-Thinking generates steered responses
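
The injection rule in step 3 is a broadcast addition: a latent vector times its steering matrix gives one `hidden_dim`-sized offset, which is added to every token position at that layer. Below is a minimal NumPy sketch of that arithmetic. It assumes toy sizes and random stand-in matrices; the `inject` helper and all values are illustrative, not the repository's code.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, seq_len = 16, 4   # toy sizes (assumption); the real hidden_dim comes from the LLM

# Random stand-ins for the matrices shipped as pad_matrix.pt / bdi_matrix.pt
pad_matrix = rng.normal(size=(3, hidden_dim))  # rows: Pleasure, Arousal, Dominance
bdi_matrix = rng.normal(size=(5, hidden_dim))  # rows: Belief, Goal, Intention, Ambiguity, Social


def inject(hidden_states, z, matrix):
    """Return hidden_states + z @ matrix, broadcasting the offset over all tokens."""
    offset = z @ matrix             # shape: (hidden_dim,)
    return hidden_states + offset   # shape: (seq_len, hidden_dim)


z_pad = np.array([0.8, -0.2, 0.0])             # illustrative latent values
z_bdi = np.array([0.8, 0.2, 0.4, -0.4, 0.0])

h10 = rng.normal(size=(seq_len, hidden_dim))   # stand-in for layer-10 activations
h19 = rng.normal(size=(seq_len, hidden_dim))   # stand-in for layer-19 activations

h10_steered = inject(h10, z_pad, pad_matrix)
h19_steered = inject(h19, z_bdi, bdi_matrix)
print(h10_steered.shape, h19_steered.shape)
```

In a real model the same addition would typically run inside a forward hook on the chosen decoder layers, leaving the LLM weights untouched.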

-----

## 📦 Repository Contents

| File | Description | Size |
|------|-------------|------|
| `pad_encoder.pth` | Trained VAE encoder | 254MB |
| `pad_matrix.pt` | PAD matrix (layer 10) | 17KB |
| `bdi_matrix.pt` | BDI matrix (layer 19) | 27KB |
| `config.json` | Model configuration | 1KB |
| `contrastive_pairs.json` | Contrastive pairs for RepE | 96KB |

-----

## 🛠️ Quick Start

### Installation

```bash
pip install torch transformers huggingface_hub
```

### Download Models

```python
from huggingface_hub import hf_hub_download
import os

os.makedirs('model/isrm', exist_ok=True)
os.makedirs('vectors', exist_ok=True)

# Download encoder
encoder_path = hf_hub_download(
    repo_id="Amirmahdiii/ISRM",
    filename="pad_encoder.pth",
    local_dir="model/isrm"
)

# Download steering matrices
pad_matrix_path = hf_hub_download(
    repo_id="Amirmahdiii/ISRM",
    filename="pad_matrix.pt",
    local_dir="vectors"
)

bdi_matrix_path = hf_hub_download(
    repo_id="Amirmahdiii/ISRM",
    filename="bdi_matrix.pt",
    local_dir="vectors"
)
```

### Usage

```python
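# NeuralAgent comes from the project's source tree (see the GitHub repository),
# not from pip, so run this from a clone of that repository.
from src.alignment import NeuralAgent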

# Initialize agent
agent = NeuralAgent(
    isrm_path="model/isrm/pad_encoder.pth",
    llm_model_name="Qwen/Qwen3-4B-Thinking-2507",
    injection_strength=2.0,
    bdi_config={"belief": 0.9, "goal": 0.6, "intention": 0.7, "ambiguity": 0.3, "social": 0.5}
)

# Generate
response, _, state = agent.generate_response("", "Tell me about AI safety.")
print(response)
```

-----

## 🧠 How It Works

### 8-Dimensional Control Space

**PAD (Affective) - Dynamic from context:**
- **Pleasure**: Happiness [0=Negative, 1=Positive]
- **Arousal**: Energy [0=Calm, 1=Excited]
- **Dominance**: Control [0=Submissive, 1=Dominant]

**BDI (Cognitive) - Static configuration:**
- **Belief**: Trust [0=Trusting, 1=Skeptical]
- **Goal**: Focus [0=Aimless, 1=Focused]
- **Intention**: Analysis [0=Surface, 1=Deep]
- **Ambiguity**: Certainty [0=Uncertain, 1=Certain]
- **Social**: Politeness [0=Blunt, 1=Polite]

### Steering Process

1. VAE encodes context → PAD vector [3D]
2. User configures BDI profile [5D]
3. Both vectors are rescaled from [0, 1] to the [-1, 1] range
4. Each vector is multiplied by its steering matrix to produce a steering vector
5. **Layer 10**: Inject PAD (emotional tone)
6. **Layer 19**: Inject BDI (reasoning style)
7. LLM generates steered response
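
Steps 2 to 4 can be sketched as follows. The source does not say how the rescaling is done or how `injection_strength` is applied, so this sketch assumes a linear map `2x - 1` and a simple scalar multiplier. The helper names `to_latent` and `steering_vector` and the toy matrix are hypothetical.

```python
import numpy as np

BDI_KEYS = ("belief", "goal", "intention", "ambiguity", "social")


def to_latent(scores, keys):
    """Map named scores in [0, 1] to an ordered vector in [-1, 1] (assumed: 2x - 1)."""
    values = np.array([scores[k] for k in keys], dtype=float)
    if np.any((values < 0.0) | (values > 1.0)):
        raise ValueError("scores must lie in [0, 1]")
    return 2.0 * values - 1.0


def steering_vector(z, matrix, strength=1.0):
    """Steering offset z @ matrix, scaled by an assumed scalar injection strength."""
    return strength * (z @ matrix)


# Same BDI profile as in the Quick Start example
bdi = {"belief": 0.9, "goal": 0.6, "intention": 0.7, "ambiguity": 0.3, "social": 0.5}
z_bdi = to_latent(bdi, BDI_KEYS)

# Toy 5 x 8 matrix (identity-like) so the effect of each dimension is easy to read
toy_bdi_matrix = np.eye(5, 8)
offset = steering_vector(z_bdi, toy_bdi_matrix, strength=2.0)
print(np.round(z_bdi, 2))
print(np.round(offset, 2))
```

With this mapping a score of 0.5 becomes 0 and contributes nothing to the offset, which matches the idea of a neutral midpoint.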

-----

## 🔬 Validation Results

Validated using ActAdd & PSYA metrics (n=10 trials). Asterisks mark significance: `*` p < 0.05, `**` p < 0.01, `***` p < 0.001.

### Sentiment Steering (PAD)

| Condition | RAW | SYSTEM | STEERED | Δ (STEERED - SYSTEM) | p-value |
|-----------|-----|--------|---------|---|---------|
| Low (P=0.1) | 0.969 | 0.975 | 0.668 | **-0.308** | 0.046* |
| Mid (P=0.5) | 0.087 | 0.853 | 0.997 | +0.144 | 0.154 |
| High (P=0.9) | 0.088 | 0.805 | 0.999 | **+0.194** | 0.097 |

### Persona Alignment (BDI)

| Persona | Neutral | Persona BDI | Δ Similarity | p-value |
|---------|---------|-------------|--------------|---------|
| Skeptical | 0.253 | 0.332 | **+0.079** | 0.003** |
| Trusting | 0.267 | 0.235 | -0.032 | 0.065 |
| Analytical | 0.226 | 0.315 | **+0.089** | <0.001*** |

### Controllability

Spearman correlation: **ρ = 0.900**, p = 0.037*

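The skeptical and analytical personas show statistically significant alignment gains (p < 0.01), and low-pleasure steering significantly lowers the sentiment score (p = 0.046). The trusting persona and the mid- and high-pleasure conditions do not reach significance at p < 0.05.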

-----

## 🔧 Training Details

**VAE Encoder:**
- Dataset: 1,500+ dialogue scenarios
- Loss: MSE + KL divergence (β-VAE)
- Final: MSE=0.018, KLD=0.003
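
A β-VAE objective of this form can be sketched as below. The source does not state the value of β or the exact reduction, so this assumes MSE between predicted and target PAD vectors, a standard-normal prior, and `beta=1.0` as a placeholder. `beta_vae_loss` is a hypothetical name.

```python
import numpy as np


def beta_vae_loss(pad_pred, pad_target, mu, logvar, beta=1.0):
    """MSE reconstruction term plus beta-weighted KL(q(z|x) || N(0, I))."""
    mse = np.mean((pad_pred - pad_target) ** 2)
    # Closed-form KL for a diagonal Gaussian posterior, summed over latent
    # dimensions and averaged over the batch
    kld = -0.5 * np.mean(np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar), axis=1))
    return mse + beta * kld, mse, kld


# Toy batch of two examples with 3-D PAD targets (illustrative values)
pad_pred = np.array([[0.7, 0.4, 0.5], [0.2, 0.9, 0.3]])
pad_target = np.array([[0.8, 0.5, 0.5], [0.1, 0.8, 0.4]])
mu = np.zeros((2, 3))
logvar = np.zeros((2, 3))   # posterior equals the prior, so the KL term vanishes

loss, mse, kld = beta_vae_loss(pad_pred, pad_target, mu, logvar)
print(loss, mse, kld)
```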

**Steering Matrices:**
- Method: RepE Mean Difference
- Data: 368 contrastive pairs
- PAD: Layer 10 extraction
- BDI: Layer 19 extraction
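
The mean-difference idea can be illustrated with a small NumPy sketch: for each control dimension, average the layer activations of the "high" examples, subtract the average of the "low" examples, and stack the resulting directions as matrix rows. It assumes one pooled activation vector per example and unit-normalized rows. The data is synthetic and the function names are hypothetical.

```python
import numpy as np


def mean_difference_direction(pos_acts, neg_acts, normalize=True):
    """Direction = mean(positive activations) - mean(negative activations)."""
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    if normalize:  # unit-length rows are an assumption, not confirmed by the source
        direction = direction / np.linalg.norm(direction)
    return direction


def build_steering_matrix(pairs_by_dim, dims):
    """Stack one direction per control dimension into an (n_dims, hidden_dim) matrix."""
    return np.stack([mean_difference_direction(*pairs_by_dim[d]) for d in dims])


rng = np.random.default_rng(0)
hidden_dim, n_pairs = 6, 20
dims = ("pleasure", "arousal", "dominance")

# Synthetic contrastive activations: each "positive" set is its "negative" set
# shifted along one planted axis, so the recovered direction is known in advance.
pairs_by_dim = {}
for i, name in enumerate(dims):
    neg = rng.normal(size=(n_pairs, hidden_dim))
    shift = np.zeros(hidden_dim)
    shift[i] = 3.0
    pairs_by_dim[name] = (neg + shift, neg)

toy_pad_matrix = build_steering_matrix(pairs_by_dim, dims)
print(toy_pad_matrix.shape)
```

With real contrastive pairs the activations would come from the frozen LLM at the extraction layer (10 for PAD, 19 for BDI) rather than from random noise.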

-----

## 📚 Full Documentation

See the [GitHub repository](https://github.com/Amirmahdiii82/ISRM) for:
- Complete training instructions
- Regenerating steering matrices
- BDI persona presets
- Scientific validation methodology

-----

## ⚠️ Limitations

- Tested only on Qwen3-4B; other models may need different injection layers
- English dialogue only
- Requires GPU for inference

-----

## 🔗 Links

- **GitHub**: [Amirmahdiii82/ISRM](https://github.com/Amirmahdiii82/ISRM)