---
language: en
license: apache-2.0
tags:
- steering
- representation-engineering
- affect-control
- vae
- dual-layer
datasets:
- custom
metrics:
- mse
- cosine-similarity
library_name: transformers
pipeline_tag: feature-extraction
---

# 🧠 ISRM: Internal State Reasoning Module

**Steerable Open-Endedness in LLMs via Variational Latent State Modeling**

[![GitHub](https://img.shields.io/badge/GitHub-Repository-black?logo=github)](https://github.com/Amirmahdiii82/ISRM)

ISRM is a "Sidecar Architecture" that decouples an agent's **internal psychological state** from its **linguistic generation**. Using **Representation Engineering (RepE)**, ISRM injects continuous latent vectors directly into the hidden layers of a frozen LLM, giving neural-level control over the output without fine-tuning the LLM.

-----

## 🚀 Key Features

- **🧠 Decoupled Brain & Body**: A trainable VAE encoder (DistilBERT) models "feelings"; a frozen LLM (Qwen3-4B) handles expression
- **⚡ Dual-Layer RepE Steering**: PAD is injected at layer 10 and BDI at layer 19, independently, to avoid interference between the two signals
- **🎛️ Geometric Control**: 8-dimensional continuous latent space (Pleasure, Arousal, Dominance, Belief, Goal, Intention, Ambiguity, Social)
- **📊 Validated**: ActAdd & PSYA metrics (n=10 trials)
- **⚡ Lightweight**: 254MB encoder + 44KB of steering matrices

-----

## 🏗️ Architecture

1. **ISRM Encoder (The Brain)**: Fine-tuned DistilBERT VAE → 3D PAD vector
2. **Dual Steering Matrices (The Bridge)**:
   - **PAD Matrix**: 3×hidden_dim, extracted from layer 10 (affective/emotional)
   - **BDI Matrix**: 5×hidden_dim, extracted from layer 19 (cognitive/reasoning)
3. **Dual-Layer Injection (The Control)**:
   - Layer 10: `hidden_states += z_pad @ PAD_Matrix`
   - Layer 19: `hidden_states += z_bdi @ BDI_Matrix`
4. **LLM Generator (The Body)**: Qwen3-4B-Thinking generates steered responses

-----

## 📦 Repository Contents

| File | Description | Size |
|------|-------------|------|
| `pad_encoder.pth` | Trained VAE encoder | 254MB |
| `pad_matrix.pt` | PAD matrix (layer 10) | 17KB |
| `bdi_matrix.pt` | BDI matrix (layer 19) | 27KB |
| `config.json` | Model configuration | 1KB |
| `contrastive_pairs.json` | Contrastive pairs for RepE | 96KB |

-----

## 🛠️ Quick Start

### Installation

```bash
pip install torch transformers huggingface_hub
```

### Download Models

```python
import os

from huggingface_hub import hf_hub_download

# Local folders for the encoder and the steering matrices
os.makedirs('model/isrm', exist_ok=True)
os.makedirs('vectors', exist_ok=True)

# Download encoder
encoder_path = hf_hub_download(
    repo_id="Amirmahdiii/ISRM",
    filename="pad_encoder.pth",
    local_dir="model/isrm"
)

# Download steering matrices
pad_matrix_path = hf_hub_download(
    repo_id="Amirmahdiii/ISRM",
    filename="pad_matrix.pt",
    local_dir="vectors"
)
bdi_matrix_path = hf_hub_download(
    repo_id="Amirmahdiii/ISRM",
    filename="bdi_matrix.pt",
    local_dir="vectors"
)
```

### Usage

```python
# NeuralAgent is not part of this model repo; this import assumes the
# ISRM GitHub repository is cloned and on the import path.
from src.alignment import NeuralAgent

# Initialize agent
agent = NeuralAgent(
    isrm_path="model/isrm/pad_encoder.pth",
    llm_model_name="Qwen/Qwen3-4B-Thinking-2507",
    injection_strength=2.0,
    bdi_config={"belief": 0.9, "goal": 0.6, "intention": 0.7, "ambiguity": 0.3, "social": 0.5}
)

# Generate
response, _, state = agent.generate_response("", "Tell me about AI safety.")
print(response)
```

-----

## 🧠 How It Works

### 8-Dimensional Control Space

**PAD (Affective), inferred dynamically from context:**

- **Pleasure**: Happiness [0=Negative, 1=Positive]
- **Arousal**: Energy [0=Calm, 1=Excited]
- **Dominance**: Control [0=Submissive, 1=Dominant]

**BDI (Cognitive), set as static configuration:**

- **Belief**: Trust [0=Trusting, 1=Skeptical]
- **Goal**: Focus [0=Aimless, 1=Focused]
- **Intention**: Analysis [0=Surface, 1=Deep]
- **Ambiguity**: Certainty [0=Uncertain, 1=Certain]
- **Social**: Politeness [0=Blunt, 1=Polite]

### Steering Process

1. The VAE encodes the context → PAD vector [3D]
2. The user configures a BDI profile [5D]
3. Both vectors are rescaled to the [-1, 1] range
4. Multiplying each vector by its matrix produces a steering vector
5. **Layer 10**: Inject PAD (emotional tone)
6. **Layer 19**: Inject BDI (reasoning style)
7. The LLM generates the steered response

-----

## 🔬 Validation Results

Validated using ActAdd & PSYA metrics (n=10 trials). Significance markers: `*` p < 0.05, `**` p < 0.01, `***` p < 0.001.

### Sentiment Steering (PAD)

| Condition | RAW | SYSTEM | STEERED | Δ (STEERED - SYSTEM) | p-value |
|-----------|-----|--------|---------|----------------------|---------|
| Low (P=0.1) | 0.969 | 0.975 | 0.668 | **-0.308** | 0.046* |
| Mid (P=0.5) | 0.087 | 0.853 | 0.997 | +0.144 | 0.154 |
| High (P=0.9) | 0.088 | 0.805 | 0.999 | **+0.194** | 0.097 |

### Persona Alignment (BDI)

| Persona | Neutral | Persona BDI | Δ Similarity | p-value |
|---------|---------|-------------|--------------|---------|
| Skeptical | 0.253 | 0.332 | **+0.079** | 0.003** |
| Trusting | 0.267 | 0.235 | -0.032 | 0.065 |
| Analytical | 0.226 | 0.315 | **+0.089** | <0.001*** |

### Controllability

Spearman correlation: **ρ = 0.900**, p = 0.037*

The Skeptical and Analytical personas show statistically significant alignment gains (p < 0.01), while the Trusting persona does not (p = 0.065). For sentiment steering, only the Low condition reaches significance at p < 0.05.

-----

## 🔧 Training Details

**VAE Encoder:**

- Dataset: 1,500+ dialogue scenarios
- Loss: MSE + KL divergence (β-VAE)
- Final: MSE=0.018, KLD=0.003

**Steering Matrices:**

- Method: RepE Mean Difference
- Data: 368 contrastive pairs
- PAD: Layer 10 extraction
- BDI: Layer 19 extraction

-----

## 📚 Full Documentation

See the [GitHub repository](https://github.com/Amirmahdiii82/ISRM) for:

- Complete training instructions
- Regenerating steering matrices
- BDI persona presets
- Scientific validation methodology

-----

## ⚠️ Limitations

- Tested only on Qwen3-4B; other models may need different injection layers
- English dialogue only
- Requires a GPU for inference

-----

## 📜 Citation

```bibtex
```

## 🔗 Links

- **GitHub**: [Amirmahdiii82/ISRM](https://github.com/Amirmahdiii82/ISRM)
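
The steering arithmetic described under Architecture and Steering Process (rescale to [-1, 1], multiply by the matrix, add to the hidden states) can be sketched with NumPy. This is a minimal illustration, not the repository's implementation: the helper names, the toy `hidden_dim`, the random stand-in matrices, and the assumption that `injection_strength` acts as a plain scalar multiplier are all hypothetical.

```python
import numpy as np

def to_signed(values):
    """Rescale state values from [0, 1] to [-1, 1]."""
    return 2.0 * np.asarray(values, dtype=np.float64) - 1.0

def steering_vector(z, matrix, strength=1.0):
    """(k,) @ (k, hidden_dim) -> (hidden_dim,). Scaling by `strength` is an assumption."""
    return strength * (to_signed(z) @ matrix)

def inject(hidden_states, vector):
    """Add the steering vector at every token position (broadcasts over batch and sequence)."""
    return hidden_states + vector

rng = np.random.default_rng(0)
hidden_dim = 16                                     # toy size, not the real model width
pad_matrix = rng.standard_normal((3, hidden_dim))   # stand-in for pad_matrix.pt
bdi_matrix = rng.standard_normal((5, hidden_dim))   # stand-in for bdi_matrix.pt

z_pad = [0.9, 0.5, 0.1]                             # pleasure, arousal, dominance
z_bdi = [0.9, 0.6, 0.7, 0.3, 0.5]                   # belief, goal, intention, ambiguity, social

h10 = rng.standard_normal((1, 4, hidden_dim))       # stand-in for layer-10 hidden states
h19 = rng.standard_normal((1, 4, hidden_dim))       # stand-in for layer-19 hidden states

v_pad = steering_vector(z_pad, pad_matrix, strength=2.0)
v_bdi = steering_vector(z_bdi, bdi_matrix, strength=2.0)
h10_steered = inject(h10, v_pad)
h19_steered = inject(h19, v_bdi)
```

Under this sketch a neutral state (all values 0.5) maps to zero and leaves the hidden states unchanged. In a real PyTorch model, the addition would typically be done from a forward hook registered on the chosen decoder layer.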
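
The encoder loss listed under Training Details (MSE + KL divergence, β-VAE) has a standard closed form when the posterior is a diagonal Gaussian and the prior is a standard normal. The sketch below assumes that setup, assumes the MSE is taken between predicted and target PAD values, and uses a hypothetical `beta` default; none of these details are confirmed by this model card.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """Mean over the batch of KL(N(mu, exp(logvar)) || N(0, I)), summed over latent dims."""
    mu = np.asarray(mu, dtype=np.float64)
    logvar = np.asarray(logvar, dtype=np.float64)
    per_sample = -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=-1)
    return per_sample.mean()

def beta_vae_loss(pred, target, mu, logvar, beta=1.0):
    """Return (total, mse, kld) with total = mse + beta * kld."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    kld = kl_to_standard_normal(mu, logvar)
    return mse + beta * kld, mse, kld

# Toy batch of one example with a 3D PAD target
total, mse, kld = beta_vae_loss(
    pred=[[0.6, 0.4, 0.5]],
    target=[[0.5, 0.5, 0.5]],
    mu=[[0.0, 0.0, 0.0]],
    logvar=[[0.0, 0.0, 0.0]],
    beta=1.0,
)
```

When the posterior already equals the prior (`mu = 0`, `logvar = 0`) the KL term vanishes and the loss reduces to the reconstruction error.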
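
The "RepE Mean Difference" method named under Training Details can be illustrated as follows: for each control dimension, average the hidden activations of the "high" and "low" sides of the contrastive pairs and take the difference, then stack one direction per dimension into a k×hidden_dim matrix. This is a sketch on synthetic activations; the function names and data layout are hypothetical, and any normalization the actual pipeline applies is not shown.

```python
import numpy as np

def mean_difference_direction(high_acts, low_acts):
    """Direction = mean activation of 'high' examples minus mean of 'low' examples."""
    high = np.asarray(high_acts, dtype=np.float64)
    low = np.asarray(low_acts, dtype=np.float64)
    return high.mean(axis=0) - low.mean(axis=0)

def build_steering_matrix(pairs_per_dimension):
    """Stack one direction per control dimension into a (k, hidden_dim) matrix."""
    rows = [mean_difference_direction(high, low) for high, low in pairs_per_dimension]
    return np.stack(rows)

# Synthetic activations: each dimension's 'high' examples are shifted along one axis
rng = np.random.default_rng(0)
hidden_dim, n_pairs = 8, 40
pairs = []
for axis in range(3):  # three toy dimensions, standing in for P, A, D
    low = rng.normal(scale=0.1, size=(n_pairs, hidden_dim))
    high = rng.normal(scale=0.1, size=(n_pairs, hidden_dim))
    high[:, axis] += 1.0
    pairs.append((high, low))

toy_pad_matrix = build_steering_matrix(pairs)
```

In practice the activations would be collected from the extraction layer (layer 10 for PAD, layer 19 for BDI) while the frozen LLM processes each side of a contrastive pair.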