# Sparse Autoencoder for Qwen 3.5 9B: Interpretability Research
A sparse autoencoder (SAE) trained on the MLP activations of the Qwen 3.5 9B base model, developed as part of a comparative mechanistic interpretability study.
This SAE was trained alongside a second SAE on a fine-tuned variant of the same model, enabling feature-level comparison between base and fine-tuned representations. This release contains the base model SAE only.
## Model Details
| Property | Value |
|---|---|
| Base model | Qwen 3.5 9B |
| Activation source | Layer 16 MLP output |
| Input dimension | 4,096 |
| SAE dimension | 16,384 (4x expansion) |
| Active features | 16,384 / 16,384 (0 dead features) |
| L1 coefficient | 0.005 |
| Final loss | 0.0062 |
| Training steps | 12,208 |
| Training tokens | ~50M |
## Architecture

Standard sparse autoencoder with a ReLU activation:

    Input (4096) → Encoder (4096 → 16384) + ReLU → Decoder (16384 → 4096)

Parameters:

- `encoder.weight`: (16384, 4096) – maps activations to sparse features
- `encoder.bias`: (16384,) – encoder bias
- `decoder.weight`: (4096, 16384) – reconstructs from sparse features
- `bias`: (4096,) – pre-encoder bias (subtracted from the input)
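The shapes listed above imply the total parameter count of the SAE; a quick arithmetic check (no checkpoint needed):

```python
d_in, d_sae = 4096, 16384

# Parameter counts implied by the shapes listed above
params = {
    "encoder.weight": d_sae * d_in,  # (16384, 4096)
    "encoder.bias": d_sae,           # (16384,)
    "decoder.weight": d_in * d_sae,  # (4096, 16384)
    "bias": d_in,                    # (4096,)
}
total = sum(params.values())
print(f"total parameters: {total:,}")  # total parameters: 134,238,208
```

Almost all of the ~134M parameters sit in the two weight matrices, which are transposes of each other in shape.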
## Usage

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_in=4096, d_sae=16384):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(d_in))        # pre-encoder bias
        self.encoder = nn.Linear(d_in, d_sae)
        self.decoder = nn.Linear(d_sae, d_in, bias=False)

    def forward(self, x):
        x_centered = x - self.bias
        z = torch.relu(self.encoder(x_centered))           # sparse feature activations
        x_hat = self.decoder(z) + self.bias                # reconstruction
        return x_hat, z


# Load the checkpoint
sae = SparseAutoencoder()
ckpt = torch.load("sae_base_best.pt", map_location="cpu")
sae.load_state_dict(ckpt["model_state_dict"])

# Use with Qwen 3.5 9B
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-9B", torch_dtype=torch.float16, device_map="auto"
)

# Hook into the layer 16 MLP output
def hook_fn(module, input, output):
    activations = output  # MLP output, shape (batch, seq, 4096)
    # Cast to the SAE's dtype/device before encoding (the model runs in fp16)
    reconstructed, features = sae(activations.detach().float().cpu())
    # features has shape (batch, seq, 16384) – sparse feature activations
    return output

model.model.layers[16].mlp.register_forward_hook(hook_fn)
```
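The hook above computes the SAE outputs but does not keep them. A minimal way to capture the sparse features is to store them from inside the hook, e.g. in a dict closed over by the hook function. The sketch below uses tiny dimensions and a plain `nn.Linear` as a stand-in for `model.model.layers[16].mlp`, so it runs without the checkpoint or the base model:

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    # Same architecture as in the Usage section
    def __init__(self, d_in=8, d_sae=32):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(d_in))
        self.encoder = nn.Linear(d_in, d_sae)
        self.decoder = nn.Linear(d_sae, d_in, bias=False)

    def forward(self, x):
        z = torch.relu(self.encoder(x - self.bias))
        return self.decoder(z) + self.bias, z


sae = SparseAutoencoder(d_in=8, d_sae=32)  # tiny demo dims; the real SAE is 4096 -> 16384
captured = {}

def hook_fn(module, inputs, output):
    # Detach so the model's autograd graph is not retained, and match the SAE's dtype
    with torch.no_grad():
        _, features = sae(output.detach().to(sae.bias.dtype))
    captured["features"] = features  # (batch, seq, d_sae)
    return output                    # leave the model's forward pass unchanged

# Stand-in for model.model.layers[16].mlp: any module emitting (batch, seq, 8)
mlp = nn.Linear(8, 8)
handle = mlp.register_forward_hook(hook_fn)
_ = mlp(torch.randn(2, 5, 8))
handle.remove()  # always remove hooks when done
print(captured["features"].shape)  # torch.Size([2, 5, 32])
```

Removing the handle after use avoids re-running the SAE on every subsequent forward pass of the model.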
## Research Context
This SAE is part of a comparative study examining how fine-tuning alters a model's internal representations at the feature level. Two SAEs were trained on identical architectures:
- Base SAE (this release): trained on activations from the unmodified Qwen 3.5 9B base model
- Fine-tuned SAE (not released): trained on activations from a personality-fine-tuned variant of the same model

By comparing the learned features of the two SAEs, we identified features unique to the fine-tuned model that correspond to personality-specific behaviors, including a phenomenon we term "memorization without grounding": the fine-tuned model recombines real memorized details into plausible but fictional scenarios.
Key findings:
- All 16,384 features active in both SAEs (0 dead features)
- Cosine similarity analysis between base and fine-tuned feature sets reveals distinct clusters unique to each model
- The fine-tuned model develops features not present in the base that correlate with persona-consistent text generation
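One way to run the cosine-similarity comparison described above is to treat each column of `decoder.weight` as a feature direction, match every fine-tuned feature to its nearest base feature, and flag features whose best match is weak. A sketch under stated assumptions: random weights stand in for the two trained decoders, and the 0.7 threshold is illustrative, not the value used in the study:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Random stand-ins for the two trained decoders; the real shape is (4096, 16384)
W_base = torch.randn(16, 64)
W_ft = torch.randn(16, 64)

# Columns of decoder.weight are feature directions; unit-normalize them
base_dirs = F.normalize(W_base, dim=0)
ft_dirs = F.normalize(W_ft, dim=0)

# (n_ft, n_base) cosine-similarity matrix between all feature pairs
sims = ft_dirs.T @ base_dirs
best_match, _ = sims.max(dim=1)  # closest base feature for each fine-tuned feature

# Fine-tuned features with no close base counterpart are "unique" candidates
unique = best_match < 0.7  # illustrative threshold
print(f"{int(unique.sum())} of {len(best_match)} fine-tuned features lack a close base match")
```

Note that SAE features are only defined up to permutation, which is why a nearest-match search (rather than an index-wise comparison) is needed.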
## Training Details
- Base model: Qwen 3.5 9B (`Qwen/Qwen3.5-9B`)
- Activation source: MLP output at layer 16
- Data: monology/pile-uncopyrighted, streamed, ~50M tokens
- Context length: 512
- Collection batch size: 8
- Collection speed: ~5,440 tokens/sec
- Activations: Saved in chunks to disk (~14GB per chunk) to avoid OOM, then streamed during SAE training
- SAE training batch size: 4,096
- Optimizer: Adam, lr=5e-5
- Loss: MSE reconstruction + L1 sparsity penalty (λ = 0.005)
- Epochs: 1
- Hardware: NVIDIA RTX 4090 (24GB)
- Total time: ~4 hours (collection + training)
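The objective in the recipe above (MSE reconstruction plus an L1 penalty on feature activations, λ = 0.005) can be sketched as follows. Whether the L1 term is a per-element mean or a per-token sum is not stated in this card, so the mean used here is an assumption:

```python
import torch
import torch.nn.functional as F

def sae_loss(x, x_hat, z, l1_coeff=0.005):
    # MSE reconstruction + L1 sparsity penalty on the feature activations
    recon = F.mse_loss(x_hat, x)
    sparsity = l1_coeff * z.abs().mean()  # assumption: per-element mean, not per-token sum
    return recon + sparsity

# Toy batch with demo dimensions
torch.manual_seed(0)
x = torch.randn(64, 16)                   # (batch, d_in)
x_hat = x + 0.01 * torch.randn_like(x)    # near-perfect reconstruction
z = torch.relu(torch.randn(64, 32))       # non-negative feature activations
loss = sae_loss(x, x_hat, z)
print(f"{loss.item():.4f}")
```

The L1 term is what pushes feature activations toward zero; with λ = 0.005 the run above reportedly converged to a final loss of 0.0062 with no dead features.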
## Files

- `sae_base_best.pt` – best checkpoint (lowest loss)
- `sae_base_final.pt` – final checkpoint (last step)
- `config.json` – model configuration
- `analysis/feature_stats.json` – per-feature activation statistics
## License
Apache 2.0
## Citation

```bibtex
@misc{kroonen2026sae-qwen,
  title={Sparse Autoencoder for Qwen 3.5 9B: Comparative Mechanistic Interpretability},
  author={Kroonen AI},
  year={2026},
  publisher={Kroonen AI Inc.}
}
```
## Blog Post
Read the full write-up: Mapping the Mind of Qwen 3.5 9B
## About
Built by Kroonen AI Inc.