Sparse Autoencoder for Qwen 3.5 9B β€” Interpretability Research

A sparse autoencoder (SAE) trained on the MLP activations of the Qwen 3.5 9B base model, developed as part of a comparative mechanistic interpretability study.

This SAE was trained alongside a second SAE on a fine-tuned variant of the same model, enabling feature-level comparison between base and fine-tuned representations. This release contains the base model SAE only.

Model Details

Property             Value
Base model           Qwen 3.5 9B
Activation source    Layer 16 MLP output
Input dimension      4,096
SAE dimension        16,384 (4× expansion)
Active features      16,384 / 16,384 (0 dead features)
L1 coefficient       0.005
Final loss           0.0062
Training steps       12,208
Training tokens      ~50M

Architecture

Standard sparse autoencoder with ReLU activation:

Input (4096) β†’ Encoder (4096 β†’ 16384) + ReLU β†’ Decoder (16384 β†’ 4096)
  • encoder.weight: (16384, 4096) β€” maps activations to sparse features
  • encoder.bias: (16384,) β€” encoder bias
  • decoder.weight: (4096, 16384) β€” reconstructs from sparse features
  • bias: (4096,) — pre-encoder bias (subtracted from the input before encoding, added back after decoding)

Usage

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in=4096, d_sae=16384):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(d_in))
        self.encoder = nn.Linear(d_in, d_sae)
        self.decoder = nn.Linear(d_sae, d_in, bias=False)
    
    def forward(self, x):
        x_centered = x - self.bias
        z = torch.relu(self.encoder(x_centered))
        x_hat = self.decoder(z) + self.bias
        return x_hat, z

# Load
sae = SparseAutoencoder()
ckpt = torch.load("sae_base_best.pt", map_location="cpu")
sae.load_state_dict(ckpt["model_state_dict"])

# Use with Qwen 3.5 9B
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-9B", torch_dtype=torch.float16, device_map="auto")

# Hook into layer 16 MLP and capture SAE features
captured = []

def hook_fn(module, input, output):
    # MLP output is (batch, seq, 4096); cast/move to match the SAE (fp32, CPU here)
    acts = output.detach().float().to(next(sae.parameters()).device)
    reconstructed, features = sae(acts)
    captured.append(features)  # (batch, seq, 16384) sparse feature activations
    return output  # pass the original activations through unchanged

handle = model.model.layers[16].mlp.register_forward_hook(hook_fn)
# call handle.remove() when done
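As a quick sanity check before wiring the SAE into the model, the feature activations can be probed for sparsity by measuring mean L0 (the number of active features per token). A self-contained sketch, with a randomly initialized encoder standing in for the trained SAE weights and random data standing in for real layer-16 activations:

```python
import torch
import torch.nn as nn

# Sketch: mean L0 sparsity (active features per token).
# Random encoder and random data are stand-ins for the trained SAE and
# real layer-16 MLP activations; a trained SAE should give a much lower L0.
enc = nn.Linear(4096, 16384)
acts = torch.randn(2, 8, 4096)            # stand-in for (batch, seq, d_in)
z = torch.relu(enc(acts))                 # sparse feature activations
l0 = (z > 0).float().sum(dim=-1).mean()   # mean count of active features per token
print(f"mean L0: {l0.item():.1f}")
```

With a trained SAE, substitute `z` from `sae(acts)` and real activations captured by the hook above.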

Research Context

This SAE is part of a comparative study examining how fine-tuning alters a model's internal representations at the feature level. Two SAEs with identical architectures were trained:

  1. Base SAE (this release) β€” trained on activations from the unmodified Qwen 3.5 9B base model
  2. Fine-tuned SAE (not released) β€” trained on activations from a personality-fine-tuned variant of the same model

By comparing the learned features between the two SAEs, we identified features unique to the fine-tuned model that correspond to personality-specific behaviors, including a phenomenon we term "memorization without grounding" β€” where the fine-tuned model recombines real memorized details into plausible but fictional scenarios.

Key findings:

  • All 16,384 features active in both SAEs (0 dead features)
  • Cosine similarity analysis between base and fine-tuned feature sets reveals distinct clusters unique to each model
  • The fine-tuned model develops features not present in the base that correlate with persona-consistent text generation
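The cosine-similarity comparison above can be sketched as follows. This is a hypothetical illustration (the fine-tuned SAE is not released): it matches each fine-tuned decoder direction against all base directions and flags features with no close base counterpart. The helper name `novel_features` and the 0.7 threshold are choices of this sketch, not the study's exact procedure.

```python
import torch
import torch.nn.functional as F

def novel_features(W_base, W_ft, threshold=0.7):
    """Columns of W_base / W_ft, shape (d_in, d_sae), are decoder feature
    directions. For each fine-tuned direction, find its best cosine match
    among base directions; return indices whose best match is below threshold."""
    B = F.normalize(W_base, dim=0)        # unit-norm columns
    T = F.normalize(W_ft, dim=0)
    best_sim, _ = (T.T @ B).max(dim=1)    # (d_sae,) best match per ft feature
    return (best_sim < threshold).nonzero(as_tuple=True)[0]

# Toy demo with random directions; real usage would pass decoder.weight
# from the two checkpoints (shape (4096, 16384) each).
torch.manual_seed(0)
W_a = torch.randn(64, 256)
W_b = torch.randn(64, 256)
idx = novel_features(W_a, W_b, threshold=0.7)
```

Comparing an SAE against itself returns an empty set, since every direction matches itself with cosine similarity 1.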

Training Details

  • Base model: Qwen 3.5 9B (Qwen/Qwen3.5-9B)
  • Activation source: MLP output at layer 16
  • Data: monology/pile-uncopyrighted, streamed, ~50M tokens
  • Context length: 512
  • Collection batch size: 8
  • Collection speed: ~5,440 tokens/sec
  • Activations: Saved in chunks to disk (~14GB per chunk) to avoid OOM, then streamed during SAE training
  • SAE training batch size: 4,096
  • Optimizer: Adam, lr=5e-5
  • Loss: MSE reconstruction + L1 sparsity penalty (Ξ»=0.005)
  • Epochs: 1
  • Hardware: NVIDIA RTX 4090 (24GB)
  • Total time: ~4 hours (collection + training)
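The objective above (MSE reconstruction plus an L1 sparsity penalty with λ = 0.005) can be sketched as a loss function. The exact L1 normalization here (summed over features, averaged over tokens) is an assumption; implementations differ on this detail.

```python
import torch

# Sketch of the training objective: MSE reconstruction + lambda * L1 penalty.
# Normalization of the L1 term (sum over features, mean over tokens) is an
# assumption, not confirmed by the release.
def sae_loss(x, x_hat, z, l1_coeff=0.005):
    mse = (x_hat - x).pow(2).mean()      # reconstruction error
    l1 = z.abs().sum(dim=-1).mean()      # per-token L1 of feature activations
    return mse + l1_coeff * l1
```

In a training loop this would be computed on `x_hat, z = sae(x)` and followed by the usual `loss.backward()` / `optimizer.step()`.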

Files

  • sae_base_best.pt β€” Best checkpoint (lowest loss)
  • sae_base_final.pt β€” Final checkpoint (last step)
  • config.json β€” Model configuration
  • analysis/feature_stats.json β€” Per-feature activation statistics

License

Apache 2.0

Citation

@misc{kroonen2026sae-qwen,
  title={Sparse Autoencoder for Qwen 3.5 9B: Comparative Mechanistic Interpretability},
  author={Kroonen AI},
  year={2026},
  publisher={Kroonen AI Inc.}
}

Blog Post

Read the full write-up: Mapping the Mind of Qwen 3.5 9B

About

Built by Kroonen AI Inc.
