Sparse Autoencoder for Qwen 3.5 9B β€” Interpretability Research

A sparse autoencoder (SAE) trained on the MLP activations of the Qwen 3.5 9B base model, developed as part of a comparative mechanistic interpretability study.

This SAE was trained alongside a second SAE on a fine-tuned variant of the same model, enabling feature-level comparison between base and fine-tuned representations. This release contains the base model SAE only.

Model Details

Property             Value
Base model           Qwen 3.5 9B
Activation source    Layer 16 MLP output
Input dimension      4,096
SAE dimension        16,384 (4× expansion)
Active features      16,384 / 16,384 (0 dead features)
L1 coefficient       0.005
Final loss           0.0062
Training steps       12,208
Training tokens      ~50M

Architecture

Standard sparse autoencoder with ReLU activation:

Input (4096) β†’ Encoder (4096 β†’ 16384) + ReLU β†’ Decoder (16384 β†’ 4096)
  • encoder.weight: (16384, 4096) β€” maps activations to sparse features
  • encoder.bias: (16384,) β€” encoder bias
  • decoder.weight: (4096, 16384) β€” reconstructs from sparse features
  • bias: (4096,) — pre-encoder bias (subtracted from the input before encoding, added back after decoding)

Usage

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in=4096, d_sae=16384):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(d_in))
        self.encoder = nn.Linear(d_in, d_sae)
        self.decoder = nn.Linear(d_sae, d_in, bias=False)
    
    def forward(self, x):
        x_centered = x - self.bias
        z = torch.relu(self.encoder(x_centered))
        x_hat = self.decoder(z) + self.bias
        return x_hat, z

# Load
sae = SparseAutoencoder()
ckpt = torch.load("sae_base_best.pt", map_location="cpu")
sae.load_state_dict(ckpt["model_state_dict"])

# Use with Qwen 3.5 9B
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-9B", torch_dtype=torch.float16, device_map="auto")

# Hook into layer 16 MLP and capture SAE features
captured = []

def hook_fn(module, input, output):
    # MLP output is (batch, seq, 4096); cast/move to match the SAE (fp32, CPU here)
    acts = output.detach().float().to(next(sae.parameters()).device)
    reconstructed, features = sae(acts)
    captured.append(features)  # (batch, seq, 16384) sparse feature activations
    return output  # pass the original activations through unchanged

handle = model.model.layers[16].mlp.register_forward_hook(hook_fn)
# call handle.remove() when done
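As a quick sanity check before wiring the SAE into the model, the feature activations can be probed for sparsity by measuring mean L0 (the number of active features per token). A self-contained sketch, with a randomly initialized encoder standing in for the trained SAE weights and random data standing in for real layer-16 activations:

```python
import torch
import torch.nn as nn

# Sketch: mean L0 sparsity (active features per token).
# Random encoder and random data are stand-ins for the trained SAE and
# real layer-16 MLP activations; a trained SAE should give a much lower L0.
enc = nn.Linear(4096, 16384)
acts = torch.randn(2, 8, 4096)            # stand-in for (batch, seq, d_in)
z = torch.relu(enc(acts))                 # sparse feature activations
l0 = (z > 0).float().sum(dim=-1).mean()   # mean count of active features per token
print(f"mean L0: {l0.item():.1f}")
```

With a trained SAE, substitute `z` from `sae(acts)` and real activations captured by the hook above.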

Research Context

This SAE is part of a comparative study examining how fine-tuning alters a model's internal representations at the feature level. Two SAEs with identical architectures were trained:

  1. Base SAE (this release) β€” trained on activations from the unmodified Qwen 3.5 9B base model
  2. Fine-tuned SAE (not released) β€” trained on activations from a personality-fine-tuned variant of the same model

By comparing the learned features between the two SAEs, we identified features unique to the fine-tuned model that correspond to personality-specific behaviors, including a phenomenon we term "memorization without grounding" β€” where the fine-tuned model recombines real memorized details into plausible but fictional scenarios.

Key findings:

  • All 16,384 features active in both SAEs (0 dead features)
  • Cosine similarity analysis between base and fine-tuned feature sets reveals distinct clusters unique to each model
  • The fine-tuned model develops features not present in the base that correlate with persona-consistent text generation
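The cosine-similarity comparison above can be sketched as follows. This is a hypothetical illustration (the fine-tuned SAE is not released): it matches each fine-tuned decoder direction against all base directions and flags features with no close base counterpart. The helper name `novel_features` and the 0.7 threshold are choices of this sketch, not the study's exact procedure.

```python
import torch
import torch.nn.functional as F

def novel_features(W_base, W_ft, threshold=0.7):
    """Columns of W_base / W_ft, shape (d_in, d_sae), are decoder feature
    directions. For each fine-tuned direction, find its best cosine match
    among base directions; return indices whose best match is below threshold."""
    B = F.normalize(W_base, dim=0)        # unit-norm columns
    T = F.normalize(W_ft, dim=0)
    best_sim, _ = (T.T @ B).max(dim=1)    # (d_sae,) best match per ft feature
    return (best_sim < threshold).nonzero(as_tuple=True)[0]

# Toy demo with random directions; real usage would pass decoder.weight
# from the two checkpoints (shape (4096, 16384) each).
torch.manual_seed(0)
W_a = torch.randn(64, 256)
W_b = torch.randn(64, 256)
idx = novel_features(W_a, W_b, threshold=0.7)
```

Comparing an SAE against itself returns an empty set, since every direction matches itself with cosine similarity 1.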

Training Details

  • Base model: Qwen 3.5 9B (Qwen/Qwen3.5-9B)
  • Activation source: MLP output at layer 16
  • Data: monology/pile-uncopyrighted, streamed, ~50M tokens
  • Context length: 512
  • Collection batch size: 8
  • Collection speed: ~5,440 tokens/sec
  • Activations: Saved in chunks to disk (~14GB per chunk) to avoid OOM, then streamed during SAE training
  • SAE training batch size: 4,096
  • Optimizer: Adam, lr=5e-5
  • Loss: MSE reconstruction + L1 sparsity penalty (Ξ»=0.005)
  • Epochs: 1
  • Hardware: NVIDIA RTX 4090 (24GB)
  • Total time: ~4 hours (collection + training)
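The objective above (MSE reconstruction plus an L1 sparsity penalty with λ = 0.005) can be sketched as a loss function. The exact L1 normalization here (summed over features, averaged over tokens) is an assumption; implementations differ on this detail.

```python
import torch

# Sketch of the training objective: MSE reconstruction + lambda * L1 penalty.
# Normalization of the L1 term (sum over features, mean over tokens) is an
# assumption, not confirmed by the release.
def sae_loss(x, x_hat, z, l1_coeff=0.005):
    mse = (x_hat - x).pow(2).mean()      # reconstruction error
    l1 = z.abs().sum(dim=-1).mean()      # per-token L1 of feature activations
    return mse + l1_coeff * l1
```

In a training loop this would be computed on `x_hat, z = sae(x)` and followed by the usual `loss.backward()` / `optimizer.step()`.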

Files

  • sae_base_best.pt β€” Best checkpoint (lowest loss)
  • sae_base_final.pt β€” Final checkpoint (last step)
  • config.json β€” Model configuration
  • analysis/feature_stats.json β€” Per-feature activation statistics

License

Apache 2.0

Citation

@misc{kroonen2026sae-qwen,
  title={Sparse Autoencoder for Qwen 3.5 9B: Comparative Mechanistic Interpretability},
  author={Kroonen AI},
  year={2026},
  publisher={Kroonen AI Inc.}
}

Blog Post

Read the full write-up: Mapping the Mind of Qwen 3.5 9B

About

Built by Kroonen AI Inc.
