clarify that ef=4 is the one from the paper
README.md CHANGED

@@ -7,6 +7,9 @@ language:
 
 This is a collection of sparse autoencoders (SAEs) trained on the residual stream of layer 15 of [MAIRA-2](https://huggingface.co/microsoft/maira-2), and described in the preprint ['Insights into a radiology-specialised multimodal large language model with sparse autoencoders'](https://arxiv.org/abs/2507.12950), presented at the [Actionable Interpretability Workshop @ ICML 2025](https://actionable-interpretability.github.io/).
 
+In the preprint, we primarily study an SAE with expansion factor 4. Here we also release SAEs with expansion factors 2 and 8 to enable additional analyses. For expansion factors 2 and 4, we also provide LLM-generated interpretations of each feature and their corresponding interpretability scores.
+
+
 ## Model Details
 
 A sparse autoencoder is a model that provides two functions:
@@ -17,7 +20,6 @@ SAEs encode such that only a small number of latent dimensions (we call these features)
 
 Specifically, these are Matryoshka BatchTopK SAEs, which are described in [Learning Multi-Level Features with Matryoshka Sparse Autoencoders](https://arxiv.org/abs/2503.17547). Importantly, the decoder is linear, hence the SAE serves to reconstruct model activations as a linear combination of (putatively) interpretable feature directions.
 
-We release SAEs with expansion factors 2, 4, and 8. For SAEs with expansion factors 2 and 4, we also provide LLM-generated interpretations of each feature and their corresponding interpretability scores.
 
 ### Model Description
 
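For illustration, here is a minimal sketch of the encode/decode round trip the changed README text describes: activations are encoded so that only a small number of features are active, and the linear decoder reconstructs them as a linear combination of feature directions, with the dictionary size set by the expansion factor (here 4, the configuration studied in the preprint). All parameter names, shapes, and the per-token TopK simplification are illustrative assumptions, not the released implementation or checkpoint format.

```python
# Sketch only: shapes and names are assumptions, not the checkpoint format.
import torch

d_model = 4096                       # residual-stream width (assumed)
expansion_factor = 4                 # the SAE studied in the preprint
d_sae = expansion_factor * d_model   # number of SAE features (dictionary size)
k = 64                               # active features per token (assumed)

W_enc = torch.randn(d_model, d_sae) / d_model**0.5
b_enc = torch.zeros(d_sae)
W_dec = torch.randn(d_sae, d_model) / d_sae**0.5
b_dec = torch.zeros(d_model)

def encode(x: torch.Tensor) -> torch.Tensor:
    """Sparse feature activations: keep only the k largest pre-activations
    per token (BatchTopK instead applies the k budget across the batch)."""
    pre = torch.relu((x - b_dec) @ W_enc + b_enc)
    topk = torch.topk(pre, k, dim=-1)
    f = torch.zeros_like(pre)
    f.scatter_(-1, topk.indices, topk.values)
    return f

def decode(f: torch.Tensor) -> torch.Tensor:
    """Linear decoder: the reconstruction is a linear combination of the
    (putatively interpretable) feature directions, the rows of W_dec."""
    return f @ W_dec + b_dec

x = torch.randn(8, d_model)   # a batch of layer-15 residual activations
x_hat = decode(encode(x))     # reconstruction of the activations
```

The Matryoshka aspect (nested feature groups trained at multiple dictionary sizes) is omitted here for brevity; it changes the training objective, not the basic encode/decode interface sketched above.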