clarify that ef=4 is the one from the paper
README.md CHANGED

@@ -7,6 +7,9 @@ language:
 
 This is a collection of sparse autoencoders (SAEs) trained on the residual stream of layer 15 of [MAIRA-2](https://huggingface.co/microsoft/maira-2), and described in the preprint ['Insights into a radiology-specialised multimodal large language model with sparse autoencoders'](https://arxiv.org/abs/2507.12950), presented at the [Actionable Interpretability Workshop @ ICML 2025](https://actionable-interpretability.github.io/).
 
+In the preprint, we primarily study an SAE with expansion factor 4. Here we also release SAEs with expansion factors 2 and 8 to enable additional analyses. For expansion factors 2 and 4, we also provide LLM-generated interpretations of each feature and their corresponding interpretability scores.
+
+
 ## Model Details
 
 A sparse autoencoder is a model that provides two functions:
@@ -17,7 +20,6 @@ SAEs encode such that only a small number of latent dimensions (we call these features)
 
 Specifically, these are Matryoshka BatchTopK SAEs, which are described in [Learning Multi-Level Features with Matryoshka Sparse Autoencoders](https://arxiv.org/abs/2503.17547). Importantly, the decoder is linear, hence the SAE serves to reconstruct model activations as a linear combination of (putatively) interpretable feature directions.
 
-We release SAEs with expansion factors 2, 4, and 8. For SAEs with expansion factors 2 and 4, we also provide LLM-generated interpretations of each feature and their corresponding interpretability scores.
 
 ### Model Description
 
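For illustration, here is a minimal sketch of the encode/decode round trip the changed README text describes: activations are encoded so that only a small number of features are active, and the linear decoder reconstructs them as a linear combination of feature directions, with the dictionary size set by the expansion factor (here 4, the configuration studied in the preprint). All parameter names, shapes, and the per-token TopK simplification are illustrative assumptions, not the released implementation or checkpoint format.

```python
# Sketch only: shapes and names are assumptions, not the checkpoint format.
import torch

d_model = 4096                       # residual-stream width (assumed)
expansion_factor = 4                 # the SAE studied in the preprint
d_sae = expansion_factor * d_model   # number of SAE features (dictionary size)
k = 64                               # active features per token (assumed)

W_enc = torch.randn(d_model, d_sae) / d_model**0.5
b_enc = torch.zeros(d_sae)
W_dec = torch.randn(d_sae, d_model) / d_sae**0.5
b_dec = torch.zeros(d_model)

def encode(x: torch.Tensor) -> torch.Tensor:
    """Sparse feature activations: keep only the k largest pre-activations
    per token (BatchTopK instead applies the k budget across the batch)."""
    pre = torch.relu((x - b_dec) @ W_enc + b_enc)
    topk = torch.topk(pre, k, dim=-1)
    f = torch.zeros_like(pre)
    f.scatter_(-1, topk.indices, topk.values)
    return f

def decode(f: torch.Tensor) -> torch.Tensor:
    """Linear decoder: the reconstruction is a linear combination of the
    (putatively interpretable) feature directions, the rows of W_dec."""
    return f @ W_dec + b_dec

x = torch.randn(8, d_model)   # a batch of layer-15 residual activations
x_hat = decode(encode(x))     # reconstruction of the activations
```

The Matryoshka aspect (nested feature groups trained at multiple dictionary sizes) is omitted here for brevity; it changes the training objective, not the basic encode/decode interface sketched above.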