sthyland committed · Commit 6406e40 · verified · 1 parent: c0c6434

clarify that ef=4 is the one from the paper

Files changed (1): README.md (+3, −1)
README.md CHANGED

```diff
@@ -7,6 +7,9 @@ language:
 
 This is a collection of sparse autoencoders (SAEs) trained on the residual stream of layer 15 of [MAIRA-2](https://huggingface.co/microsoft/maira-2), and described in the preprint ['Insights into a radiology-specialised multimodal large language model with sparse autoencoders'](https://arxiv.org/abs/2507.12950), presented at the [Actionable Interpretability Workshop @ ICML 2025](https://actionable-interpretability.github.io/).
 
+In the preprint, we primarily study an SAE with expansion factor 4. Here we also release SAEs with expansion factors 2 and 8 to enable additional analyses. For expansion factors 2 and 4, we also provide LLM-generated interpretations of each feature and their corresponding interpretability scores.
+
+
 ## Model Details
 
 A sparse autoencoder is a model which provides for two functions:
@@ -17,7 +20,6 @@ SAEs encode such that only a small number of latent dimensions (we call these fe
 
 Specifically these are Matryoshka BatchTopK SAEs, which are described in [Learning Multi-Level Features with Matryoshka Sparse Autoencoders](https://arxiv.org/abs/2503.17547). Importantly, the decoder is linear, hence the SAE serves to reconstruct model activations as a linear combination of (putatively) interpretable feature directions.
 
-We release SAEs with expansion factors 2, 4, and 8. For SAEs with expansion factors 2 and 4, we also provide LLM-generated interpretations of each feature and their corresponding interpretability scores.
 
 ### Model Description
```
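The diff's context paragraphs describe the two functions an SAE provides: a sparse encoder and a linear decoder that reconstructs activations as a linear combination of feature directions. Below is a minimal NumPy sketch of that forward pass. All dimensions, weights, and names here are hypothetical, and for simplicity it applies TopK per example, whereas the released BatchTopK SAEs select the top activations jointly across a batch.

```python
import numpy as np

# Hypothetical sizes, not MAIRA-2's actual configuration.
d_model = 8                  # residual-stream width
expansion = 4                # expansion factor (the one studied in the preprint)
d_sae = d_model * expansion  # number of SAE features
k = 2                        # active features kept per example

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    """Encode activations, then zero all but the top-k features per example."""
    pre = np.maximum(x @ W_enc + b_enc, 0.0)          # ReLU pre-activations
    f = np.zeros_like(pre)
    idx = np.argsort(pre, axis=-1)[..., -k:]          # indices of top-k features
    np.put_along_axis(f, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    return f

def decode(f):
    """Linear decoder: reconstruction is a linear combination of the
    (putatively interpretable) feature directions in W_dec."""
    return f @ W_dec + b_dec

x = rng.normal(size=(3, d_model))  # a batch of 3 activation vectors
f = encode(x)
x_hat = decode(f)
assert (f != 0).sum(axis=-1).max() <= k  # sparsity: at most k active features
```

Because the decoder is purely linear, each reconstruction `x_hat` decomposes exactly into `f @ W_dec + b_dec`, i.e. a weighted sum of at most `k` rows of `W_dec`, which is what makes the learned features amenable to interpretation.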