sthyland committed
Commit d445365 · verified · 1 Parent(s): 7ae648e

add model card

Files changed (1): README.md (+160, -3)

README.md CHANGED
# Model Card for MAIRA-2-SAE

This is a collection of sparse autoencoders (SAEs) trained on the residual stream of layer 15 of [MAIRA-2](https://huggingface.co/microsoft/maira-2), and described in the preprint ['Insights into a radiology-specialised multimodal large language model with sparse autoencoders'](https://arxiv.org/abs/2507.12950), presented at the [Actionable Interpretability Workshop @ ICML 2025](https://actionable-interpretability.github.io/).

## Model Details

A sparse autoencoder is a model that performs two functions:
- Encoding some input (in this case, model activations) into a "latent space" (in this case, one that is higher-dimensional than its input)
- Decoding from the "latent space" back into the input space

SAEs encode inputs such that only a small number of latent dimensions (we call these features) are active for any given input.

Specifically, these are Matryoshka BatchTopK SAEs, as described in [Learning Multi-Level Features with Matryoshka Sparse Autoencoders](https://arxiv.org/abs/2503.17547). Importantly, the decoder is linear, so the SAE reconstructs model activations as a linear combination of (putatively) interpretable feature directions.
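
To make this concrete, here is a minimal illustrative sketch of the encode/decode computation. This is not the `dictionary_learning` implementation: the pre-bias subtraction and the batch-level top-k thresholding below are simplifying assumptions.

```python
import torch

# Illustrative only: a BatchTopK-style sparse autoencoder forward pass.
activation_dim, dict_size, k = 4096, 8192, 256  # e.g. expansion factor 2

W_enc = torch.randn(activation_dim, dict_size)
b_enc = torch.zeros(dict_size)
W_dec = torch.randn(dict_size, activation_dim)  # rows are feature directions
b_dec = torch.zeros(activation_dim)

x = torch.randn(8, activation_dim)  # a batch of model activations

# Encode: affine map + ReLU, then keep only the largest k * batch_size values in the
# batch, so that on average k features are active per sample
pre_acts = torch.relu((x - b_dec) @ W_enc + b_enc)
threshold = pre_acts.flatten().topk(k * x.shape[0]).values.min()
features = torch.where(pre_acts >= threshold, pre_acts, torch.zeros_like(pre_acts))

# Decode: the reconstruction is a sparse linear combination of feature directions
x_hat = features @ W_dec + b_dec
```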

We release SAEs with expansion factors 2, 4, and 8. For SAEs with expansion factors 2 and 4, we also provide LLM-generated interpretations of each feature and their corresponding interpretability scores.
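
With MAIRA-2's layer-15 residual stream dimension of 4096 (see the usage example below), these expansion factors correspond to dictionaries of 8192, 16384, and 32768 features respectively.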

### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** Microsoft Research Health Futures
- **Model type:** Autoencoder
- **License:** MIT


## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
These SAEs are shared for research purposes only. Their intended use is interpretability analysis of MAIRA-2. Given MAIRA-2 and a data example (e.g. from MIMIC-CXR), one can retrieve the activation strength of every SAE feature. These activations can be used to ascribe interpretations to SAE features, or such interpretations can in turn be used to analyse the workings of MAIRA-2.

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
Use of these SAEs requires access to MAIRA-2: see the [MAIRA-2 model card](https://huggingface.co/microsoft/maira-2) for details.
Assuming one has extracted the residual stream from layer 15 of MAIRA-2 and processed the activations as described in [the preprint](https://arxiv.org/abs/2507.12950), the SAE can be used to encode this representation into a higher-dimensional space more suitable for interpretation.
We provide a usage example below.

Analyses of the SAEs themselves are also possible, for example by inspecting the learned dictionary elements (the decoder layer). In this case, the provided feature interpretations may be useful; however, we stress that only a subset of features have meaningful interpretations.

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
These SAEs were trained on MAIRA-2 activations collected from the MIMIC-CXR findings generation subset of the original MAIRA-2 training dataset. Hence, they may not perform well (in the sense of reconstruction) on other datasets or tasks, whether within MAIRA-2's training distribution (e.g. PadChest, [PadChest-GR](https://ai.nejm.org/doi/full/10.1056/AIdbp2401120)) or on datasets MAIRA-2 was not trained on. Any non-research use of these SAEs is out of scope.

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->
As above, the SAEs were trained and interpreted using the MIMIC-CXR subset of the MAIRA-2 training data. MIMIC-CXR represents a cohort of patients from a single hospital in the USA. Inferences made about MAIRA-2 using these SAEs will necessarily be limited to concepts which could plausibly be discovered using MIMIC-CXR.

## How to Get Started with the Model

### Setup

Install [dictionary_learning](https://github.com/saprmarks/dictionary_learning) with `pip install dictionary-learning` or `uv add dictionary-learning`.

We used `dictionary_learning` as a submodule at commit `07975f7`, which corresponds to version `0.1.0`.

#### Download weights from the hub

Option 1: Download a single SAE with a specified expansion factor

```python
from huggingface_hub import hf_hub_download

expansion_factor = 2
model_name = f"layer15_res_matryoshka_k256_ef{expansion_factor}.pt"
# Each expansion factor has its own subfolder
ef_subfolder = f"ef{expansion_factor}"
# Specify your own local download directory here if you want
local_dir = "./"
local_path = hf_hub_download(repo_id="microsoft/maira-2-sae", subfolder=ef_subfolder, filename=model_name, local_dir=local_dir)
```
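
The returned `local_path` points to the downloaded `.pt` checkpoint and can be passed directly to `from_pretrained` in the usage example below.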

Option 2: Download all SAEs

```python
from huggingface_hub import snapshot_download

# Specify your own local download directory here if you want
local_dir = "./"
snapshot_download(repo_id="microsoft/maira-2-sae", local_dir=local_dir)
```

### Use the SAE to get feature activations

```python
import torch
from dictionary_learning.trainers.matryoshka_batch_top_k import MatryoshkaBatchTopKSAE

# local_path is the path to the dictionary weights (.pt file), however you downloaded them
ae = MatryoshkaBatchTopKSAE.from_pretrained(local_path)

# get NN activations using your preferred method: hooks, transformer_lens, nnsight, etc. ...
# for now we'll just use random activations
activation_dim = 4096
activations = torch.randn(64, activation_dim)
features = ae.encode(activations)  # get features from activations
reconstructed_activations = ae.decode(features)

# you can also just get the reconstruction ...
reconstructed_activations = ae(activations)
# ... or get the features and reconstruction at the same time
reconstructed_activations, features = ae(activations, output_features=True)
```
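
As a follow-up sketch (reusing `ae`, `activations`, and `features` from the block above), one can check the sparsity of the encoding, list the strongest features for a sample, and measure reconstruction quality. With the random activations used above these numbers are not meaningful; substitute real MAIRA-2 layer-15 activations.

```python
# Sparsity: with k = 256, roughly 256 features should be active per token on average
mean_l0 = (features != 0).float().sum(dim=-1).mean()
print(f"mean number of active features per token: {mean_l0:.1f}")

# Strongest features for the first sample - candidates for interpretation
values, feature_ids = features[0].topk(10)
print("top features for sample 0:", feature_ids.tolist())

# Reconstruction quality: fraction of variance unexplained (lower is better)
reconstructed = ae(activations)
fvu = ((activations - reconstructed) ** 2).sum() / ((activations - activations.mean(dim=0)) ** 2).sum()
print(f"fraction of variance unexplained: {fvu:.3f}")
```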

## Training Details

### Training Data

We collected activations from the residual stream of layer 15 of MAIRA-2 using the MIMIC-CXR subset of the [MAIRA-2 training/validation set](https://arxiv.org/abs/2406.04449). As detailed in [our preprint](https://arxiv.org/abs/2507.12950), we collected activations from all tokens in the sequence excluding image tokens and boilerplate/templated subsequences. This resulted in 34.7M tokens for training and 1.7M for validation (respecting the splits used to train MAIRA-2). Following [Gao et al.](https://arxiv.org/abs/2406.04093), we scaled all activations by a normalization factor of 22.34, the mean L2 norm of the training samples.
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
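
For reference, a sketch of this preprocessing step. The direction of the scaling is an assumption here (we take "scaled with a normalization factor" to mean dividing by it); consult the preprint for the exact convention.

```python
import torch

# Assumption: activations are divided by the mean L2 norm of the training samples (22.34)
# before being passed to the SAE - check the preprint for the exact convention.
NORMALIZATION_FACTOR = 22.34

def normalize(activations: torch.Tensor) -> torch.Tensor:
    return activations / NORMALIZATION_FACTOR

raw_activations = torch.randn(64, 4096)  # placeholder for real MAIRA-2 layer-15 activations
sae_inputs = normalize(raw_activations)
```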

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
We trained the SAEs with the `MatryoshkaBatchTopKTrainer` from the open-source [dictionary_learning](https://github.com/saprmarks/dictionary_learning) library.

#### Training Hyperparameters

- Matryoshka group fractions: [1/2, 1/4, 1/8, 1/16, 1/16]
- k (mean L0 per token, applied at the batch level): 256
- Batch size: 8192
- Epochs: 1
- Expansion factors: 2, 4, 8 (one model per expansion factor)

Further hyperparameters are listed in [the preprint](https://arxiv.org/abs/2507.12950).
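
To make the group fractions concrete, the following sketch computes the nested (Matryoshka) dictionary sizes they imply for each expansion factor, assuming the fractions are taken of the full dictionary size. This is just arithmetic on the numbers above, not code from the training pipeline.

```python
# Nested group sizes implied by the Matryoshka group fractions.
activation_dim = 4096
group_fractions = [1 / 2, 1 / 4, 1 / 8, 1 / 16, 1 / 16]

for expansion_factor in (2, 4, 8):
    dict_size = expansion_factor * activation_dim
    group_sizes = [int(fraction * dict_size) for fraction in group_fractions]
    print(expansion_factor, dict_size, group_sizes)

# e.g. expansion factor 2: dict_size 8192, groups [4096, 2048, 1024, 512, 512]
```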

## Automated Interpretation

For SAEs with expansion factors 2 and 4, we also provide automatically-generated interpretations of each feature, again as described in [our preprint](https://arxiv.org/abs/2507.12950). These are the files `autointerp_layer15_res_matryoshka_k256_ef{2,4}.csv`.

These interpretations were generated by showing GPT-4o data samples selected based on the activation strength of that feature. Note that we did not show GPT-4o the images, so these interpretations are necessarily limited. We did not run full automated interpretation on the expansion factor 8 SAE due to its large number of features (32,768).

We scored the quality of the interpretations using the detection scoring approach from [Automatically Interpreting Millions of Features in Large Language Models](https://arxiv.org/abs/2410.13928), wherein the interpretation is provided to an LLM judge (again, GPT-4o), which predicts whether a new sample will activate the feature. We provide binary classification metrics (accuracy, precision, recall, and F1) for each feature, computed both on the 'train' samples (those used to generate the interpretation) and on held-out validation samples, as a measure of interpretability. We also provide statistics on how often each feature was observed to activate in a random subset of the training set (n), to facilitate further analyses.
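
If you want to work with these interpretation files programmatically, a minimal sketch is below. The column names are not documented here, so the filter shown is a hypothetical example; inspect `df.columns` for the actual schema.

```python
import pandas as pd

# Load the automated interpretations for the expansion-factor-2 SAE
df = pd.read_csv("autointerp_layer15_res_matryoshka_k256_ef2.csv")
print(df.columns.tolist())

# Hypothetical example: keep features whose validation F1 exceeds 0.8
# (replace "f1_validation" with the actual column name from the CSV)
# well_interpreted = df[df["f1_validation"] > 0.8]
```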

## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

```
@article{maira2sae,
  title={Insights into a radiology-specialised multimodal large language model with sparse autoencoders},
  author={Kenza Bouzid and Shruthi Bannur and Felix Meissen and Daniel Coelho de Castro and Anton Schwaighofer and Javier Alvarez-Valle and Stephanie L. Hyland},
  journal={Actionable Interpretability Workshop @ ICML 2025},
  year={2025},
  url={https://arxiv.org/abs/2507.12950}
}
```

**APA:**

> Bouzid, K., Bannur, S., Meissen, F., Coelho de Castro, D., Schwaighofer, A., Alvarez-Valle, J., & Hyland, S. L. (2025). Insights into a radiology-specialised multimodal large language model with sparse autoencoders. *Actionable Interpretability Workshop @ ICML 2025*. [arXiv](https://arxiv.org/abs/2507.12950).

## Model Card Contact

- Stephanie Hyland ([`[email protected]`](mailto:[email protected]))
- Kenza Bouzid ([`[email protected]`](mailto:[email protected]))