Towards Deep Learning Models Resistant to Adversarial Attacks Paper • 1706.06083 • Published Jun 19, 2017
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching Paper • 2311.17030 • Published Nov 28, 2023
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control Paper • 2405.08366 • Published May 14, 2024