Towards Deep Learning Models Resistant to Adversarial Attacks Paper • 1706.06083 • Published Jun 19, 2017
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching Paper • 2311.17030 • Published Nov 28, 2023
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control Paper • 2405.08366 • Published May 14, 2024