I ran Anthropic's Agentic Misalignment framework on a few top models and published the results as a dataset:
cfahlgren1/anthropic-agentic-misalignment-results
You can read the reasoning traces of the models trying to blackmail the user and perform other actions. It's very interesting!!