Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs

Jean-Charles Noirot Ferrand, Yohan Beugin, Eric Pauley, Ryan Sheatsley, Patrick McDaniel
arXiv Preprint (2025)