Zachary Coalson, Beth Sohler, Aiden Gabriel, Sanghyun Hong
The paper introduces 'fail-closed alignment' for large language models to enhance safety by ensuring refusal mechanisms remain effective even if part of the system is compromised.
Researchers have identified a vulnerability in large language models (LLMs): their refusal mechanisms can fail when the internal features that mediate refusal are suppressed, leading to unsafe outputs. To address this, they propose a 'fail-closed alignment' strategy in which refusal is carried by multiple independent pathways, so that refusal continues to function even if some of those pathways are disabled. The approach has been tested and shown to make LLMs more robust against such feature-suppression attacks while maintaining the quality of their outputs.
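The intuition can be sketched roughly as follows. The snippet below is a minimal, illustrative PyTorch sketch, not the paper's implementation: it assumes the suppressed "features" are directions in the model's hidden activations, and the names `ablate_direction`, `fail_closed_refusal`, the probe vectors, and the threshold are hypothetical, used only to show why a single refusal pathway fails open when its direction is removed, whereas several independent pathways must all be suppressed before refusal fails.

```python
import torch

def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of `hidden` along `direction`.

    If refusal were mediated by this single direction, projecting it out
    would suppress refusal and the model would "fail open".
    """
    d = direction / direction.norm()
    return hidden - (hidden @ d).unsqueeze(-1) * d

def fail_closed_refusal(hidden: torch.Tensor, pathways: list[torch.Tensor],
                        threshold: float = 0.0) -> bool:
    """Refuse if ANY independent pathway still flags the request.

    Each pathway is an illustrative probe vector; an attacker would have to
    suppress all of them simultaneously for refusal to fail.
    """
    scores = [float(hidden @ (p / p.norm())) for p in pathways]
    return any(s > threshold for s in scores)

if __name__ == "__main__":
    d_model = 8
    hidden = torch.randn(d_model)          # a single residual-stream vector
    refusal_dir = torch.randn(d_model)     # hypothetical refusal direction
    ablated = ablate_direction(hidden, refusal_dir)
    # After ablation, the projection onto the refusal direction is ~0,
    # so a single-pathway refusal signal vanishes.
    print(float(ablated @ (refusal_dir / refusal_dir.norm())))
    # A fail-closed check over several independent probes can still fire.
    probes = [refusal_dir, torch.randn(d_model), torch.randn(d_model)]
    print(fail_closed_refusal(ablated, probes))
```

The design point the sketch illustrates is redundancy: with one pathway, suppressing one direction is sufficient to bypass refusal; with several independent pathways, refusal only fails if every pathway is defeated at once.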