Yuntai Bao, Xuhong Zhang, Jintao Chen, Ge Su, Yuxiang Cai, Hao Peng, Bing Sun, Haiqin Weng, Liu Yan, Jianwei Yin
The paper introduces Concept DAS (CDAS), a novel intervention-based model steering method that uses distribution matching to achieve more faithful and stable control than traditional preference-optimization methods.
This research presents Concept DAS (CDAS), a technique for steering machine learning models that aims to improve control over model behavior while avoiding downsides of traditional methods, such as overfitting. Whereas previous approaches often enforce external preferences on the model, CDAS uses distributed interchange intervention to align model outputs more naturally. The technique is shown to be particularly effective in safety-related applications, such as overcoming unwanted model refusals or neutralizing harmful biases, while maintaining the model's overall utility. CDAS does not always outperform traditional methods, but it shows promise, especially as model size increases.
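The summary does not spell out how the intervention is computed, so the following is only a minimal sketch of a distributed interchange intervention as it is commonly formulated for distributed alignment search, not the authors' implementation. It assumes a hidden state of dimension `d` and a `k`-dimensional subspace given by a matrix `R` with orthonormal rows; the names `orthonormal_rows`, `distributed_interchange`, `h_base`, and `h_source` are hypothetical, and the random vectors stand in for real model activations.

```python
import numpy as np


def orthonormal_rows(d, k, seed=0):
    """Return a (k, d) matrix with orthonormal rows (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Reduced QR of a (d, k) Gaussian matrix gives q with orthonormal columns.
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return q.T


def distributed_interchange(h_base, h_source, R):
    """Swap the subspace component of h_base for that of h_source.

    Computes h' = h_base + R^T (R h_source - R h_base), leaving the part of
    h_base that is orthogonal to the subspace unchanged.
    """
    return h_base + (h_source @ R.T - h_base @ R.T) @ R


# Toy dimensions and random stand-ins for activations (assumptions).
d, k = 8, 2
R = orthonormal_rows(d, k)
rng = np.random.default_rng(1)
h_base = rng.standard_normal(d)
h_source = rng.standard_normal(d)

h_new = distributed_interchange(h_base, h_source, R)
```

In this sketch, `h_new` agrees with `h_source` inside the subspace spanned by the rows of `R` and with `h_base` everywhere else. In a learned setting `R` would be trained rather than sampled at random; how CDAS trains it is not described in this summary.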