Jyotin Goel, Souvik Maji, Pratik Mazumder
The paper introduces an adaptive regularization framework that maintains the safety of language models during fine-tuning without compromising their utility.
Language models that have been aligned to follow instructions safely can lose some of that safety when they are fine-tuned, and the degradation can be worse when the fine-tuning updates are adversarial. The study presents a training method that adapts its regularization to the estimated safety risk during fine-tuning, so that the model stays aligned with safety standards. Risk is predicted by either a judge-based system or a classifier, which allows safer updates while preserving the model's performance and adding no extra computational cost at inference time.
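The idea of risk-adaptive regularization can be illustrated with a minimal sketch. This is not the authors' formulation: the KL penalty against a frozen reference model, the linear scaling of the penalty weight with risk, and the names `adaptive_loss`, `kl_divergence`, and `lam_max` are all assumptions made for illustration. The summary only states that a judge or a classifier predicts risk and that the regularization adapts to it.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def adaptive_loss(task_loss, risk, tuned_probs, ref_probs, lam_max=1.0):
    """Task loss plus a risk-scaled penalty (hypothetical form, not from the paper).

    risk        -- score in [0, 1], assumed to come from a judge or a classifier
    tuned_probs -- next-token distribution of the model being fine-tuned
    ref_probs   -- next-token distribution of the frozen, safety-aligned reference
    lam_max     -- assumed upper bound on the regularization weight
    """
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must lie in [0, 1]")
    # Assumption: the penalty weight grows linearly with predicted risk.
    lam = lam_max * risk
    # Penalize drift of the fine-tuned distribution away from the reference.
    return task_loss + lam * kl_divergence(tuned_probs, ref_probs)


# A low-risk example is dominated by the task loss; a high-risk example
# is pulled more strongly toward the reference model's behavior.
benign = adaptive_loss(2.0, risk=0.1, tuned_probs=[0.9, 0.1], ref_probs=[0.5, 0.5])
risky = adaptive_loss(2.0, risk=0.9, tuned_probs=[0.9, 0.1], ref_probs=[0.5, 0.5])
```

In this sketch the risk score enters only the training loss, so the fine-tuned model needs neither the risk predictor nor the reference model at inference time, which is consistent with the claim of no added cost during use.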