János Kramár, Joshua Engels, Zheng Wang, Bilal Chughtai, Rohin Shah, Neel Nanda, Arthur Conmy
New probe architectures improve misuse mitigation for language models like Gemini by handling long-context inputs and remaining robust under distribution shift, improving both safety and efficiency.
As language models become more powerful, preventing their misuse becomes increasingly important. One approach is to use 'probes', lightweight classifiers trained on a model's internal activations, to detect harmful uses, but existing probes struggle when inputs differ substantially from the data they were trained on. This research introduces new probe designs that better handle long and complex inputs, making them more reliable in real-world applications. The study also shows that combining these probes with other techniques improves accuracy and efficiency, and reports a successful deployment in Google's Gemini model.
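For intuition, here is a minimal sketch of what an activation probe looks like in code: a linear classifier over pooled per-token activations. Everything here is an illustrative assumption, including the synthetic activations, the hidden size, and the simple mean-pooling used to reduce variable-length inputs to a fixed-size vector; it is not the architecture developed in the paper, which is specifically adapted for long contexts and distribution shift.

```python
# Minimal sketch of an activation probe: a linear classifier trained on a
# language model's internal activations. All data below is synthetic; a real
# probe would read activations from a chosen layer of the actual model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 64  # hidden size of the (hypothetical) layer being probed

def fake_activations(n_tokens: int, harmful: bool) -> np.ndarray:
    """Stand-in for per-token activations, shape (n_tokens, d_model)."""
    mean = 0.5 if harmful else -0.5
    return rng.normal(mean, 1.0, size=(n_tokens, d_model))

def pool(acts: np.ndarray) -> np.ndarray:
    # Mean-pool token activations into one fixed-size vector so the probe
    # accepts inputs of any length (one simple choice for long contexts).
    return acts.mean(axis=0)

# Build a tiny synthetic training set of pooled activations with labels
# (1 = harmful use, 0 = benign), with input lengths varying per example.
labels = rng.integers(0, 2, size=200)
X = np.stack([pool(fake_activations(int(rng.integers(10, 500)), bool(y)))
              for y in labels])
probe = LogisticRegression(max_iter=1000).fit(X, labels)

# Score a new long (synthetic) input: probability the probe flags misuse.
score = probe.predict_proba(pool(fake_activations(2000, True))[None])[0, 1]
print(f"misuse probability: {score:.3f}")
```

Because the probe only needs a forward pass the model is already running, scoring is far cheaper than invoking a separate classifier model, which is part of what makes this approach attractive for deployment at scale.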