Hyeseon An, Shinwoo Park, Hyundong Jin, Yo-Sub Han
This paper introduces a new method for steering language models via logit-level interventions, improving control over generated text without requiring retraining or access to the model's internal layers.
Language models, like those used in AI chatbots, often need to be guided to produce text that meets specific requirements, such as being polite or avoiding toxic language. Existing methods for this are limited: they either require access to the model's inner workings or provide too little control. This research presents a novel approach that statistically adjusts the model's output probabilities during text generation, without changing the model itself. Tests show the method is effective across tasks such as adjusting writing style and reducing toxicity, making it a versatile tool for improving AI communication.
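To make the idea of logit-level steering concrete, here is a minimal sketch. It is an assumed, generic form of the technique (additively biasing next-token logits before the softmax), not the paper's specific statistical adjustment; the token ids, logit values, and `strength` parameter are all hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def steer_logits(logits, boost_ids, penalty_ids, strength=2.0):
    # Illustrative logit-level steering (assumed form, not the paper's
    # exact method): raise the logits of desired tokens and lower the
    # logits of undesired ones before converting to probabilities.
    steered = list(logits)
    for i in boost_ids:
        steered[i] += strength
    for i in penalty_ids:
        steered[i] -= strength
    return steered

# Toy 5-token vocabulary with hypothetical raw logits from a model.
logits = [1.0, 0.5, 0.2, -0.3, 2.0]
before = softmax(logits)
after = softmax(steer_logits(logits, boost_ids=[1], penalty_ids=[4]))

print(after[1] > before[1])  # boosted token gains probability
print(after[4] < before[4])  # penalized token loses probability
```

Because the intervention touches only the output distribution at each decoding step, it needs no gradient updates or access to hidden layers, which is what makes this family of methods retraining-free.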