Tal Halevi, Yarden Tzach, Ronit D. Gross, Shalom Rosner, Ido Kanter
This study analyzes self-attention in BERT, revealing that individual attention heads focus on different linguistic features and develop contextual similarity between tokens, which shifts from long-range to short-range across the layers.
The self-attention mechanism, a key component of advanced language models such as BERT, helps machines understand and process language. This study explores how self-attention works by examining which parts of a text it focuses on. The researchers found that in BERT, attention heads in the final layers often concentrate on sentence separators, which could help segment text by meaning. In addition, different heads attend to different linguistic features, such as repeated words or common tokens, and the similarity they develop between tokens shifts from long-range to short-range as the model's layers progress.
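As a rough illustration of how such per-layer, per-head attention patterns can be examined, the sketch below requests attention weights from a pretrained BERT model and measures how much weight each head in each layer places on separator tokens. This is a minimal sketch, not the authors' analysis code; it assumes the Hugging Face transformers library, the bert-base-uncased checkpoint, and an illustrative choice of "." and "[SEP]" as separator tokens.

```python
# Minimal sketch (not the authors' code): inspect per-layer, per-head attention
# to separator tokens in BERT, assuming the Hugging Face transformers library.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

text = "The cat sat on the mat. The dog slept on the rug."
inputs = tokenizer(text, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
attentions = outputs.attentions

# Positions of separator tokens ("." and "[SEP]") -- an illustrative choice.
sep_positions = [i for i, t in enumerate(tokens) if t in (".", "[SEP]")]

for layer_idx, layer_attn in enumerate(attentions):
    # Sum the attention mass each head directs at separator positions,
    # then average over all query positions.
    head_mass_on_sep = layer_attn[0, :, :, sep_positions].sum(dim=-1).mean(dim=-1)
    print(f"layer {layer_idx:2d}: mean attention to separators per head =",
          [round(v, 3) for v in head_mass_on_sep.tolist()])
```

The same loop could be pointed at other token categories (for example, repeated words or frequent tokens) to build a comparable per-head profile across layers.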