Soma Sato, Ryohei Sasano
This paper investigates how language models acquire character-level information, identifying key factors related to tokenization and semantic associations.
Language models are trained to predict and generate text over subword tokens rather than individual characters, yet they appear to pick up character-level details, such as which letters a word contains, even though they are never explicitly trained on spelling. To understand how this happens, the researchers compared models trained with different settings and tokenizers. They found that the way words are broken into smaller pieces (tokenization) and the meanings associated with those pieces both play significant roles in how models acquire character-level information. The study sheds light on what language models learn beyond what they are directly trained to do.
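To make the tokenization point concrete, the following is a minimal sketch of how a subword tokenizer presents text to a model. It assumes the Hugging Face `transformers` library and the publicly available GPT-2 tokenizer, chosen purely for illustration; the models and tokenizers studied in the paper may differ.

```python
# Minimal sketch: how a subword tokenizer hides character-level structure.
# Assumes the Hugging Face `transformers` library and the GPT-2 tokenizer,
# used here only as an illustrative example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

word = "unbelievable"
pieces = tokenizer.tokenize(word)               # subword pieces, not letters
ids = tokenizer.convert_tokens_to_ids(pieces)   # what the model actually receives

print(pieces)  # exact pieces depend on the tokenizer's learned vocabulary
print(ids)     # the model sees only these integer IDs, one per subword
```

Because the model only ever receives such subword IDs, any knowledge about individual characters (for example, whether a word contains a particular letter) must be inferred indirectly, which is the kind of acquisition the paper investigates.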