Sachin Pawar, Manoj Apte, Kshitij Jadhav, Girish Keshav Palshikar, Nitin Ramrakhiyani
The paper investigates whether splitting natural words into multiple tokens during LLM tokenization degrades performance on NLP tasks.
When large language models (LLMs) process text, it is first split into tokens from the model's vocabulary, which can break natural words into several pieces. This study examines whether such fragmentation affects the model's ability to understand and generate language. The authors propose a measure of how 'bad' a given tokenization is and find that poor tokenization can indeed harm performance. They test this hypothesis across several models and tasks and support it with statistical evidence.
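To make the phenomenon concrete, the sketch below shows how a subword tokenizer can split a single natural word into multiple tokens. It uses the Hugging Face `transformers` GPT-2 tokenizer purely for illustration, and the `fragmentation` score (tokens per word) is a hypothetical stand-in, not the measure proposed in the paper.

```python
# Illustration only: how subword tokenization can fragment natural words.
# The fragmentation() score here is a simple assumed metric, not the
# paper's proposed measure of tokenization quality.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any subword tokenizer works

def fragmentation(word: str) -> int:
    """Number of tokens the tokenizer uses for a single word (1 = word kept intact)."""
    return len(tokenizer.tokenize(word))

for word in ["language", "tokenization", "hydroxychloroquine"]:
    pieces = tokenizer.tokenize(word)
    print(f"{word!r} -> {pieces} ({fragmentation(word)} token(s))")
# Common words often remain a single token, while rarer or longer words are
# split into several subword pieces -- the kind of splitting the paper studies.
```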