Jonathan Roberts, Kai Han, Samuel Albanie
This study analyzes the variability in tokenization across different models and text domains, revealing that token length heuristics are often overly simplistic.
For large language models (LLMs), tokens are the basic unit used to measure and compare model inputs and outputs. However, tokenization, the process that converts text into these tokens, can differ substantially between models and across types of text. This study examines how tokenization varies and finds that common rules of thumb about token lengths are often too simple. Understanding these differences makes it easier to interpret and compare the performance and costs of different language models.
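To make this variation concrete, the minimal sketch below (not taken from the study) counts tokens for a few sample strings under several tokenizers from the `tiktoken` library. The encoding names and example texts are illustrative assumptions, but they show how the same text can yield noticeably different token counts depending on the tokenizer and the domain of the text.

```python
# Illustrative sketch (not from the paper): compare token counts for the same
# strings under different tokenizers to show how much tokenization can vary.
# Requires the `tiktoken` package; encoding names and sample texts are
# illustrative choices, not ones taken from the study.
import tiktoken

SAMPLES = {
    "english prose": "Large language models convert text into tokens before processing it.",
    "python code": "def add(a, b):\n    return a + b",
    "numbers": "3.14159 2.71828 1.41421 1.61803",
}

# Tokenizers associated with different model families (illustrative selection).
ENCODINGS = ["gpt2", "p50k_base", "cl100k_base"]

for name, text in SAMPLES.items():
    counts = {}
    for enc_name in ENCODINGS:
        enc = tiktoken.get_encoding(enc_name)
        counts[enc_name] = len(enc.encode(text))
    # The same text maps to different token counts per tokenizer, so a single
    # fixed "tokens per word" heuristic is a rough approximation at best.
    words = len(text.split())
    print(f"{name} ({words} words): {counts}")
```

The same comparison could be extended to open-weight models with Hugging Face's `AutoTokenizer`, which would likely show further divergence, particularly on code and numeric text.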