Broken Words, Broken Performance: Effect of Tokenization on Performance of LLMs

Source: arXiv

Sachin Pawar, Manoj Apte, Kshitij Jadhav, Girish Keshav Palshikar, Nitin Ramrakhiyani

cs.CL | Dec 26, 2025

One-line Summary

The paper investigates how tokenization that breaks natural words into multiple tokens degrades the performance of large language models (LLMs) on NLP tasks.

Plain-language Overview

When a large language model (LLM) processes text, the text is first split into tokens from the model's vocabulary, and this splitting can break natural words into multiple pieces. This study examines whether such fragmentation hurts the model's ability to perform tasks like language understanding and text generation. The authors propose a way to measure how 'bad' a tokenization is and find that poor tokenization does harm model performance. They test this hypothesis across a range of models and tasks and support it with statistical evidence.
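To make the idea of measuring tokenization quality concrete, here is a minimal sketch of one such measure: the average number of tokens per word (often called fertility) and the fraction of words split into more than one piece. This is an illustrative assumption, not the paper's own metric; the model name and the function below are hypothetical choices for the example.

```python
# Illustrative sketch (not the paper's metric): score how often a
# tokenizer fragments whole words, using a HuggingFace tokenizer.
from transformers import AutoTokenizer


def fragmentation_stats(words, model_name="gpt2"):
    """Return average tokens per word and the fraction of broken words."""
    tok = AutoTokenizer.from_pretrained(model_name)
    pieces_per_word = []
    for w in words:
        # Prepend a space so the word is tokenized as it would appear
        # mid-sentence in byte-level BPE vocabularies like GPT-2's.
        ids = tok.encode(" " + w, add_special_tokens=False)
        pieces_per_word.append(len(ids))
    fertility = sum(pieces_per_word) / len(pieces_per_word)
    broken = sum(1 for n in pieces_per_word if n > 1) / len(pieces_per_word)
    return {"avg_tokens_per_word": fertility, "fraction_broken": broken}


if __name__ == "__main__":
    sample = ["the", "language", "tokenization", "hippopotamus"]
    print(fragmentation_stats(sample))
```

Running this on a word list would typically show common words mapping to a single token while rare or morphologically complex words fragment into several, which is the kind of pattern the paper relates to performance drops.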

Technical Details