Hannah Atmer, Yuan Yao, Thiemo Voigt, Stefanos Kaxiras
The study explores how SRAM size and operating frequency trade off in the energy efficiency and performance of Large Language Model (LLM) inference, identifying configurations that minimize energy use and latency.
This research investigates how different hardware configurations affect the energy efficiency and speed of running large language models (LLMs). Specifically, it examines how the size of the on-chip memory (SRAM) and the processor's operating frequency influence these factors. The study finds that larger on-chip memory increases energy use without significantly improving speed, while higher operating frequencies can reduce energy use by shortening the time the system remains active. The research identifies an optimal setup that balances these factors to achieve both energy efficiency and fast processing, which is particularly useful for data centers seeking to reduce energy costs.
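To illustrate the intuition behind the summary above, the following is a minimal sketch (not taken from the paper) of a first-order energy model in which total energy is static power over the active period plus dynamic energy per operation; the function name and all parameter values are hypothetical.

```python
# Illustrative first-order energy model (all values hypothetical, not from the paper).
# Total energy = static leakage over the active period + dynamic switching energy,
# so finishing sooner at a higher frequency can shrink the static component.

def inference_energy_joules(num_ops, freq_hz, static_power_w, energy_per_op_j):
    """Estimate energy for one inference pass.

    num_ops         -- total operations in the workload (hypothetical count)
    freq_hz         -- operating frequency; higher frequency shortens active time
    static_power_w  -- leakage power drawn while the chip is active
                       (would grow with SRAM size in this toy model)
    energy_per_op_j -- dynamic energy per operation
    """
    active_time_s = num_ops / freq_hz           # shorter at higher frequency
    static_energy = static_power_w * active_time_s
    dynamic_energy = energy_per_op_j * num_ops  # roughly frequency-independent
    return static_energy + dynamic_energy


if __name__ == "__main__":
    # Compare two hypothetical operating points for the same workload.
    ops = 1e9
    for freq in (0.5e9, 2.0e9):
        e = inference_energy_joules(ops, freq, static_power_w=2.0, energy_per_op_j=1e-9)
        print(f"{freq / 1e9:.1f} GHz -> {e:.2f} J")
```

In this toy model the 2.0 GHz point uses less total energy than the 0.5 GHz point purely because the static energy is amortized over a shorter active period, mirroring the qualitative trend described above.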