Gregory Bolet, Giorgis Georgakoudis, Konstantinos Parasyris, Harshitha Menon, Niranjan Hasabnis, Kirk W. Cameron, Gal Oren
The gpuFLOPBench benchmark evaluates the ability of large language models (LLMs) to predict floating-point operation (FLOP) counts for CUDA kernels, highlighting the difficulty they have reasoning about code complexity without executing the code.
Developers working with GPUs need to anticipate how software will perform before running it, especially for computationally intensive code. This research introduces a new benchmark, gpuFLOPBench, to test how well large language models (LLMs) can predict the computational workload of GPU code without actually executing it. The study finds that while LLMs handle simple cases well, they struggle in more complex scenarios where the workload depends on factors that are not apparent from the source code alone. This highlights a current limitation of AI tools in understanding the intricacies of GPU performance.
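To make the task concrete, here is a minimal, hypothetical kernel of the kind such a benchmark targets (not taken from gpuFLOPBench itself): for a simple element-wise operation the FLOP count can be read directly from the source, whereas kernels with data-dependent branches or loop bounds hide that count from purely static reasoning.

```cuda
// Illustrative sketch only: a hypothetical SAXPY-style kernel.
// For a launch that covers all n elements, the FLOP count follows
// statically from the code: one multiply and one add per element,
// i.e. 2 * n FLOPs in total.
__global__ void axpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];  // 2 floating-point operations per thread
    }
}
```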