Medhasweta Sen, Zachary Gottesman, Jiaxing Qiu, C. Bayan Bruss, Nam Nguyen, Tom Hartvigsen
BEDTime is a unified benchmark for evaluating time series description models. Its results highlight the need for specialized architectures and reveal the strengths and weaknesses of different model types.
The paper introduces BEDTime, a new benchmark designed to evaluate how well different models can describe time series data using natural language. It focuses on three tasks: recognizing whether a statement about a time series is true or false, choosing the correct description from multiple options, and generating an open-ended description. The study finds that models specifically designed for time series data tend to perform better than general language models, though there is still room for improvement. This benchmark helps researchers compare models more directly and understand which features contribute to their performance.
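To make the three task formats concrete, the following is a minimal Python sketch of how such items might be represented and how the two closed-form tasks could be scored by accuracy. Everything in it is an illustrative assumption rather than the benchmark's actual schema or API: the class names (TrueFalseItem, MultipleChoiceItem, GenerationItem), the field names, the toy data, and the keyword-based baseline are all hypothetical. The sketch does not score the open-ended generation task, since the summary above does not say how it is evaluated.

```python
from dataclasses import dataclass
from typing import Sequence


# Hypothetical item types, one per task described above (not the paper's schema).
@dataclass
class TrueFalseItem:
    series: Sequence[float]
    statement: str
    label: bool  # True if the statement correctly describes the series


@dataclass
class MultipleChoiceItem:
    series: Sequence[float]
    options: Sequence[str]
    answer_index: int  # position of the correct description in `options`


@dataclass
class GenerationItem:
    series: Sequence[float]
    reference: str  # a reference description; scoring is left unspecified here


def accuracy(predictions: Sequence, targets: Sequence) -> float:
    """Fraction of predictions that exactly match their targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


def statement_matches_trend(series: Sequence[float], statement: str) -> bool:
    """Toy baseline (an assumption): compare the endpoint trend with a keyword."""
    rising = series[-1] > series[0]
    claims_rising = "increas" in statement.lower()
    return rising == claims_rising


def pick_option(item: MultipleChoiceItem) -> int:
    """Return the first option the toy baseline accepts, defaulting to 0."""
    for index, option in enumerate(item.options):
        if statement_matches_trend(item.series, option):
            return index
    return 0


# Made-up toy data, for illustration only.
tf_items = [
    TrueFalseItem([1.0, 2.0, 3.0], "The series increases over time.", True),
    TrueFalseItem([1.0, 2.0, 3.0], "The series decreases over time.", False),
    TrueFalseItem([3.0, 2.0, 1.0], "The series increases over time.", False),
]
mc_item = MultipleChoiceItem(
    [3.0, 2.0, 1.0],
    ["The series increases over time.", "The series decreases over time."],
    answer_index=1,
)

tf_predictions = [statement_matches_trend(i.series, i.statement) for i in tf_items]
tf_accuracy = accuracy(tf_predictions, [i.label for i in tf_items])
mc_accuracy = accuracy([pick_option(mc_item)], [mc_item.answer_index])
```

In a real evaluation, the toy baseline would be replaced by calls to the model under test. Keeping the item types separate from the scoring function is one way to apply the same accuracy computation to both closed-form tasks.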