Jie Yang, Honglin Guo, Li Ji, Jiazheng Zhou, Rui Zheng, Zhikai Lei, Shuo Zhang, Zhiheng Xi, Shichun Liu, Yuxin Wang, Bo Wang, Yining Zheng, Tao Gui, Xipeng Qiu
ABC-Bench is a new benchmark for evaluating AI models on realistic, end-to-end backend development tasks; its results show that current models struggle with these comprehensive challenges.
AI models have advanced to the point where they can act as autonomous coding agents, yet most benchmarks evaluate them only in simple, static scenarios. ABC-Bench addresses this gap by testing models on realistic backend development tasks that span the full development lifecycle, from setting up environments to deploying services. The benchmark comprises 224 tasks derived from real-world open-source projects, and the results show that even the strongest current models perform poorly on these complex, end-to-end tasks. This exposes a substantial gap between what AI models can do today and the demands of real software engineering work.