Yafeng Tang, Xiaoou Ding, Jianzhuo Du, Zishuo Yan, Zhuang Ma, Zheng Liang, Zekai Qian, Hongzhi Wang
The Diversity-Aware Tabular data gEnerator (DATE) framework partitions tabular data into diverse subsets and uses LLMs with decision tree reasoning to generate high-quality data for each subset, outperforming existing methods.
Generating high-quality tabular data is crucial for machine learning, but real-world data often follow diverse distributions, which makes generation challenging. The DATE framework addresses this by dividing the data into distinct subsets and using large language models (LLMs) to generate data for each subset. This approach balances the diversity and quality of the generated data better than existing methods. Experiments show that DATE significantly reduces error rates and improves the performance of machine learning models.
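The partition-then-generate idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not DATE's actual implementation: the summary does not say how subsets are formed or how decision tree reasoning is passed to the LLM. The sketch groups rows by a column value as a stand-in for partitioning, fits a depth-1 decision tree (a decision stump) on each subset, and formats the resulting rule plus example rows as a prompt string. The function names `partition_rows`, `fit_stump`, and `build_prompt` are hypothetical, and no LLM is called.

```python
from collections import Counter, defaultdict


def partition_rows(rows, key):
    """Group rows by the value of `key` (a stand-in for a real partitioning step)."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[key]].append(row)
    return dict(subsets)


def majority(labels):
    """Return the most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, feature, target):
    """Fit a depth-1 decision tree on one numeric feature.

    Returns (errors, threshold, left_label, right_label), or None when the
    feature has fewer than two distinct values in this subset.
    """
    values = sorted({row[feature] for row in rows})
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2  # midpoint between adjacent distinct values
        left = [r[target] for r in rows if r[feature] <= threshold]
        right = [r[target] for r in rows if r[feature] > threshold]
        left_label, right_label = majority(left), majority(right)
        errors = sum(lab != left_label for lab in left)
        errors += sum(lab != right_label for lab in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return best


def build_prompt(name, rows, feature, target):
    """Format a subset's rule and rows as text that could be sent to an LLM."""
    lines = [f"Subset: {name} ({len(rows)} rows)"]
    stump = fit_stump(rows, feature, target)
    if stump is not None:
        _, threshold, left_label, right_label = stump
        lines.append(
            f"Rule: if {feature} <= {threshold} then {target} = {left_label}, "
            f"else {target} = {right_label}"
        )
    lines.append("Examples:")
    lines.extend(str(row) for row in rows)
    lines.append("Generate new rows consistent with the rule and examples.")
    return "\n".join(lines)
```

Calling `build_prompt` once per subset returned by `partition_rows` yields one prompt per partition, so each generation request only has to describe a single, more homogeneous slice of the table. A real system would likely use a learned partitioning and deeper trees; the stump is used here only to keep the sketch dependency-free.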