Towards Synthesizing High-Dimensional Tabular Data with Limited Samples

Zuqing Li, Junhao Gan, Jianzhong Qi

arXiv
March 9, 2025

Diffusion-based tabular data synthesis models have yielded promising results. However, when the data dimensionality increases, existing models tend to degenerate and may perform even worse than simpler, non-diffusion-based models. This is because limited training samples in high-dimensional space often hinder generative models from capturing the distribution accurately. To mitigate the insufficient learning signals and to stabilize training under such conditions, we propose CtrTab, a condition-controlled diffusion model that injects perturbed ground-truth samples as auxiliary inputs during training. This design introduces an implicit L2 regularization on the model's sensitivity to the control signal, improving robustness and stability in high-dimensional, low-data scenarios. Experimental results across multiple datasets show that CtrTab outperforms state-of-the-art models, with an average performance gap of more than 90%.
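The link between perturbing an auxiliary input and an implicit L2 penalty can be checked numerically on a toy case. The sketch below is not CtrTab and does not reproduce the paper's architecture or loss; it assumes a purely linear predictor `f(c) = w . c` on a control vector `c`, Gaussian noise of standard deviation `sigma` added to `c`, and a squared-error loss. All names (`noisy_control_loss`, `clean_loss_plus_l2`, `w`, `c`, `sigma`) are hypothetical. Under those assumptions, the expected loss on the perturbed input equals the clean loss plus `sigma**2` times the squared norm of the model's sensitivity `df/dc = w`.

```python
import numpy as np


def noisy_control_loss(w, c, y, sigma, n_draws, rng):
    """Monte Carlo estimate of E[(w.(c + eps) - y)^2], eps ~ N(0, sigma^2 I)."""
    eps = rng.normal(0.0, sigma, size=(n_draws, c.shape[0]))
    preds = (c + eps) @ w  # one prediction per perturbed copy of c
    return float(np.mean((preds - y) ** 2))


def clean_loss_plus_l2(w, c, y, sigma):
    """Closed form: clean squared error + sigma^2 * ||df/dc||^2, with df/dc = w."""
    return float((c @ w - y) ** 2 + sigma**2 * np.sum(w**2))


# Toy setup (assumed values, chosen only for illustration).
rng = np.random.default_rng(0)
d = 8
w = rng.normal(size=d)  # weights of the hypothetical linear model
c = rng.normal(size=d)  # clean control signal
y = 0.5                 # regression target
sigma = 0.3             # perturbation scale

mc = noisy_control_loss(w, c, y, sigma, n_draws=200_000, rng=rng)
closed = clean_loss_plus_l2(w, c, y, sigma)
print(f"Monte Carlo: {mc:.4f}  closed form: {closed:.4f}")
```

The two printed values should agree up to Monte Carlo error. For a nonlinear model the same identity holds only approximately, to first order in the perturbation, so this should be read as intuition for the regularization claim rather than as a derivation of the paper's result.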