A Survey on Probabilistic Modeling of Data: From Traditional to Modern
Authors: Lu Hongtao (卢宏涛), Hu Yuting (胡宇庭)
Affiliation:

School of Computer Science, Shanghai Jiao Tong University, Shanghai 200240, China

Fund Project:

National Natural Science Foundation of China (No. 62176155).

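    Abstract (translated from Chinese):

    Artificial intelligence is advancing rapidly, and its models, algorithms, and application areas attract considerable attention. Probabilistic modeling of data is a core problem in artificial intelligence and machine learning, yet it generally receives little attention. This is partly because the theory of probabilistic modeling is abstract and partly because few surveys cover it. However, most original breakthroughs in artificial intelligence are related to probabilistic modeling of data. This survey therefore takes probabilistic modeling of data as its main thread and reviews mainstream machine learning methods from traditional to modern. It places traditional methods such as Gaussian mixture models, the expectation-maximization (EM) algorithm, and variational inference, together with modern methods such as variational autoencoders, generative adversarial networks, score matching, diffusion models, normalizing flows, and flow matching, under a single probabilistic modeling framework. Although these methods were proposed over a long time span and address different problems, all of them can be interpreted within the maximum likelihood estimation or score matching framework; they differ in the assumptions they make about the data and the model. The survey thus builds a unified understanding that spans traditional machine learning and the latest generative models, classifies probabilistic modeling methods into maximum-likelihood-based, score-matching-based, and flow-based methods, reveals the intrinsic connections among them, and offers an interpretation of the theoretical foundations for the further development of AI generative methods.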

    Abstract:

    Probabilistic modeling of data is at the core of machine learning and modern generative AI. This survey reviews the methodological evolution from traditional statistical formulations to recent deep generative frameworks under a unified view of probability distribution learning. Representative methods are organized into three connected routes: maximum-likelihood-based modeling, score-matching-based modeling, and flow-based modeling. On the traditional side, the survey revisits Gaussian assumptions, Gaussian mixture models, expectation-maximization (EM) algorithms, and variational inference, emphasizing how tractability-flexibility trade-offs shape model design. On the modern side, it discusses variational autoencoders (VAEs), generative adversarial network (GAN)-related generative mechanisms, diffusion probabilistic models, score-based stochastic differential equation (SDE) formulations, normalizing flows, and flow matching, with a focus on objective functions, parameterization choices, and sampling dynamics. A structured comparison is provided from the perspectives of explicit likelihood, trajectory modeling, computational efficiency, controllability, and deployment stability. To bridge methodology and practice, the paper summarizes benchmark-oriented observations and application trends in image generation, video and audio synthesis, inverse problems, and science-and-control scenarios. It also identifies practical bottlenecks, including dependence on high-quality large-scale data, limited semantic operability of latent representations, and inference latency caused by multi-step sampling. Finally, future directions are discussed around coordinated advances in path design, training objectives, numerical solvers, and guidance strategies, together with unified evaluation of quality, efficiency, safety, and compliance for trustworthy large-scale deployment.

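Cite this article

Lu Hongtao, Hu Yuting. A Survey on Probabilistic Modeling of Data: From Traditional to Modern [J]. Journal of Data Acquisition and Processing (数据采集与处理), 2026, (2): 461-488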

History
  • Received: 2026-01-30
  • Last revised: 2026-03-08
  • Accepted:
  • Published online: 2026-04-15