Word Embedding: Continuous Space Representation for Natural Language

Abstract:

Word embedding refers to a machine learning technique that maps each word, originally represented in a high-dimensional discrete space (whose dimension equals the vocabulary size), to a real-valued vector in a low-dimensional continuous space. In many text processing tasks, word embeddings provide word features with richer semantics and thus bring substantial benefits. Meanwhile, the huge amounts of unlabeled text available in the big-data era, together with advances in machine learning techniques such as deep learning, make it possible to learn high-quality word embeddings efficiently. This paper gives the definition and practical value of word embedding and reviews several typical methods for obtaining word embeddings, including neural network based methods, restricted Boltzmann machine based methods, and methods based on factorization of the word-context co-occurrence matrix. For each model, the mathematical definition, intuitive interpretation, and training procedure are described in detail, and the models are compared along these three aspects.
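
To make the last family of methods concrete, the following Python sketch builds a word-context co-occurrence matrix from a toy corpus and factorizes it with a truncated SVD to obtain low-dimensional word vectors. This is a minimal illustration, not the implementation reviewed in the paper; the toy corpus, the window size, and the embedding dimension k are assumptions chosen for readability.

    import numpy as np

    # Toy corpus; in practice this would be a large collection of unlabeled text.
    corpus = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "the cat chased the dog",
    ]

    # Build the vocabulary: each word is one axis of the high-dimensional discrete space.
    tokens = [sentence.split() for sentence in corpus]
    vocab = sorted({w for sent in tokens for w in sent})
    index = {w: i for i, w in enumerate(vocab)}

    # Count word-context co-occurrences within a symmetric window of +/-2 words.
    window = 2
    C = np.zeros((len(vocab), len(vocab)))
    for sent in tokens:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    C[index[w], index[sent[j]]] += 1.0

    # Factorize the co-occurrence matrix with a truncated SVD and keep the top-k
    # directions: each word gets one k-dimensional continuous vector.
    k = 3
    U, S, _ = np.linalg.svd(C, full_matrices=False)
    embeddings = U[:, :k] * S[:k]

    for word, vec in zip(vocab, np.round(embeddings, 3)):
        print(word, vec)

The neural network based and restricted Boltzmann machine based methods reviewed in the paper replace the explicit factorization step with a learned predictive model, but the output has the same form: one dense, low-dimensional real vector per word.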

Cite this article:

陈恩红, 丘思语, 许畅, 田飞, 刘铁岩. 单词嵌入-自然语言的连续空间表示[J]. 数据采集与处理, 2014, 29(1): 19-29.

Online publication date: 2014-03-14