基于预训练与音素字节对编码的越南语识别
作者:
作者单位:

中国科学技术大学电子工程与信息科学系,合肥 230027

作者简介:

通讯作者:

基金项目:

国家自然科学基金(U1836219)。


Vietnamese Speech Recognition Based on Pre-training and Phone-Based Byte-Pair Encoding
Author:
Affiliation:

Department of Electronic Engineering & Information Science, University of Science and Technology of China, Hefei 230027, China

Fund Project:

  • 摘要
  • |
  • 图/表
  • |
  • 访问统计
  • |
  • 参考文献
  • |
  • 相似文献
  • |
  • 引证文献
  • |
  • 资源附件
    摘要:

    基于无监督预训练技术的wav2vec 2.0在许多低资源语种上获得了良好的性能,成为研究的热点。本文在预训练模型的基础上进行越南语连续语音识别。将语音学信息引入到基于链接时序分类代价函数(Connectionist temporal classification,CTC)的声学建模中,选取音素与含位置信息的音素作为基础单元。为了平衡建模单元数目以及模型的精细程度,采用字节对编码(Byte-pair encoding,BPE)算法生成音素子词,将上下文信息结合到声学建模过程。实验在美国NIST的BABEL任务低资源的越南语开发集上进行,所提算法相对wav2vec 2.0基线系统有明显改进,识别词错误率由37.3%降低到29.4%。

    Abstract:

    Based on the unsupervised pre-training technology, wav2vec 2.0 has become a research hotspot for the state of the art performance in many low-resource languages. In this paper, the Vietnamese continuous speech recognition is carried out on the basis of the pre-trained model. The phonetics information is integrated into the connectionist temporal classification (CTC) loss function based acoustic modeling, and the phones and the position dependent phones are selected as the basic modeling units. To balance the number of modeling units and the refinement of the model, a byte-pair encoding (BPE) algorithm is used to generate phone based subwords, and the contextual information is integrated into the acoustic modeling process. Experiments are carried out on the low-resource Vietnamese development set of NIST’s BABEL task, and the proposed algorithm significantly improves the wav2vec 2.0 baseline system. The word error rate is reduced from 37.3% to 29.4%.

    参考文献
    相似文献
    引证文献
引用本文

沈之杰,郭武.基于预训练与音素字节对编码的越南语识别[J].数据采集与处理,2023,38(1):101-110

复制
分享
文章指标
  • 点击次数:
  • 下载次数:
历史
  • 收稿日期:2021-07-27
  • 最后修改日期:2021-12-27
  • 录用日期:
  • 在线发布日期: 2023-01-25