Vietnamese Speech Recognition Based on Pre-training and Phone-Based Byte-Pair Encoding
Author: SHEN Zhijie, GUO Wu
Affiliation:

Department of Electronic Engineering & Information Science, University of Science and Technology of China, Hefei 230027, China

CLC Number:

TN912.34

    Abstract:

    wav2vec 2.0, which is built on unsupervised pre-training, has become a research focus because it achieves state-of-the-art performance on many low-resource languages. This paper builds a Vietnamese continuous speech recognition system on top of the pre-trained model. Phonetic information is integrated into acoustic modeling based on the connectionist temporal classification (CTC) loss function, with phones and position-dependent phones selected as the basic modeling units. To balance the number of modeling units against the granularity of the model, a byte-pair encoding (BPE) algorithm is used to generate phone-based subwords, which brings contextual information into the acoustic modeling process. Experiments on the low-resource Vietnamese development set of NIST’s BABEL task show that the proposed method significantly improves on the wav2vec 2.0 baseline system, reducing the word error rate from 37.3% to 29.4% (7.9 points absolute).
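
    The phone-based BPE step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lexicon, its phone symbols and counts, the function names, and the "+" joiner used to name merged units are all assumptions made for illustration. The sketch repeatedly merges the most frequent adjacent pair of units, so frequent phone sequences become single subword units while rare ones remain split into phones.

```python
from collections import Counter


def merge_pair(units, pair):
    """Replace every adjacent occurrence of `pair` in `units` with one merged unit."""
    merged = []
    i = 0
    while i < len(units):
        if i + 1 < len(units) and (units[i], units[i + 1]) == pair:
            # "+" is an illustrative joiner, not a convention from the paper.
            merged.append(units[i] + "+" + units[i + 1])
            i += 2
        else:
            merged.append(units[i])
            i += 1
    return tuple(merged)


def learn_phone_bpe(lexicon, num_merges):
    """Learn BPE merges over phone sequences.

    lexicon maps a word to (list of phones, occurrence count).
    Returns the ordered merge list and the segmented lexicon.
    """
    # Initially every modeling unit is a single phone.
    corpus = {w: (tuple(phones), count) for w, (phones, count) in lexicon.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent unit pairs, weighted by word frequency.
        pair_counts = Counter()
        for units, count in corpus.values():
            for a, b in zip(units, units[1:]):
                pair_counts[(a, b)] += count
        if not pair_counts:
            break  # every word is already a single unit
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = {w: (merge_pair(units, best), c) for w, (units, c) in corpus.items()}
    return merges, {w: units for w, (units, _) in corpus.items()}


# Hypothetical toy lexicon: placeholder words, phones, and counts.
toy_lexicon = {
    "w1": (["b", "a", "n"], 5),
    "w2": (["l", "a", "n"], 3),
    "w3": (["b", "a"], 2),
}
merges, segmented = learn_phone_bpe(toy_lexicon, num_merges=2)
print(merges)
print(segmented)
```

    The number of merges controls the trade-off named in the abstract: fewer merges keep the unit inventory small and close to plain phones, while more merges yield longer units that carry more context at the cost of a larger inventory.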

Get Citation

SHEN Zhijie, GUO Wu. Vietnamese Speech Recognition Based on Pre-training and Phone-Based Byte-Pair Encoding[J].,2023,38(1):101-110.

History
  • Received: July 27, 2021
  • Revised: December 27, 2021
  • Adopted:
  • Online: January 25, 2023
  • Published: