Built on unsupervised pre-training, wav2vec 2.0 has become a research hotspot because of its state-of-the-art performance in many low-resource languages. In this paper, Vietnamese continuous speech recognition is carried out on the basis of the pre-trained wav2vec 2.0 model. Phonetic information is integrated into acoustic modeling based on the connectionist temporal classification (CTC) loss function, with phones and position-dependent phones selected as the basic modeling units. To balance the number of modeling units against the granularity of the model, a byte-pair encoding (BPE) algorithm is used to generate phone-based subwords, which integrates contextual information into the acoustic modeling process. Experiments are carried out on the low-resource Vietnamese development set of NIST’s BABEL task, and the proposed method significantly improves on the wav2vec 2.0 baseline system, reducing the word error rate from 37.3% to 29.4% (7.9 percentage points absolute).
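
The idea of phone-based subwords can be illustrated with a minimal sketch of BPE applied to phone sequences instead of characters. This is not the implementation used in the paper: the function names (`learn_phone_bpe`, `apply_phone_bpe`), the `_` joiner, the tie-breaking rule, and the toy phone symbols are all assumptions made for illustration, and merges are assumed to stay within a single word's pronunciation.

```python
from collections import Counter


def count_pairs(seqs):
    """Count adjacent phone-unit pairs, weighted by sequence frequency."""
    pairs = Counter()
    for seq, freq in seqs.items():
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(seq, pair, joiner="_"):
    """Replace every non-overlapping occurrence of `pair` with one merged unit."""
    merged = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + joiner + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return tuple(merged)


def learn_phone_bpe(pronunciations, num_merges):
    """Learn up to `num_merges` merge rules from a list of phone sequences.

    `pronunciations` holds one phone list per word token (hypothetical input
    format). Ties are broken by taking the lexicographically smallest pair,
    an arbitrary choice that keeps the result deterministic.
    """
    seqs = Counter(tuple(p) for p in pronunciations)
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(seqs)
        if not pairs:
            break  # every sequence has collapsed to a single unit
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        new_seqs = Counter()
        for seq, freq in seqs.items():
            new_seqs[merge_pair(seq, best)] += freq
        seqs = new_seqs
    return merges


def apply_phone_bpe(phones, merges):
    """Segment one phone sequence into subword units using learned merges."""
    seq = tuple(phones)
    for pair in merges:
        seq = merge_pair(seq, pair)
    return list(seq)


if __name__ == "__main__":
    # Toy pronunciations with made-up phone symbols, not real Vietnamese data.
    toy = [["t", "a"], ["t", "a", "n"], ["b", "a", "n"], ["t", "a", "n"]]
    merges = learn_phone_bpe(toy, num_merges=2)
    print(merges)
    print(apply_phone_bpe(["b", "a", "n"], merges))
```

In a sketch like this, the number of merges controls the trade-off the abstract describes: with zero merges the CTC output units are plain phones, and each additional merge adds one longer, more context-specific unit to the inventory.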