Training Low-Resource Language Models for Ethnic Minorities Based on Large Language Models
Author: Li Rende1,2, Feng Sumin1, Tian Jintai3*
Affiliation: 1. University of Shanghai for Science and Technology; 2. Shanghai Xijin Information Technology Co., Ltd.

    Abstract:

    [Objective] Minority languages such as Tibetan, Uyghur, and Mongolian face sparse vocabulary representations, missing cultural knowledge, and difficult speech-text alignment when used with large language models. This paper explores effective training methods based on small-sample data in low-resource settings, with the aim of addressing challenges such as overlapping relations and nested entities.
    [Methods] This paper proposes MCT-3 (Minority Culture-aware Training with Triple-space fusion), a training method for minority-language large models based on fusing the "knowledge-speech-text" three spaces. The model architecture comprises a Knowledge Injector (K-Adapter), a Speech-Text Alignment Encoder (SJ-Encoder), and a Culture-Sensitive Decoder (CS-Decoder). Knowledge-prior injection supplements ethnic cultural semantic information and converts it into systematic knowledge; dual-granularity alignment learning maps speech precisely to text; and a reinforcement reward mechanism keeps generated content culturally appropriate. Together, these enable high-quality minority-language understanding and generation with very little annotated data.
    [Results] In experiments on the CSTR-MinorASR dataset using only 3 hours of labeled speech data, MCT-3 achieves an average word error rate (WER) of 16.0% across Tibetan, Uyghur, and Mongolian, which is 18.5 and 8.1 percentage points lower, respectively, than the traditional speech recognition models used for comparison. Cultural keyword recall reaches 92.7%, more than 20 percentage points above the baseline model.
    [Limitations] Although the model performs well on several metrics, some constraints remain: the method has been validated on only three minority languages, the evaluation of cultural sensitivity relies mainly on manual annotation, and the range of application scenarios has yet to be extended.
    [Conclusions] The proposed method effectively addresses key technical challenges in training large models for minority languages. The three-space fusion architecture and the culture-sensitive mechanism mitigate poor training outcomes in small-sample scenarios, improve the accuracy of minority-language understanding and generation, and offer a feasible technical path for intelligent applications of low-resource languages.
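The two metrics reported in the results, word error rate and cultural keyword recall, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names (`edit_distance`, `wer`, `keyword_recall`) are hypothetical, tokens are assumed to be whitespace-separated (Tibetan in particular would need a proper word segmenter first), and keyword recall is assumed to mean the share of keyword tokens in the reference that also appear in the paired hypothesis.

```python
from typing import Iterable, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance between a reference and a hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(references: Iterable[str], hypotheses: Iterable[str]) -> float:
    """Corpus-level WER: total word edits divided by total reference words."""
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = ref.split()  # assumption: whitespace tokenization
        errors += edit_distance(ref_tokens, hyp.split())
        words += len(ref_tokens)
    return errors / words if words else 0.0


def keyword_recall(references: Iterable[str],
                   hypotheses: Iterable[str],
                   keywords: Iterable[str]) -> float:
    """Share of reference keyword tokens that also occur in the paired hypothesis."""
    keyword_set = set(keywords)
    expected = 0
    found = 0
    for ref, hyp in zip(references, hypotheses):
        hyp_tokens = set(hyp.split())
        for token in ref.split():
            if token in keyword_set:
                expected += 1
                if token in hyp_tokens:
                    found += 1
    return found / expected if expected else 0.0


if __name__ == "__main__":
    # Toy placeholder strings, not data from the paper.
    refs = ["a b c d", "e f"]
    hyps = ["a x c", "e f"]
    print(wer(refs, hyps))
    print(keyword_recall(refs, hyps, ["b", "e"]))
```

An average over several languages, as in the reported 16.0% figure, could be obtained by calling `wer` once per language and averaging the results; whether the paper uses that macro-average or a pooled corpus-level rate is not stated in the abstract.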

History
  • Received: 2025-11-27
  • Last revised: 2026-01-19
  • Accepted: 2026-01-20
  • Published online: