Construction of High-Quality Dataset in Aero-engine Domain Based on Large Language Model
Authors: 邹冠沄, 王存俊, 孔寅豪, 马小庆, 李丕绩
Affiliation:

1. College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China; 2. MIIT Key Laboratory of Pattern Analysis and Machine Intelligence (Nanjing University of Aeronautics and Astronautics), Nanjing 211106, China; 3. COMAC Shanghai Aircraft Design & Research Institute, Shanghai 201210, China

    Abstract:

    With the rapid development of artificial intelligence, large language models (LLMs) are being applied in a growing number of domains. In the aero-engine domain, however, the lack of high-quality, manually written question-answering datasets has limited the application of expert question-answering models. This paper proposes an LLM-based method for automatically constructing question-answering datasets, which generates high-quality open-ended question-answering data without human intervention. In the data generation phase, the method uses in-context learning and an input-first generation strategy to make the generated data more stable. In the data filtering phase, it builds an automatic data quality evaluation mechanism that combines a faithfulness assessment based on similarity to the source text with a semantic quality assessment by an LLM, which screens out anomalous data affected by hallucination and safeguards factual reliability. Experimental results show that the method significantly improves the quality of the generated dataset, and that models instruction-tuned on it perform significantly better on aero-engine knowledge question answering. The results provide a solid foundation for applying LLMs in the aero-engine domain and a reference for automated dataset construction in other complex engineering fields.
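The two-stage filter described in the abstract can be illustrated with a minimal sketch. The abstract does not state which similarity measure, thresholds, or judging prompt the authors use, so everything below is an assumption made for illustration: `faithfulness` uses character bigram overlap as a stand-in for source-text similarity, `judge` is a hypothetical callable that would wrap an LLM and return a quality score in [0, 1], and the cut-offs `min_faithfulness` and `min_quality` are arbitrary.

```python
from typing import Callable, Dict, List


def char_ngrams(text: str, n: int = 2) -> set:
    """Return the set of character n-grams in text, ignoring whitespace."""
    compact = "".join(text.split())
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}


def faithfulness(answer: str, source: str, n: int = 2) -> float:
    """Fraction of the answer's n-grams that also occur in the source passage.

    Assumption: a simple overlap ratio stands in for the paper's
    source-text similarity measure, which the abstract does not specify.
    """
    answer_grams = char_ngrams(answer, n)
    if not answer_grams:
        return 0.0
    return len(answer_grams & char_ngrams(source, n)) / len(answer_grams)


def filter_pairs(
    pairs: List[Dict[str, str]],
    judge: Callable[[str, str], float],
    min_faithfulness: float = 0.6,  # hypothetical threshold
    min_quality: float = 0.7,       # hypothetical threshold
) -> List[Dict[str, str]]:
    """Keep only pairs that pass both the faithfulness and the quality check."""
    kept = []
    for pair in pairs:
        # Stage 1: drop answers that are poorly supported by the source text.
        if faithfulness(pair["answer"], pair["source"]) < min_faithfulness:
            continue
        # Stage 2: drop answers the (hypothetical) LLM judge scores too low.
        if judge(pair["question"], pair["answer"]) < min_quality:
            continue
        kept.append(pair)
    return kept


# Usage with a stub judge; a real judge would call an LLM.
SOURCE = "The high-pressure compressor raises the pressure of air entering the combustor."
PAIRS = [
    {"question": "What does the high-pressure compressor do?",
     "answer": "The high-pressure compressor raises the pressure of air",
     "source": SOURCE},
    {"question": "What does the high-pressure compressor do?",
     "answer": "It burns kerosene at 3000 kelvin",
     "source": SOURCE},
]
kept = filter_pairs(PAIRS, judge=lambda question, answer: 1.0)
print(len(kept))
```

Running the cheap similarity check before the judge means the more expensive LLM call is only made for answers that are already grounded in the source passage.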

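Cite this article

邹冠沄, 王存俊, 孔寅豪, 马小庆, 李丕绩. Construction of High-Quality Dataset in Aero-engine Domain Based on Large Language Model[J]. Journal of Data Acquisition and Processing (数据采集与处理), 2025, 40(3): 603-615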

History
  • Received: 2024-10-13
  • Last revised: 2025-01-15
  • Published online: 2025-06-13