A Survey of Datasets Collection and Processing for Embodied Intelligence
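Affiliation:

1 School of Software, Tsinghua University, Beijing 100084, China; 2 Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing 100084, China

Fund Project:

National Natural Science Foundation of China (Nos. 62525103, 62271281).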


    Abstract:

    In recent years, vision-language-action (VLA) models have attracted significant attention in the field of embodied intelligence. As model scale continues to grow, their ability to generalize across complex tasks has steadily improved. These gains, in turn, rely heavily on the availability of large-scale, high-quality training data. Unlike natural language processing and computer vision, which can directly leverage massive internet data, data collection in embodied intelligence typically involves physical interactions between real robots and their environments, leading to high collection costs and complex acquisition processes. Efficiently obtaining, processing, and organizing such data has therefore become a critical challenge for advancing embodied intelligence. To address this issue, this paper provides a systematic review of data collection and processing methods in embodied intelligence. First, we summarize the major data acquisition paradigms from the perspective of data sources and collection strategies, and analyze their characteristics and limitations in terms of data quality, scalability, and collection cost. Second, we summarize a standardized processing pipeline for embodied intelligence datasets, focusing on key technical components such as action representation alignment, multimodal temporal synchronization, language semantic normalization, and data quality control. Finally, we discuss the evolving data ecosystem in embodied intelligence, highlighting current challenges and potential future directions. The analysis presented in this paper aims to provide insights for dataset construction and large-scale robot learning research in embodied intelligence.

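Cite this article:

DING Guiguang, ZHU Chen, WANG Xiaowan, CHEN Hui. A survey of datasets collection and processing for embodied intelligence[J]. Journal of Data Acquisition and Processing, 2026, (2): 332-346.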

History
  • Received: 2026-01-09
  • Revised: 2026-02-25
  • Published online: 2026-04-15