MonoDI: Monocular 3D object detection based on fusing depth instances
DOI:
Author:
Affiliation:

Nanjing Forestry University

Author biography:

Corresponding author:

Fund Project:

National Key Research and Development Program of China

Abstract:

Monocular 3D object detection aims to locate the 3D bounding boxes of objects in a single 2D input image, an extremely challenging task in the absence of image depth information. To address the poor detection performance caused by missing depth information at inference time and by background noise in depth maps, this paper proposes MonoDI, a monocular 3D object detection method that fuses depth instances. The key idea is to combine the depth information produced by an effective depth estimation network with instance segmentation masks to obtain depth instances, which are then fused with 2D image information to aid the regression of 3D object attributes. To make better use of the depth instance information, an iterative depth-aware attention fusion module (iDAAFM) is designed to fuse depth instance features with 2D image features, yielding a feature representation with clear object boundaries and depth information. In addition, a residual convolutional structure replaces the ordinary single convolution during training and inference, keeping the network stable and efficient when processing the fused information. Furthermore, a 3D bounding box uncertainty auxiliary task is designed to help the network learn bounding box generation during training and to improve the accuracy of monocular 3D object detection. The method is validated on the KITTI dataset; experimental results show that MonoDI improves the average precision of vehicle-category detection at the moderate difficulty level by 4.41 percentage points over the baseline method, outperforms comparative methods such as MonoCon and MonoLSS, and also achieves superior results in the KITTI-nuScenes cross-dataset experiment.
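To make the pipeline described above concrete, the following is a minimal PyTorch-style sketch of the depth-instance idea: an estimated depth map is masked by instance segmentation masks to suppress background depth noise, and the resulting depth instance is fused with 2D image features through a depth-aware attention block that ends in a residual convolution. The iDAAFM architecture is not detailed in this abstract, so the attention formulation, channel sizes, and module names below are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    def build_depth_instances(depth_map, instance_masks):
        # depth_map:      (B, 1, H, W) output of a monocular depth estimator
        # instance_masks: (B, 1, H, W) binary union of per-object instance masks
        # Zeroing background depth suppresses the main source of depth-map noise.
        return depth_map * instance_masks

    class DepthAwareAttentionFusion(nn.Module):
        # Illustrative fusion block: depth-instance features modulate 2D image
        # features via a sigmoid attention map, followed by a residual
        # convolution branch instead of a single plain convolution.
        def __init__(self, channels):
            super().__init__()
            self.depth_encoder = nn.Sequential(
                nn.Conv2d(1, channels, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            )
            self.attention = nn.Sequential(
                nn.Conv2d(2 * channels, channels, kernel_size=1),
                nn.Sigmoid(),
            )
            self.res_conv = nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
            )

        def forward(self, image_feat, depth_instance):
            depth_feat = self.depth_encoder(depth_instance)            # (B, C, H, W)
            attn = self.attention(torch.cat([image_feat, depth_feat], dim=1))
            fused = image_feat + attn * depth_feat                     # depth-guided fusion
            return torch.relu(fused + self.res_conv(fused))            # residual refinement

    # Toy usage with feature-map-sized inputs (one fusion iteration).
    B, C, H, W = 2, 64, 96, 320
    image_feat = torch.randn(B, C, H, W)
    depth_instance = build_depth_instances(
        torch.rand(B, 1, H, W), (torch.rand(B, 1, H, W) > 0.5).float()
    )
    print(DepthAwareAttentionFusion(C)(image_feat, depth_instance).shape)  # (2, 64, 96, 320)

The abstract also does not specify the form of the 3D bounding box uncertainty auxiliary task; the sketch below assumes a Laplacian-style uncertainty-weighted L1 loss, a common choice in uncertainty-aware monocular 3D detectors, purely for illustration.

    import torch

    def uncertainty_l1_loss(pred, target, log_sigma):
        # L1 regression term weighted by a predicted log-uncertainty; the
        # + log_sigma term keeps the network from inflating sigma arbitrarily.
        return (torch.exp(-log_sigma) * torch.abs(pred - target) + log_sigma).mean()

    pred = torch.tensor([1.2, 0.8, 4.5])            # predicted box parameters (toy values)
    target = torch.tensor([1.0, 1.0, 4.0])          # ground-truth box parameters
    log_sigma = torch.zeros(3, requires_grad=True)  # learned per-parameter uncertainty
    print(uncertainty_l1_loss(pred, target, log_sigma))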

History
  • Received: 2024-09-03
  • Revised: 2025-01-15
  • Accepted: 2025-01-30
  • Published online: