Abstract: Monocular 3D object detection aims to locate the 3D bounding boxes of objects from a single 2D input image, an extremely challenging task in the absence of image depth information. To address the poor detection performance caused by missing depth information during inference on 2D images, as well as background noise interference in depth maps, this paper proposes MonoDI, a monocular 3D object detection method that integrates depth instances. The key idea is to use depth information generated by an effective depth estimation network, combine it with instance segmentation masks to obtain depth instances, and then fuse the depth instances with 2D image information to aid the regression of 3D object attributes. To better exploit the depth instance information, this paper designs an iterative Depth-Aware Attention Fusion Module (iDAAFM), which integrates depth instance features with 2D image features to obtain a feature representation with clear object boundaries and depth information. A residual convolutional structure is then introduced during training and inference, replacing the usual single convolutional structure, to ensure the stability and efficiency of the network when processing the fused information. In addition, we design a 3D bounding box uncertainty auxiliary task that helps the network learn bounding box generation during training and improves the accuracy of monocular 3D object detection. Finally, the effectiveness of the method is validated on the KITTI dataset: experimental results show that the proposed method improves the average precision of vehicle category detection at moderate difficulty by 4.41 percentage points over the baseline method, outperforms comparative methods such as MonoCon and MonoLSS, and also achieves superior results on the KITTI-nuScenes cross-dataset evaluation.