Special issue

    1  Pedestrian Detection and Tracking Algorithm Based on GhostNet and Attention Mechanism
    WANG Lihui YANG Xianzhao LIU Huikang HUANG Jingjing
    2022, 37(1):108-121. DOI: 10.16337/j.1004-9037.2022.01.009
    Abstract:
    Aiming at the problems of low accuracy and slow speed when relying only on traditional object detection and tracking algorithms in complex scenes, a pedestrian detection and tracking algorithm based on GhostNet and an attention mechanism is proposed. First, the backbone network of YOLOv3 is replaced with GhostNet while the multi-scale prediction part is retained; the Ghost module reduces the parameters and computations of the deep network model, and an attention mechanism is integrated into the Ghost module to assign higher weights to important features. Then, GIoU, the direct evaluation index of object detection, is introduced to guide the regression task. Finally, the Deep-Sort algorithm is used for tracking. Experiments on public datasets show that the mean average precision (mAP) of the improved model reaches 92.53% and its frame rate is 2.5 times that of the YOLOv3 model; the tracking accuracy of the proposed algorithm is better than that of the original model and of other algorithms; and the algorithm can track multiple pedestrians in complex scenes accurately and effectively, with strong robustness.
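The GIoU index introduced above to guide box regression is a standard quantity; as an illustrative sketch (not the authors' implementation), it can be computed for axis-aligned boxes as:

```python
def giou(box_a, box_b):
    """Generalized IoU between two boxes given as (x1, y1, x2, y2).

    GIoU = IoU - |C \\ (A u B)| / |C|, where C is the smallest box
    enclosing both. Unlike plain IoU, it stays informative (negative)
    even when the boxes do not overlap, which is why it is useful as
    a regression target.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    # Intersection rectangle (clamped to zero if disjoint)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing box C
    cx1, cy1 = min(ax1, bx1), min(ay1, by1)
    cx2, cy2 = max(ax2, bx2), max(ay2, by2)
    area_c = (cx2 - cx1) * (cy2 - cy1)
    return iou - (area_c - union) / area_c
```

The corresponding training loss is typically taken as 1 - GIoU.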
    2  Power Target Detection in Aerial Images Based on SSD Deep Neural Network
    SHI Xin Hua Chenbing ZHANG Kai WANG Caijian WANG Shiyong
    2022, 37(1):207-216. DOI: 10.16337/j.1004-9037.2022.01.018
    Abstract:
    To improve the intelligent design of the rural power distribution network, this paper proposes to identify, with deep neural networks, the typical power targets in aerial images that affect distribution network design. Firstly, we use a UAV to obtain high-spatial-resolution aerial images of the distribution network planning area and construct a dataset containing 32,118 typical power targets in 11 categories. Then, through a practical comparison of the Faster-RCNN, YOLO and single shot multibox detector (SSD) methods, SSD is selected to detect and identify the typical power targets. Finally, feasible areas for distribution network pole planning are obtained. Experimental results show that, compared with Faster-RCNN and YOLO, SSD can effectively detect and identify typical power targets such as substations, distribution rooms and box transformers, with a recognition accuracy of 68.5%, which meets practical requirements. The proposed method provides technical support for power design, reduces labor cost and improves the efficiency of distribution network design.
    3  JPEG Image Digital Watermarking System Based on FPGA
    CHEN Xin SHI Dong ZHANG Ying
    2022, 37(1):240-246. DOI: 10.16337/j.1004-9037.2022.01.021
    Abstract:
    This paper designs a JPEG compressed-domain digital watermarking system based on FPGA, realizing real-time embedding of watermark information in JPEG images. After the watermark information is preprocessed by binarization and the Arnold transform, the watermark is embedded into the quantized DCT coefficients with an improved LSB embedding algorithm. Then, to complete the JPEG compressed-domain watermark embedding, the modified DCT coefficients are entropy coded and a JPEG-encoded file is generated. Finally, the design is implemented and tested on a joint system of an FPGA development board and a host computer. The results show that the proposed algorithm achieves good invisibility and robustness as well as high throughput.
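The Arnold transform used above to preprocess the watermark is a standard scrambling map. The paper's implementation is FPGA hardware; the following is only a minimal software sketch of the forward and inverse maps on a square N x N block:

```python
def arnold_scramble(img, iterations=1):
    """Arnold cat map scrambling of a square N x N block (img[y][x]).

    Each pixel (x, y) moves to ((x + y) mod N, (x + 2y) mod N). The map
    is a bijection, so the watermark can be recovered by applying the
    inverse map the same number of times.
    """
    n = len(img)
    for _ in range(iterations):
        out = [[0] * n for _ in range(n)]
        for y in range(n):
            for x in range(n):
                out[(x + 2 * y) % n][(x + y) % n] = img[y][x]
        img = out
    return img

def arnold_unscramble(img, iterations=1):
    """Inverse Arnold map: (x, y) = ((2x' - y') mod N, (y' - x') mod N)."""
    n = len(img)
    for _ in range(iterations):
        out = [[0] * n for _ in range(n)]
        for y in range(n):
            for x in range(n):
                out[(y - x) % n][(2 * x - y) % n] = img[y][x]
        img = out
    return img
```

The map is also periodic: after a block-size-dependent number of iterations it returns to the identity, which is another route to recovery.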
    4  Overview of Non-Line-of-Sight Imaging Technology Based on Transient Images
    LIANG Yun SONG Boyan
    2022, 37(1):21-34. DOI: 10.16337/j.1004-9037.2022.01.002
    Abstract:
    A transient image is a fast image sequence recording how a scene responds to a light pulse. By capturing information in the time dimension, transient imaging exploits the scene information contained in the time domain, and non-line-of-sight imaging is its most typical application in the field of scene analysis. Non-line-of-sight imaging is a technology for imaging objects or scenes outside the line of sight, and has emerged worldwide in recent years. This paper classifies transient imaging methods according to their imaging mechanisms, and compares a variety of non-line-of-sight imaging algorithms based on transient images according to their algorithmic principles and implementation effects. Finally, the challenges of non-line-of-sight imaging based on transient images are summarized and future development directions are discussed.
    5  Change Detection of Remote Sensing Image Based on Siamese Multi-scale Attention Network and Its Anti-noise Ability Research
    DU Junhan LAI Jian WANG Xue TAN Kun
    2022, 37(1):35-48. DOI: 10.16337/j.1004-9037.2022.01.003
    Abstract:
    Remote sensing image change detection has achieved great breakthroughs in the field of land cover observation. However, noise in remote sensing images degrades the performance of change detection methods. To improve detection accuracy, a change detection method based on the Siamese multi-scale attention network (SMA-Net) is proposed. Firstly, we combine atrous convolutional layers with different dilation rates and a spatial attention module into a multi-scale feature extraction module. Then, the feature maps on the same layer are subtracted to obtain difference feature maps, and a channel attention mechanism is used to enhance the feature extraction effect. Finally, the change detection result is output by fully connected layers. The proposed method is compared with other change detection methods on original remote sensing image data with and without added noise. The experimental results show that change detection methods which use the spectral information of a single pixel as input, such as the support vector machine method, are susceptible to image noise, while convolutional neural network (CNN) based methods are much less susceptible. The proposed SMA-Net outperforms the other methods in accuracy and is less susceptible to image noise.
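The atrous (dilated) convolutions combined in the multi-scale module above can be illustrated in 1-D. This sketch (not the authors' implementation) shows how a dilation rate widens the receptive field without adding weights:

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1-D convolution with a dilation rate.

    A dilation rate d inserts d-1 gaps between kernel taps, so a kernel
    of size k covers a receptive field of d*(k-1)+1 samples with the
    same number of parameters; stacking layers with different rates is
    what gives the multi-scale feature extraction module its reach.
    """
    k = len(kernel)
    span = dilation * (k - 1) + 1  # receptive field of one output sample
    return [
        sum(kernel[j] * signal[i + j * dilation] for j in range(k))
        for i in range(len(signal) - span + 1)
    ]
```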
    6  Multi-size Occlusion Face Detection Based on Hierarchical Attention Enhancement Network
    WANG Linge JIANG Baojun PAN Tiejun
    2022, 37(1):73-81. DOI: 10.16337/j.1004-9037.2022.01.006
    Abstract:
    Based on the single shot multibox detector (SSD) single-stage face detection model, this paper proposes a multi-size occluded face detection method based on a hierarchical attention enhancement network to address the poor accuracy of face detection under complex partial occlusion. Firstly, on the multi-layer original feature maps of the SSD base network, an attention enhancement mechanism is introduced to raise the response of the visible regions of the face. Then, different anchor sizes are designed for different enhanced feature layers to improve the hierarchical recognition of multi-scale occluded faces. During training, the attention loss, classification loss and regression loss are fused into a multi-task loss function to jointly optimize the network parameters. Experiments on the WIDER FACE dataset and the MAFA occluded face dataset show that the detection accuracy and speed of the method are better than those of current mainstream occluded face detection methods.
    7  Defogging Algorithm Based on Power Exponent Stretching
    LI Zhongguo WU Haochen FU Qigao XI Qian WU Jinkun
    2022, 37(1):62-72. DOI: 10.16337/j.1004-9037.2022.01.005
    Abstract:
    After comparing the three RGB (red-green-blue) channels and the three HSV (hue-saturation-value) channels of clear and foggy pictures of the same scene, a haze removal algorithm based on power exponent stretching is proposed. Firstly, the image is transformed from RGB to HSV space. Then the saturation and brightness components are stretched with power exponents of 1–3 and adjusted to their suitable ranges. After the stretching transformation, the image is transformed from HSV back to RGB space to generate the enhanced defogged image. Taking the mean saturation, brightness index, information entropy and contrast as defogging evaluation indexes, the optimal combination of stretching power exponents is determined and used to complete the defogging process. Meanwhile, whether to search for the optimal power exponents again is decided according to the change of the image's average saturation or the length of the time interval. Finally, the defogging algorithm is implemented with multi-process programming in Python. For an image resolution of 400 pixel × 300 pixel, optimizing the power exponent parameters on a Raspberry Pi takes 5.077–6.160 s. For per-frame defogging, the first frame takes a longer time of 0.308 s, while subsequent frames take 0.077–0.168 s each.
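The core per-image pipeline described above can be sketched in a few lines. This is an illustrative sketch with assumed power values (the paper searches per scene for the optimal pair, and its range-adjustment step is approximated here by rescaling each stretched channel back to its original min-max range):

```python
import colorsys

def dehaze(pixels, s_power=2.0, v_power=2.0):
    """Power-exponent stretch dehazing sketch.

    pixels: list of (r, g, b) tuples with components in [0, 1].
    Pipeline: RGB -> HSV, raise the S and V channels to a power in the
    1-3 range, rescale each stretched channel to its original
    [min, max] range, then HSV -> RGB.
    """
    hsv = [colorsys.rgb_to_hsv(*p) for p in pixels]

    def stretch(vals, power):
        lo, hi = min(vals), max(vals)
        stretched = [v ** power for v in vals]
        slo, shi = min(stretched), max(stretched)
        scale = (hi - lo) / (shi - slo) if shi > slo else 0.0
        return [lo + (v - slo) * scale for v in stretched]

    s_new = stretch([s for _, s, _ in hsv], s_power)
    v_new = stretch([v for _, _, v in hsv], v_power)
    return [colorsys.hsv_to_rgb(h, s, v)
            for (h, _, _), s, v in zip(hsv, s_new, v_new)]
```

In practice the stretch would run over a full image array (e.g. with NumPy or OpenCV) rather than a Python list of pixels.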
    8  Methane Premixed Flame Equivalence Ratio Measurement Based on Feature Engineering and Support Vector Machine
    CHEN Changyou FU Yuwen TU Peichi SHU Wen YANG Jiansheng
    2022, 37(1):194-206. DOI: 10.16337/j.1004-9037.2022.01.017
    Abstract:
    Flame equivalence ratio measurement using flame color modeling is an emerging research direction in combustion diagnosis technology. At present, modeling methods mainly use the blue/green color ratio (B/G) in the RGB (red-green-blue) model as the modeling input; however, equivalence ratio modeling by fitting a single color ratio suffers from large uncertainty and measurement errors. Therefore, this paper proposes to use multi-color feature parameters under different color models as the modeling inputs. Firstly, the digital flame color distribution (DFCD) technique is used to process the methane premixed flame images and obtain the region of interest (RoI). Secondly, the flame color feature variables are comprehensively analyzed, and 36 multi-color features under different color models are designed and extracted. Then, Spearman rank correlation analysis and the random forest (RF) algorithm are used to screen the features, and 16-dimensional high-quality features are selected. Finally, the optimal support vector machine (SVM) parameters are selected using the grid search method (GSM), and the equivalence ratio measurement model of the premixed methane flame is trained by SVM on the constructed feature subset. The algorithm is compared with the traditional BP neural network and the extreme learning machine (ELM) algorithm. Experimental results show that the algorithm has a better regression prediction effect, with the mean square error (MSE) decreasing to 0.023.
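The Spearman rank correlation used above for feature screening can be sketched as follows (illustrative only, with no tie handling, so it assumes distinct values; production code such as scipy.stats.spearmanr averages tied ranks):

```python
def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Features whose |rho| with the target is low, or which are highly
    correlated with an already-kept feature, can be dropped before
    SVM training.
    """
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```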
    9  Comparative Analysis of EEG Time-Frequency Features of Motor Execution and Motor Imagination Under Visual Guidance
    WU Biao QIN Bing WU Xin ZHOU Lu QIAN Zhiyu LI Weitao GAO Fan ZHU Qiaoqiao
    2022, 37(1):164-172. DOI: 10.16337/j.1004-9037.2022.01.014
    Abstract:
    Brain-computer interface (BCI) technology based on motor imagery (MI) has developed rapidly in the past few decades and been widely used in various fields. To compare the differences in brain electrical activity between motor execution (ME) and MI, a method based on time-frequency domain analysis of the electroencephalogram (EEG) is proposed. Visually guided upper-limb ME and MI control experiments are conducted, and the EEG signals of ten healthy subjects are collected and preprocessed. The signals are then decomposed and converted into eigenvalues of each frequency band through time-frequency analysis. Finally, the power values of each band under ME and MI are analyzed and the power differences between them are computed. The results show that the alpha wave is dominant during MI while the delta wave is dominant during ME; compared with MI, the alpha wave during ME shows a downward trend and the delta wave an upward trend. These results show a significant difference in EEG between ME and MI, which is important for improving the real-time and general performance of MI-based BCI systems.
    10  Dual-Path Siamese Network Visual Tracking Method with Attention Mechanism
    XIE Jiang ZHU Yan SHEN Tao ZENG Kai LIU Yingli
    2022, 37(1):94-107. DOI: 10.16337/j.1004-9037.2022.01.008
    Abstract:
    Traditional visual tracking methods based on the Siamese network extract pairs of frames from a large number of videos and train on them offline and independently. They lack model feature updates and neglect background information, so their tracking accuracy is relatively low in complex environments such as background clutter. In response to these problems, this paper proposes a dual-path Siamese network visual tracking method with an attention mechanism. The method mainly consists of a feature extractor part and a feature fusion part. In the feature extractor, the residual network is improved and a dual-path network model is designed: by combining the residual network's reuse of features from earlier layers with the dense network's extraction of new features, the two networks are spliced for feature extraction. At the same time, dilated convolution replaces traditional convolution, which improves the resolution while maintaining a sufficient receptive field. This dual-path feature extraction can implicitly update the model features, so as to obtain more accurate image feature information. Moreover, an attention mechanism is introduced into the feature fusion part, which assigns different weights to different parts of the feature maps. In the channel domain, the method screens out valuable target image information and enhances the interdependence between channels; in the spatial domain, it pays more attention to locally important information and learns richer contextual connections, which effectively improves the accuracy of object tracking. To confirm the effectiveness of the method, experiments are conducted on the OTB100 and VOT2016 datasets. Using precision, success rate and expected average overlap as the evaluation criteria, the method scores 0.868, 0.641 and 0.350 respectively on the two datasets, increases of 5.1%, 2.0% and 0.9% over the benchmark model. Experimental results show that the proposed method makes full use of the advantages of the different networks and, while ensuring model accuracy, adapts well to target deformation, reduces interference between similar objects, and achieves a more stable tracking effect.
    11  Medical Image Synthesis Based on Optimized Cycle-Generative Adversarial Networks
    CAO Guogang LIU Shunkun MAO Hongdong ZHANG Shu CHEN Ying DAI Cuixia
    2022, 37(1):155-163. DOI: 10.16337/j.1004-9037.2022.01.013
    Abstract:
    The radiation treatment planning system needs to calculate the dose distribution accurately based on CT images, but sometimes only clinical MR images can be obtained. Image synthesis effectively creates images of a new modality from another modality, enhancing the available image information. This paper presents a new method for synthesizing high-precision, high-definition CT images from MR images. To synthesize clear pseudo-CT images, an improved cycle-consistent generative adversarial network (CycleGAN) with a densely connected convolutional network (DenseNet) is proposed. By avoiding the loss of input information and the vanishing of gradients, the improved network can synthesize more credible CT images. Trained and tested on a dataset of 18 patients, the proposed method reduces the mean absolute error by 5.9%, increases the structural similarity by 1.1% and increases the peak signal-to-noise ratio by 4.4% compared with the original method. Compared with a deep convolutional neural network and an atlas-based method, the improved CycleGAN reduces the relative error by 0.065% and 0.55%, respectively. Owing to the advantages of the deep learning model, the proposed method can synthesize more realistic CT images, better meeting the requirements of dose calculation in radiation treatment planning systems.
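Two of the metrics reported above, mean absolute error and peak signal-to-noise ratio, are standard quantities and can be computed as follows (an illustrative sketch over flattened images; the authors' evaluation code is not given in the abstract):

```python
import math

def mae(a, b):
    """Mean absolute error between two equally sized images (flat lists)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB; higher means the synthesized
    pseudo-CT is closer to the reference CT."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)
```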
    12  Person Re-identification Based on Hard Negative Sample Confusion to Enhance Robustness of Features
    Hao Ling Duan Jizhong Pang Jian
    2022, 37(1):122-133. DOI: 10.16337/j.1004-9037.2022.01.010
    Abstract:
    With the rise of deep learning, person re-identification has gradually become a hot topic in the computer vision field. It performs cross-camera retrieval from a given query image, finding the images that match the query identity. However, due to factors such as background and illumination differences across cameras, collected pedestrian datasets contain a large number of hard negative samples, and models trained on such data perform poorly and lack robustness. Therefore, to improve the model's ability to discriminate such negative samples, a novel method is designed that synthesizes images carrying hard-negative-sample information through a confusion factor. For each input batch of images, a similarity measure is used to find the hard negative sample corresponding to each image, new images carrying negative-sample cues are synthesized through the confusion factor, and the model is prompted to mine the negative-sample information in a supervised manner, thus improving model robustness. Extensive comparative experiments show that the proposed method achieves high performance on mainstream datasets, and an ablation study proves its effectiveness.
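The mining step (finding, for each image in a batch, its most similar different-identity sample) can be sketched with cosine similarity over embeddings. This is illustrative only; the paper's confusion-factor synthesis that blends each image with its hard negative is not shown:

```python
def hardest_negatives(features, labels):
    """Return, for each embedding, the index of its most similar
    sample with a different label (its hard negative)."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv)

    result = []
    for f, lab in zip(features, labels):
        # Highest-similarity candidate among different-identity samples
        best = max(
            (cos(f, g), j)
            for j, (g, m) in enumerate(zip(features, labels)) if m != lab
        )
        result.append(best[1])
    return result
```

A synthesized confusion image would then be a weighted blend of each image with the sample at the returned index.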
    13  Dam Crack Detection Method Based on Universal Target Detector
    ZHAO Fan LI Linyun WEI Renjie ZHANG Zhiwei
    2022, 37(2):405-414. DOI: 10.16337/j.1004-9037.2022.02.013
    Abstract:
    Aiming at the problem that existing dam disease detection methods can only roughly locate the area where a crack lies, a dam crack extraction method based on a universal target detector is proposed. Firstly, a two-target detector is designed to detect the crack area and the water stain area simultaneously as two independent targets in the image. Secondly, the geometric position relationship between a crack area and the water stain area associated with the same crack is established. Finally, the upper boundary of the water stain box contained in the crack box is uniformly sampled, and curve fitting is performed on the sampling points to obtain the crack curve. The experimental results show that the proposed algorithm not only accurately detects the crack and water stain boxes but also fits the crack curve completely, and it has been effectively verified in the detection of dam defects of millimeter-level width.
    14  Natural Scene Text Detection Based on Local and Global Dual-feature Fusion
    LI Yunhong YAN Junhong HU Lei
    2022, 37(2):415-425. DOI: 10.16337/j.1004-9037.2022.02.014
    Abstract:
    The shape, direction and category of text in natural scenes vary widely, and scene text detection remains a challenge. To better separate text from non-text and accurately locate text areas in natural scene images, this paper proposes a text detection network that fuses local and global features. Multi-scale global feature fusion is realized through skip connections, and the constant residual block is improved to realize local fine-grained feature fusion, thereby reducing the loss of feature information and strengthening feature extraction in text regions. The combination of a polygon-offset text field and text edge information is used to locate text regions accurately. To evaluate the effectiveness of the method, multiple sets of comparative experiments are conducted on the classic ICDAR2015 and CTW1500 datasets. The experimental results show that the method performs better at text detection in complex scenes.
    15  Survey on New Progresses of Deep Learning Based Computer Vision
    LU Hongtao LUO Mukun
    2022, 37(2):247-278. DOI: 10.16337/j.1004-9037.2022.02.001
    Abstract:
    Deep learning has recently achieved great breakthroughs in several fields of computer vision. Various new deep learning methods and deep neural network models have been proposed, and their performance records are constantly being refreshed. This paper surveys the new progress in applications of deep learning to computer vision since 2016, with emphasis on typical networks and models. We first review the mainstream deep neural network models for image classification, including standard models and lightweight models. Then, we introduce the main methods and models for different computer vision fields, including object detection, image segmentation and image super-resolution. Finally, we summarize neural architecture search methods.
    16  Image Interpolation-Based Few-Shot Learning of Handwritten Digit Recognition
    SONG Wei XIE Jianping GAO Qian XIE Liangxu XU Xiaojun
    2022, 37(2):298-307. DOI: 10.16337/j.1004-9037.2022.02.004
    Abstract:
    The high performance of artificial intelligence (AI) usually depends on large amounts of data for training parameters. How to improve predictive performance when data are insufficient, i.e., few-shot learning, is one of the important research subjects in the AI field. An image interpolation-based few-shot learning strategy is proposed, and its feasibility is verified on the task of handwritten digit image recognition. The few-shot learning performance of dense neural networks and convolutional neural networks on MNIST and USPS handwritten digit recognition is systematically studied. The results show that the image interpolation-based data augmentation method clearly improves the feature extraction ability and learning efficiency of neural networks on small-sample data. Moreover, selecting an appropriate scaling coefficient for the image interpolation can further optimize the few-shot learning performance.
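As an illustrative sketch of interpolation-based rescaling of digit images (assuming bilinear interpolation on grayscale images of at least 2 x 2 pixels; the paper's exact interpolation kernel and scaling coefficients are not specified in the abstract):

```python
def bilinear_resize(img, new_h, new_w):
    """Bilinear resize of a grayscale image given as a list of rows.

    Assumes the source image and target size are at least 2 x 2.
    Rescaled copies like this can serve as extra training samples;
    the scaling coefficient itself is the tunable knob the abstract
    highlights for few-shot performance.
    """
    h, w = len(img), len(img[0])
    out = []
    for i in range(new_h):
        y = i * (h - 1) / (new_h - 1)       # source row coordinate
        y0 = min(int(y), h - 2)
        dy = y - y0
        row = []
        for j in range(new_w):
            x = j * (w - 1) / (new_w - 1)   # source column coordinate
            x0 = min(int(x), w - 2)
            dx = x - x0
            top = img[y0][x0] * (1 - dx) + img[y0][x0 + 1] * dx
            bot = img[y0 + 1][x0] * (1 - dx) + img[y0 + 1][x0 + 1] * dx
            row.append(top * (1 - dy) + bot * dy)
        out.append(row)
    return out
```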
    17  Dynamic Visual SLAM Based on Unified Geometric-Semantic Constraints
    Shen Yehu Chen Jiahao Li Xing Jiang Quansheng Xie Ou Niu Xuemei Zhu Qixin
    2022, 37(3):597-608. DOI: 10.16337/j.1004-9037.2022.03.010
    Abstract:
    Traditional visual simultaneous localization and mapping (SLAM) algorithms rely on the scene rigidity assumption. However, when dynamic objects exist in the scene, the stability of the SLAM system is affected and the accuracy of pose estimation is reduced. Most existing methods apply probabilistic strategies and geometric constraints to reduce the impact of a small number of dynamic objects, but they fail when the number of dynamic objects in the scene is high. To deal with this problem, a novel algorithm is proposed that combines a dynamic visual SLAM algorithm with a multi-target tracking algorithm. A semantic instance segmentation network together with geometric constraints is introduced to help the visual SLAM module effectively separate static feature points from dynamic ones, while also achieving better multi-target tracking performance. Furthermore, the trajectory and velocity of the moving objects can be estimated, providing decision information for autonomous robot navigation. Experimental results on the KITTI dataset show that the localization accuracy of the proposed algorithm is improved by about 28% compared with the ORB-SLAM2 algorithm in dynamic environments.
    18  Virtual Try-on Network for Graduation Photo Generation
    SHENG Peizhuo LI Tingyu LI Tianbao SONG Dan LIU An’an
    2022, 37(5):1145-1156. DOI: 10.16337/j.1004-9037.2022.05.019
    Abstract:
    To solve the problem that existing virtual try-on methods cannot be applied to academic dress, a virtual try-on method oriented to academic dress generation is proposed. The method first trains an image-based virtual try-on network composed of a clothing deformation module and a virtual try-on module, and then generates try-on results from a portrait and an academic dress image through the trained network. The generated academic dress try-on results are then composited with a specific background through a background fusion module. For the experiments, this paper constructs a new dataset of academic dress and long skirts. The experimental results show that the proposed algorithm greatly reduces the influence of the clothes in the original portrait on the academic dress try-on, completes the try-on task well, and generates more satisfactory fitting results.
    19  A Survey on Application of Deep Learning in Photoacoustic Image Reconstruction from Limited-View Sparse Data
    SUN Zheng HOU Yingsa
    2022, 37(5):971-983. DOI: 10.16337/j.1004-9037.2022.05.001
    Abstract:
    Photoacoustic imaging (PAI) is a newly emerging hybrid functional imaging modality. High-quality image reconstruction is the key to improving imaging accuracy. Incomplete photoacoustic (PA) measurements usually reduce the imaging depth and the quality of images rendered by conventional reconstruction techniques such as back projection (BP), time reversal (TR), and delay and sum (DAS). Iterative algorithms can solve this issue to a certain extent, at the cost of a high computational burden and a properly selected regularization tool. In recent years, deep learning (DL) has exhibited promising performance in the field of medical imaging and has shown great potential in reconstructing images with high quality and high efficiency. This paper surveys DL-based PA image reconstruction from sparsely sampled, limited-view data. The current methods are summarized and classified, and their advantages and limitations are discussed.
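The delay-and-sum baseline mentioned above is simple enough to sketch. For each image point, each sensor's recording is sampled at the time of flight from that point to the sensor; the sound speed and sampling rate below are typical assumed values, and real reconstructions add apodization and interpolation:

```python
import math

def delay_and_sum(signals, sensor_xy, grid_xy, c=1500.0, fs=40e6):
    """Delay-and-sum (DAS) photoacoustic reconstruction sketch.

    signals:   per-sensor sample lists
    sensor_xy: sensor positions in meters
    grid_xy:   image points in meters
    c:         assumed sound speed (m/s), fs: sampling rate (Hz)
    Sparse or limited-view sensor coverage is exactly what produces
    the streak artifacts the surveyed DL methods aim to remove.
    """
    image = []
    for gx, gy in grid_xy:
        acc = 0.0
        for sig, (sx, sy) in zip(signals, sensor_xy):
            # Sample index at the point-to-sensor time of flight
            idx = int(round(math.hypot(gx - sx, gy - sy) / c * fs))
            if 0 <= idx < len(sig):
                acc += sig[idx]
        image.append(acc)
    return image
```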
    20  A Privacy-Preserving Medical Image Classification Scheme Based on Gray Code Scrambling and Block Chaotic Scrambling
    Chen Guoming Yuan Zeduo Long Shun Mai Shutao
    2022, 37(5):984-996. DOI: 10.16337/j.1004-9037.2022.05.004
    Abstract:
    This paper proposes a medical image encryption scheme combining Gray code scrambling and block chaotic scrambling, optimized for medical image encryption (GBCS), and applies it to privacy-preserving classification. First, the image is sliced into bit-planes. Then, the different bit-planes are scrambled by Gray code and divided into blocks, and chaotic encryption is carried out on these blocks. Finally, the encrypted images are classified by a deep learning network. We quantitatively analyze the privacy protection and classification performance of GBCS through cross-validation on public breast cancer and glaucoma datasets, and perform a security analysis of the method via histograms, information entropy, and resistance to attacks. The experimental results prove the effectiveness of our method: the performance gap of medical images before and after GBCS encryption is within an acceptable range. The proposed scheme better balances the contradiction between performance and privacy protection requirements, and effectively resists adversarial-sample attacks.
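The Gray code behind the bit-plane scrambling is the standard binary-reflected code. A minimal sketch of the forward and inverse mappings (how the scheme applies them to bit-planes and pixel indices is simplified away here):

```python
def to_gray(n):
    """Binary-reflected Gray code of n: consecutive integers differ in
    exactly one bit, which is what makes the code useful as an
    invertible, bit-level scrambling map."""
    return n ^ (n >> 1)

def from_gray(g):
    """Invert the Gray code by cascading XORs of right-shifted copies."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n
```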
    21  Blind Ultrasound Image Deblurring via Quadratic Sparse Extreme Channel Prior
    MA Qian HUANG Chengquan ZHENG Zehong
    2022, 37(5):1092-1100. DOI: 10.16337/j.1004-9037.2022.05.014
    Abstract:
    A blurry ultrasound image may not be sparse enough after extreme channel prior deblurring, so the extreme channel sparsity constraint may not hold. Therefore, to make full use of the image channel information, a blind ultrasound image deblurring algorithm based on a quadratic sparse extreme channel prior is proposed, which enhances the sparsity of the deblurred ultrasound image. First, relevant theoretical proofs and experiments illustrate the feasibility of the quadratic sparse extreme channel prior for constraining blurry ultrasound images. Then, making full use of the prior information of the dark and bright channels, the half-quadratic splitting method is used to estimate the intermediate image and the blur kernel. Finally, the Fourier transform is used to obtain the final clear image and blur kernel. Experimental results on an ultrasound image set show the feasibility and superiority of the proposed algorithm compared with other current ultrasound image deblurring methods.
    22  A Polarization Image Fusion Method of Visible Light in Water Navigation Scene
    JIANG Yang XIAO Changshi WEN Yuanqiao ZHAN Wenqiang CHEN Qianqian
    2022, 37(6):1376-1390. DOI: 10.16337/j.1004-9037.2022.06.018
    Abstract:
    To improve the visual perception ability of unmanned surface vehicles (USVs) in harsh navigation scenes, a visible-light polarization image fusion method for water navigation scenes is proposed based on the hue-saturation-value (HSV) color space. Fusion rules for different regions are formulated according to the polarization characteristics of the water navigation scene. Based on the HSV color space, the color information of the original scene is fused, and tests verify that the method supports semantic segmentation of harsh navigation scene images; most strikingly, the pixel accuracy (PA) in the flare scene reaches 0.7682. The experimental results indicate that the proposed method can enhance image contrast, highlight edge contour information, and stably obtain feature information with strong contrast as well as better target characteristics in harsh navigation scenes, improving the USV's performance in such scenes to a certain extent.
    23  Improved Faster RCNN Algorithm for Moyamoya Disease Detection
    XU Jiawei WU Jie LEI Yu GU Yuxiang
    2022, 37(6):1391-1400. DOI: 10.16337/j.1004-9037.2022.06.019
    Abstract:
    To prevent complications of moyamoya disease from threatening patients' lives, timely and effective diagnosis is needed. An improved Faster RCNN algorithm for moyamoya disease detection is presented. Firstly, digital subtraction angiography (DSA) images of the internal carotid artery are extracted and enhanced, then split into training, validation and test sets at a ratio of 6:2:2. The ResNet101 network is used as the feature extraction network to avoid blurring or loss of vascular features during convolution and pooling, and is combined with a region proposal network (RPN) to locate moyamoya disease lesions. Then, ROI pooling in the Faster RCNN model is replaced with ROI Align for feature mapping, avoiding the errors caused by quantization. The average precision (AP) is used as the evaluation index of detection performance; the APs for normal samples and moyamoya disease samples are 99.23% and 89.39%, respectively. Experimental results show that the proposed method achieves rapid and effective detection of moyamoya disease: it accurately locates lesions within the complex vascular network and provides technical support for the computer-aided diagnosis of moyamoya disease.
    24  Blind Image Denoising and Deblurring by Total Variational Extreme Channels Prior
    HU Xue HUANG Chengquan FENG Run ZHOU Lihua ZHENG Lan
    2022, 37(3):643-656. DOI: 10.16337/j.1004-9037.2022.03.014
    [Abstract](929) [HTML](681) [PDF 4.22 M](2658)
    Abstract:
    Image priors are the key to solving ill-posed problems in image restoration. Since the extreme channels prior deblurring algorithm easily produces ringing artifacts and cannot suppress noise when the image contains significant noise, we take advantage of total variation based methods, which remove noise while preserving edge features, and propose an effective blind image denoising and deblurring model combining total variation with the extreme channels prior. First, we introduce the total variational model in the dark channel and the bright channel to protect image edges and eliminate noise and ringing artifacts. Second, the half quadratic splitting technique is used to solve the non-convex problem of the model and estimate the clear image. Finally, the blur kernel of the image is estimated by iterative multi-scale blind deconvolution. Experimental results show that the proposed model can effectively protect the edge details of the image and eliminate ringing artifacts while suppressing noise. Compared with representative methods of recent years, the robustness, subjective visual effects and objective evaluation indexes of the model are significantly improved.
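Total variation works as a prior because noise inflates it while piecewise-smooth content keeps it small. A minimal sketch of the (anisotropic) TV of an image, with toy 2×2 patches — illustration only, not the paper's model:

```python
def total_variation(img):
    """Anisotropic total variation: sum of absolute forward differences.
    Smooth regions contribute little; edges and noise contribute a lot."""
    h, w = len(img), len(img[0])
    tv = 0.0
    for i in range(h):
        for j in range(w):
            if i + 1 < h:
                tv += abs(img[i + 1][j] - img[i][j])
            if j + 1 < w:
                tv += abs(img[i][j + 1] - img[i][j])
    return tv

flat = [[1.0, 1.0], [1.0, 1.0]]   # constant patch: TV = 0
edge = [[0.0, 1.0], [0.0, 1.0]]   # one clean vertical edge: TV = 2
```

Minimizing a data-fidelity term plus this quantity removes noise while a single sharp edge costs no more than a blurred one, which is why TV preserves edges.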
    25  Dual-Attention Network for Acute Pancreatitis Diagnosis with CT Images
    Zhang Jinyi Wan Peng Sun Liang Zhang Daoqiang
    2022, 37(1):147-154. DOI: 10.16337/j.1004-9037.2022.01.012
    [Abstract](992) [HTML](1771) [PDF 2.27 M](2798)
    Abstract:
    Acute pancreatitis (AP) is one of the most common digestive diseases, yet image-based analysis of AP still depends on simple manual features with low efficiency and accuracy, which is not commensurate with AP's harmfulness. Due to the anatomical variation of the pancreas and the complications of AP, imaging manifestations are complex and the appearance of lesions varies greatly across patients and lesion types, making the diagnosis of acute pancreatitis from CT images challenging. To address these issues, we propose a dual-attention network for acute pancreatitis diagnosis. Specifically, the network uses the global feature to generate a local attention feature for each local feature at different stages, and final classification is facilitated by fusing multi-scale attention features that focus on lesions of different scales. Meanwhile, channel-domain attention produces attention features based on the dependencies between channels to improve the model's feature representation ability. We evaluate the proposed method on a collected real acute pancreatitis dataset. Results show that the proposed network achieves superior performance in acute pancreatitis diagnosis compared with several competing methods, with the sensitivity improved by 3.4%, and the improvement in area under the curve (AUC) it brings to ResNet is 2.7% higher than that of other attention models such as SENet.
    26  Regional Active Contour Image Segmentation Model Based on Local Entropy
    LI Meng ZHAN Yi WANG Yan
    2023, 38(3):586-597. DOI: 10.16337/j.1004-9037.2023.03.008
    [Abstract](603) [HTML](518) [PDF 5.69 M](1341)
    Abstract:
    To solve the problem that regional active contour models cannot effectively segment weak targets, a regional active contour model with local entropy constraints is proposed for image segmentation. Firstly, the image is divided into two feature regions based on local entropy information. Then a local entropy binary fitting energy is constructed using local entropy feature information, and finally a level set evolution equation is obtained, which is combined with a region-scalable fitting (RSF) model. The model considers both the clustering characteristics of the gray distribution and the statistical information of local image areas, and is effective in handling intensity inhomogeneity, weak edge segmentation, and flexible contour initialization. Medical image experiment results verify the effectiveness of the proposed model.
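The local entropy feature used above can be sketched directly: the Shannon entropy of the gray-level distribution inside a small window, which is low in homogeneous regions and high where gray levels are mixed. A toy pure-Python version (illustration only, with made-up 3×3 images):

```python
import math

def local_entropy(img, i, j, r=1):
    """Shannon entropy of the gray levels in a (2r+1)^2 window centered
    at (i, j); low entropy marks homogeneous regions."""
    window = [img[y][x]
              for y in range(max(0, i - r), min(len(img), i + r + 1))
              for x in range(max(0, j - r), min(len(img[0]), j + r + 1))]
    n = len(window)
    probs = [window.count(v) / n for v in set(window)]
    return -sum(p * math.log2(p) for p in probs)

uniform = [[5] * 3 for _ in range(3)]            # one gray level
mixed = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]        # nine distinct levels
e0 = local_entropy(uniform, 1, 1)                # 0.0
e1 = local_entropy(mixed, 1, 1)                  # log2(9)
```

Thresholding this map is one simple way to split an image into the two feature regions the model fits separately.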
    27  Super-Resolution Reconstruction of Single Image Based on Convolutional Neural Network Gradient and Texture Compensation
    HUANG Yuqing LI Huafeng YUAN Ming ZHANG Yafei
    2023, 38(5):1112-1124. DOI: 10.16337/j.1004-9037.2023.05.010
    [Abstract](679) [HTML](602) [PDF 5.15 M](1029)
    Abstract:
    Existing single-image super-resolution reconstruction algorithms mostly pursue the peak signal-to-noise ratio (PSNR) and pay little attention to image texture details during feature extraction, resulting in poor subjective perception of reconstructed images. To solve this problem, this paper proposes a single-image super-resolution reconstruction algorithm based on convolutional neural network gradient and texture compensation. Specifically, three branches are designed for structure feature extraction, texture detail feature extraction and gradient compensation, and a proposed fusion module then fuses the structure and texture detail features. To prevent the loss of texture information during reconstruction, a texture detail feature extraction module compensates the texture detail information of the image and enhances the texture retention ability of the network. At the same time, the gradient information extracted by the gradient compensation module is used to enhance the structure information. In addition, a deep feature extraction structure combining channel attention and spatial attention screens and enhances the information in deep features. Finally, a second-order residual block fuses the structure and texture features, so that the feature information of the reconstructed image is more complete. The effectiveness and superiority of the proposed method are verified by comparative experiments.
    28  Facial Image Super-Resolution Reconstruction Method with Identity Preserving
    Tian Xu Diao Hongjun Ling Xinghong
    2023, 38(2):350-363. DOI: 10.16337/j.1004-9037.2023.02.011
    [Abstract](715) [HTML](812) [PDF 2.42 M](1666)
    Abstract:
    Low resolution is an important factor that affects the accuracy of face recognition. To overcome the limitation of low-resolution facial images on face recognition, one effective solution is to reconstruct low-resolution images with super-resolution methods and then identify the generated facial images. However, existing super-resolution methods typically fail to consider facial identity preservation during reconstruction, which directly results in poor face recognition performance on reconstructed images. To address this issue, this paper proposes a face super-resolution reconstruction method with identity preserving, called IPNet, which simultaneously improves the quality of low-resolution facial images and preserves the identity of reconstructed images. IPNet consists of a semantic segmentation network and a face generator. The semantic segmentation network is introduced to extract a low-dimensional latent code and multi-resolution spatial features, which then guide the face generator to output super-resolution images similar to the authentic images. Furthermore, we introduce a face recognition network to integrate face identity information into the super-resolution model, keeping the identity of reconstructed facial images consistent with the original images. Experimental results show that IPNet achieves better results than other comparison methods in terms of both super-resolution image quality and identity preservation, demonstrating the effectiveness of the proposed method.
    29  Low-Quality Image Enhancement Based on Distance Weighted Color Cast Estimation
    CAO Siying ZHANG Xuan PU Tian PENG Zhenming
    2023, 38(1):141-149. DOI: 10.16337/j.1004-9037.2023.01.012
    [Abstract](833) [HTML](602) [PDF 2.24 M](1840)
    Abstract:
    Low-quality images captured under harsh atmospheric conditions such as colored fog, smoke and dust are characterized by low visibility and color cast, bringing difficulties to human observation and computer vision applications. Current enhancement algorithms for such images usually ignore the influence of the distance from the scene to the camera on the color cast. To better restore color while enhancing visibility, a relationship model between visibility reduction, color cast and distance, together with its solution method, is proposed. First, the distance is estimated from the local brightness of the image, and the color cast matrix of the image is estimated from the distance. Then, the image with restored visibility and color is obtained by solving the degradation model. Finally, the restored image is fused by distance weighting with a contrast limited adaptive histogram equalization (CLAHE) enhanced image for further detail enhancement. Experiments show that, compared with similar methods, the proposed method achieves high image quality evaluation indexes and significantly better color recovery results.
    30  MAFDNet: A New Method of Adaptive Image Classification in Complex Environments
    YE Jihua LI Xin CHEN Jin JIANG Aiwen HUA Zhizhang WAN Wentao
    2023, 38(6):1392-1405. DOI: 10.16337/j.1004-9037.2023.06.014
    [Abstract](518) [HTML](402) [PDF 2.65 M](958)
    Abstract:
    In complex environments, difficult and simple samples often coexist. Existing classification methods are mainly designed for difficult samples, so the constructed networks waste computing resources when classifying simple ones, while network pruning and weight quantization cannot balance accuracy and storage cost. To use computing resources more efficiently while maintaining accuracy, this paper focuses on the spatial redundancy of input samples and proposes MAFDNet, an adaptive image classification network for complex environments, which introduces confidence as the criterion for judging classification correctness and puts forward an adaptive loss function composed of content loss, fusion loss and classification loss. MAFDNet consists of three subnets. Input images are first sent to the low-resolution subnet, which efficiently extracts low-resolution features; samples classified with high confidence exit the network early, while low-confidence samples proceed in turn to subnets of higher resolution, the highest-resolution subnet being able to identify difficult samples. MAFDNet thus combines resolution adaptivity and depth adaptivity. Experiments show that, under the same computing resource conditions, the top-1 accuracy of MAFDNet is improved on the CIFAR-10, CIFAR-100 and ImageNet datasets.
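The confidence-gated early-exit idea can be sketched with toy stand-ins for the subnets (the thresholds, "subnets" and samples below are invented for illustration; MAFDNet's actual subnets are convolutional networks at increasing resolution):

```python
def classify_adaptive(sample, subnets, threshold=0.9):
    """Run progressively heavier subnets; exit as soon as the softmax
    confidence (max probability) clears the threshold."""
    probs = None
    for depth, subnet in enumerate(subnets):
        probs = subnet(sample)
        if max(probs) >= threshold:
            return probs.index(max(probs)), depth   # early exit
    return probs.index(max(probs)), len(subnets) - 1

# Toy subnets: the cheap one is confident on "easy", unsure otherwise.
cheap = lambda s: [0.95, 0.05] if s == "easy" else [0.55, 0.45]
heavy = lambda s: [0.10, 0.90]
label, depth = classify_adaptive("easy", [cheap, heavy])
```

Easy samples stop at depth 0 and never pay for the heavy subnet; hard samples fall through to it, which is where the compute savings come from.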
    31  Hyperspectral Image Fusion via Deep Unfolding and Dual-stream Networks
    LIU Cong YAO Jiahao
    2023, 38(6):1406-1421. DOI: 10.16337/j.1004-9037.2023.06.015
    [Abstract](768) [HTML](503) [PDF 3.02 M](1175)
    Abstract:
    Hyperspectral image fusion algorithms based on deep learning typically stack multiple convolutional layers to learn mapping relationships, which suffer from the problems of not fully utilizing the characteristics of the task and lack of interpretability. To address these problems, this paper proposes a deep network combining deep unfolding and dual-stream networks. Firstly, an image fusion model is established using convolutional sparse coding, which maps low-resolution hyperspectral images (LR-HSI) and high-resolution multispectral images (HR-MSI) into a low-dimensional subspace. In the design of the fusion model, we consider the common information of LR-HSI and HR-MSI as well as the unique information of LR-HSI, and add HR-MSI to the model as auxiliary information. Next, the fusion model is unfolded into a learnable interpretable deep network. Finally, the dual-stream network is used to get more accurate high-resolution hyperspectral images (HR-HSI). Experiments prove that the network obtains excellent results in the hyperspectral image fusion task.
    32  Noise Estimation Based on Combined Spatial and Spectral Information for Hyperspectral Image
    ZHANG Qinming HUANG Danfei LIU Zhiying ZHONG Aiqi
    2023, 38(1):186-192. DOI: 10.16337/j.1004-9037.2023.01.016
    [Abstract](994) [HTML](451) [PDF 3.36 M](2018)
    Abstract:
    Obtaining accurate noise estimates in texture-rich hyperspectral images is difficult. A spectral decorrelation method based on the spatial regularity and spectral correlation of hyperspectral images is described in this paper. Homogeneous region division is a key step in many noise estimation methods, and precise homogeneous region division can effectively improve the accuracy of noise estimation. To this end, a simple linear iterative clustering algorithm is combined with spectral-spatial similarity to segment hyperspectral images into locally structured similar image blocks that maintain homogeneous features. Spectral information divergence and spectral angle are combined as the spectral distance measurement to improve discrimination between spectra. Spectral correlations are removed within homogeneous regions by multiple linear regression to obtain the noise levels of the residual images. Various degrees of noise are added to simulated images of varying ground complexity, and the effectiveness and stability of the method are verified by comparison with a variety of methods. Finally, the proposed method is successfully applied to noise level estimation on the Urban data, accurately identifying bands heavily polluted by noise.
    33  Image Caption Generation Model Based on Graph Neural Network and Guidance Vector
    TONG Guoxiang LI Yueyang
    2023, 38(1):209-219. DOI: 10.16337/j.1004-9037.2023.01.018
    [Abstract](943) [HTML](569) [PDF 3.09 M](1852)
    Abstract:
    In recent years, deep learning has shown its advantages in image caption research. In deep learning models, the relationships between objects in an image play an important role in image representation. To better detect visual relationships in images, an image caption generation model (YOLOv4-GCN-GRU, YGG) is constructed based on a graph neural network and a guidance vector. The model builds a graph from the spatial and semantic information of the objects detected in the image, and uses a graph convolutional network (GCN) as an encoder to represent each region of the graph. During decoding, an additional guidance neural network is trained to generate a guidance vector that assists the decoder in automatically generating sentences. Comparative experiments on the MSCOCO image dataset show that the YGG model performs better, with the CIDEr-D score improved from 138.9% to 142.1%.
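The GCN encoding step amounts to propagating each node's features to its graph neighbors. A deliberately simplified sketch — mean aggregation with self-loops, omitting the learned weight matrix and degree normalization of a real GCN layer — on a made-up three-node graph:

```python
def gcn_layer(adj, feats):
    """One simplified graph-convolution step: replace each node's
    features with the mean over itself and its neighbors."""
    n = len(adj)
    out = []
    for i in range(n):
        neigh = [j for j in range(n) if adj[i][j] or j == i]
        out.append([sum(feats[j][k] for j in neigh) / len(neigh)
                    for k in range(len(feats[0]))])
    return out

# Two connected regions (nodes 0 and 1) and one isolated region (node 2).
adj = [[0, 1, 0],
       [1, 0, 0],
       [0, 0, 0]]
feats = [[1.0], [3.0], [5.0]]
smoothed = gcn_layer(adj, feats)   # [[2.0], [2.0], [5.0]]
```

Connected regions exchange information (nodes 0 and 1 converge toward each other) while the isolated region keeps its own feature, which is how relational structure enters the region representations.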
    34  ValidFlow: Unsupervised Image Defect Detection Based on Normalizing Flows
    ZHANG Lanyao CHEN Xiaoling ZHANG Damin CEN Yigang ZHANG Linna HUANG Yansen
    2023, 38(6):1445-1457. DOI: 10.16337/j.1004-9037.2023.06.018
    [Abstract](885) [HTML](482) [PDF 2.10 M](1348)
    Abstract:
    The CS-Flow method based on normalizing flows has achieved good results in the field of defect detection, but its way of repeatedly stacking a single type of coupling block increases the complexity of the network. Therefore, we propose ValidFlow, a network stacked from two types of coupling blocks: feature advection flow (FA flow) and feature blending flow (FB flow). In the subnetwork of FA flow, the shortcut branch of up- and down-sampling is removed and depthwise separable convolution is introduced. The subnetworks within FB flow are fused across three scales. This allows ValidFlow to reduce the number of parameters while keeping the information well mixed. Compared with existing methods on the MVTec AD, MTD and DAGM datasets, the average AUROC of ValidFlow over the 15 categories of MVTec AD is 99.2%, reaching 100% in four categories. On the MTD dataset, the AUROC achieves 99.6%. At the same time, compared with CS-Flow, ValidFlow has 207.61 M fewer parameters and its inference speed is 22 frame/s higher. On the DAGM dataset, the average AUROC over the 10 categories is 99.0%, which is very close to supervised methods in terms of performance.
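AUROC, the metric quoted throughout the abstract above, has a simple rank-statistic form: the probability that a randomly chosen anomalous sample scores higher than a randomly chosen normal one. A minimal sketch with invented scores (not data from the paper):

```python
def auroc(scores, labels):
    """Area under the ROC curve via the rank statistic: fraction of
    (positive, negative) pairs ranked correctly, ties counting 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy anomaly scores: defects (label 1) should score above normals (0).
scores = [0.9, 0.35, 0.4, 0.3]
labels = [1, 1, 0, 0]
area = auroc(scores, labels)   # 0.75: one positive below one negative
```

An AUROC of 99.2% therefore means almost every defective sample is scored above almost every defect-free one, independent of any fixed threshold.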
    35  Effect of Image Enhancement on Semantic Segmentation of Low-Light Scene
    Ai Yufeng Guo Jichang An Guanhua Zhang Yi
    2023, 38(4):959-977. DOI: 10.16337/j.1004-9037.2023.04.018
    [Abstract](776) [HTML](747) [PDF 7.01 M](1224)
    Abstract:
    Images acquired in low-light environments suffer from low brightness, color distortion, loss of detail, low contrast, and other problems. To meet the needs of subjective visual experience, researchers often enhance such images, but the impact of image enhancement on the performance of machine vision applications has not been systematically researched. In this paper, we first summarize typical low-light image enhancement methods and semantic segmentation methods. Next, taking a machine vision application (i.e., semantic segmentation) as an example, we select low-light scenes to investigate the effect of image enhancement methods on semantic segmentation performance. The experimental results show that enhancement processing can improve the visual effect of images but may introduce noise. In addition, image enhancement methods and semantic segmentation methods do not concentrate on exactly the same focus and features. Therefore, image enhancement does not contribute significantly to the performance of semantic segmentation in low-light scenes, and can even bring negative effects.
    36  Parallel Computing of High-Speed DIC Under Large Deformation Field
    Chen Houchuang Ma Kun Xue Yuxuan Meng Zhi
    2023, 38(4):978-985. DOI: 10.16337/j.1004-9037.2023.04.019
    [Abstract](527) [HTML](478) [PDF 2.04 M](923)
    Abstract:
    Due to the effect of image decorrelation under large deformation fields, digital image correlation (DIC) has not been able to perform parallel computation between images. To break through this bottleneck, this paper proposes an accelerated-KAZE (AKAZE)-based reference image update method, which completes the reference image update before the formal DIC calculation and provides independent data for parallel computing. A graphics processing unit (GPU) parallel computing architecture is constructed, which independently estimates all subsets and performs parallel computation both between images and between subsets. Finally, tensile tests are performed on nitrile butadiene rubber (NBR), and the results show that, compared with the traditional serial DIC calculation method, the proposed parallel method increases the computation speed by two orders of magnitude.
    37  Speech Steganalysis Method for Echo Hiding Based on Image of Cepstrum
    Tang Junhao Du Qingzhi Long Hua Shao Yubin Li Yimin
    2023, 38(6):1469-1481. DOI: 10.16337/j.1004-9037.2023.06.020
    [Abstract](785) [HTML](493) [PDF 2.88 M](1465)
    Abstract:
    After echo hiding, the cepstrum coefficients of a speech signal peak at the echo delay, so traditional echo hiding steganalysis mainly uses statistical characteristics of the cepstrum coefficients as steganalysis features. However, when the echo amplitude is low, the peak of the cepstrum coefficients of the steganographic signal is not obvious, and the detection performance of statistics-based methods is unsatisfactory. This paper combines cepstrum analysis with image recognition technology and proposes a steganalysis method for speech echo hiding based on cepstrum images. The speech signal is divided into frames and windowed for cepstrum calculation, and an image is then generated with time as the horizontal axis, cepstrum sequence points as the vertical axis, and cepstrum coefficient amplitude as the gray level. The generated cepstrum image is used as the steganalysis input, and a residual neural network serves as the classifier for echo hiding steganalysis. The experimental results show that, at low echo amplitude, the detection accuracy on three classical echo hiding algorithms reaches 98.2%, 98.6% and 96.1%, respectively, a great improvement over traditional echo hiding steganalysis methods.
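The cepstral peak the method builds its images from can be reproduced in a few lines. A toy sketch (naive DFT, unit-impulse carrier chosen so the effect is exact; real speech frames are noisier): an echo at delay d with amplitude a makes the real cepstrum peak at quefrency d.

```python
import cmath, math

def real_cepstrum(signal):
    """Real cepstrum: inverse DFT of the log magnitude spectrum
    (naive O(n^2) DFT, fine for a short demo signal)."""
    n = len(signal)
    spec = [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]
    logmag = [math.log(abs(x)) for x in spec]
    return [(sum(logmag[k] * cmath.exp(2j * cmath.pi * k * q / n)
                 for k in range(n)) / n).real for q in range(n)]

# Echo hiding adds a delayed, attenuated copy of the signal.
n, delay, amp = 64, 8, 0.5
signal = [0.0] * n
signal[0] = 1.0           # unit impulse as a flat-spectrum carrier
signal[delay] += amp      # embedded echo
ceps = real_cepstrum(signal)
peak = max(range(1, n // 2), key=lambda q: ceps[q])   # quefrency 8
```

At amplitude 0.5 the peak height is about a/2 = 0.25; at low amplitudes it shrinks proportionally, which is exactly why statistics-based detectors struggle and the paper turns to an image classifier instead.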
    38  Domain Generalization via Domain-Specific Decoding for Medical Image Segmentation
    Ye Huaize Zhou Ziqi Qi Lei Shi Yinghuan
    2023, 38(2):324-335. DOI: 10.16337/j.1004-9037.2023.02.009
    [Abstract](1424) [HTML](672) [PDF 3.11 M](2066)
    Abstract:
    Multi-source domain generalization (DG) aims to train a model that uses semantic information from different domains and generalizes to unknown domains. In medical images, the gap between domains is relatively large, and models suffer a performance drop in unknown domains. To solve this problem, this paper proposes a network structure that encodes images into features and decodes domain-specific features. The model uses a generic encoder, which learns domain-invariant features from all source domains, and several domain-specific decoders that reconstruct the original images to promote the ability to extract image features. These decoders also generate transferred images that engage in adversarial learning with source-domain images to improve the encoder's ability to learn invariant features. In addition, we introduce a special CutMix strategy that exchanges foreground information between images of different domains to augment the dataset, enhancing the generalization ability of the model and further improving the performance of our network structure. On two medical image segmentation tasks, extensive experimental data show that the proposed model performs excellently compared with existing advanced models, and a series of ablation experiments proves the effectiveness of the model.
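As a reference point for the special CutMix strategy above, here is the base CutMix operation it builds on — paste a rectangular region of one image into another and mix labels by area — in a toy pure-Python form (4×4 binary "images" invented for illustration; the paper's variant exchanges foreground regions across domains):

```python
def cutmix(img_a, img_b, top, left, h, w):
    """Paste an (h, w) patch of img_b into img_a at (top, left) and
    return the mixed image plus the pasted-area ratio used to mix labels."""
    out = [row[:] for row in img_a]
    for i in range(top, top + h):
        for j in range(left, left + w):
            out[i][j] = img_b[i][j]
    area_ratio = (h * w) / (len(img_a) * len(img_a[0]))
    return out, area_ratio

a = [[0] * 4 for _ in range(4)]   # all-zero "image"
b = [[1] * 4 for _ in range(4)]   # all-one "image"
mixed_img, lam = cutmix(a, b, 1, 1, 2, 2)   # lam = 4/16 = 0.25
```

The mixed label would be `(1 - lam) * label_a + lam * label_b`, so the network is trained on composites whose supervision matches the pixel mixture.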
    39  Realistic Medical Image Augmentation by Using Multi-loss Hybrid Adversarial Function and Heuristic Projection Algorithm
    WANG Jian CHENG Chufan CHEN Fang
    2023, 38(5):1104-1111. DOI: 10.16337/j.1004-9037.2023.05.009
    [Abstract](647) [HTML](630) [PDF 2.15 M](1147)
    Abstract:
    Early detection of COVID-19 allows medical intervention to improve the survival rate of patients. Using deep neural networks (DNN) to detect COVID-19 can improve the sensitivity and speed of interpreting chest CT for COVID-19 screening. However, applying DNNs in the medical field is known to be affected by limited samples and imperceptible noise perturbations. In this paper, we propose a multi-loss hybrid adversarial function (MLAdv) to search for effective adversarial attack samples capable of spoofing networks. These adversarial samples are then added to the training data to improve the robustness and generalization of the network against unanticipated noise perturbations. In particular, MLAdv not only implements a multiple-loss function including style, origin, and detail losses to craft medical adversarial samples with realistic-looking styles, but also uses a heuristic projection algorithm to produce noise with strong aggregation and interference. These samples are shown to have stronger anti-noise ability and attack transferability. Evaluation on a COVID-19 dataset shows that networks augmented with adversarial attacks from the MLAdv algorithm improve diagnosis accuracy by 4.75%. Therefore, the augmented network based on MLAdv adversarial attacks can improve model capability and is resistant to noise perturbations.
    40  Hyperspectral Image Denoising Based on Superpixel Block Clustering and Low-Rank Characteristics
    ZHANG Minghua WU Xuan SONG Wei MEI Haibin HE Qi SU Cheng
    2023, 38(3):549-564. DOI: 10.16337/j.1004-9037.2023.03.005
    [Abstract](763) [HTML](600) [PDF 10.70 M](1732)
    Abstract:
    Hyperspectral images are usually contaminated by Gaussian noise, impulse noise, dead lines and stripes, so denoising is an essential step. Existing denoising methods based on low-rank characteristics introduce spatial information to improve the noise reduction effect, but because they often use only local similarity or non-local self-similarity, they remove sparse noise with structural information in the spectral dimension poorly. Therefore, we propose a hyperspectral image denoising method based on superpixel block clustering and low-rank characteristics. The method realizes adaptive partition and clustering of blocks, making full use of non-local spatial self-similarity while retaining local details; experiments show that the same-object blocks composed of clustered superpixel blocks have good spatial-spectral dual low-rank attributes. Firstly, a superpixel segmentation method is applied to hyperspectral images, and the superpixel blocks are clustered to obtain same-object blocks. Secondly, a low-rank matrix restoration model is established and solved, and finally the denoised image is obtained. We conduct experiments on simulated and real data and compare with other methods based on low-rank characteristics. The results show that this method has better denoising performance for mixed noise, especially sparse noise with structural information.
    41  Segmentation of Al-Si Alloy Microscopic Image by Fusing Class Attention
    SHEN Tao JIN Kai SI Changkai ZHENG Jianfeng LIU Yingli
    2023, 38(3):574-585. DOI: 10.16337/j.1004-9037.2023.03.007
    [Abstract](504) [HTML](647) [PDF 4.29 M](1329)
    Abstract:
    An improved model of class attention network (CA-Net) incorporating a class attention block (CAB) is proposed to extract the primary silicon regions of the microscopic images of Al-Si alloys in this paper. The correlation information of each channel to each class is calculated from the feature map by class attention block, and the correlation information of different classes is fused to generate attention weights for correlating the weights of feature channels with their contributions to the class in the task, thus the representation of important features is enhanced and the interference of irrelevant features is suppressed. Experiments are conducted on the Al-Si alloy microscopic image dataset, and the proposed method obtains results of 94.82%, 90.16%, 94.54%, 98.80%, and 97.97% for Dice coefficient, Jaccard similarity, sensitivity, specificity, and segmentation accuracy, respectively. The proposed CA-Net can effectively improve the segmentation effect of the primary silicon region in Al-Si alloy microscopic images compared with CCNet, SPNet, TA-Net, and other methods.
    42  Early Mycosis Fungoides Recognition Based on Multimodal Image Fusion
    XIE Fengying ZHAO Danpei WANG Ke LIU Zhaorui WANG Yukun ZHANG Yilan LIU Jie
    2023, 38(4):792-801. DOI: 10.16337/j.1004-9037.2023.04.004
    [Abstract](974) [HTML](778) [PDF 1.57 M](1125)
    Abstract:
    Early mycosis fungoides (MFs) may present as erythematous scaly skin lesions, which are difficult to distinguish from benign inflammatory skin diseases such as psoriasis and chronic eczema. This paper presents a new method based on multimodal image fusion for early mycosis fungoides recognition. The method adopts the ResNet18 network to extract features of single-modality images based on dermoscopic images and clinical images, designs the cross-modal attention module to achieve feature fusion of two modal images, and uses the self-attention module to extract the key information and reduce redundant information in the fusion features, thereby improving the accuracy of intelligent identification of early mycosis fungoides. Experimental results show that the proposed intelligent diagnosis model outperforms the comparison algorithms. At the same time, the proposed intelligent model is applied to the actual clinical diagnosis of dermatologists. Through the changes in the average diagnostic accuracy of the experimental group and the control group, it is confirmed that the proposed intelligent diagnostic model can effectively improve the clinical diagnosis level.
    43  Deep Learning Based Salient Object Detection: A Survey
    SUN Han LIU Yishan LIN Yuhan
    2023, 38(1):21-50. DOI: 10.16337/j.1004-9037.2023.01.002
    [Abstract](2638) [HTML](1374) [PDF 5.89 M](5021)
    Abstract:
    Salient object detection has been widely used in computer vision tasks such as image understanding, semantic segmentation, and object tracking by simulating the human visual system to find the most attractive targets for visual attention. With the rapid development of deep learning technology, salient object detection research has made great breakthroughs. This paper presents a comprehensive and systematic survey of salient object detection based on RGB images, RGB-D/T (Depth/Thermal) images, and light field images in the past five years. Firstly, the task characteristics and research difficulties of the three research branches are analyzed. Then the research technical route of each branch is expounded and the advantages and disadvantages are analyzed. At the same time, the mainstream datasets and common performance evaluation indexes of three kinds of research branches are introduced. Finally, possible future research trends are prospected.
    44  Infrared Ship Target Segmentation Based on Adversarial Domain Adaptation
    Gao Zihang Liu Zhaoying Zhang Ting Li Yujian
    2023, 38(3):598-607. DOI: 10.16337/j.1004-9037.2023.03.009
    [Abstract](941) [HTML](394) [PDF 2.15 M](1075)
    Abstract:
    To improve the segmentation accuracy of infrared ship targets, we present an adversarial domain adaptation network for infrared ship target segmentation (ISADA), where labeled visible ship images are used as the source domain and unlabeled infrared ship images as the target domain. To address the style difference between the two domains, we preprocess the source-domain visible images with graying and whitening in turn to convert them into images with the style of the target domain. For the infrared images in the target domain, we optimize them with a denoising network. Furthermore, to overcome the limited receptive field of the discriminative network, we design a discriminative network based on atrous convolution. Finally, to counter the low confidence of the target-domain prediction images, their information entropy is added to the adversarial loss. The experimental results on datasets composed of visible and infrared ship images are superior to state-of-the-art methods, demonstrating the effectiveness of the proposed method.
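One plausible reading of the graying-and-whitening preprocessing (the abstract does not spell out the exact whitening used, so this is an assumption) is luminance-weighted grayscale conversion followed by per-image standardization to zero mean and unit variance, which strips source-domain color and contrast style:

```python
import math

def gray_whiten(rgb_img):
    """Convert an RGB image to gray (ITU-R BT.601 luminance weights),
    then whiten it to zero mean and unit variance."""
    gray = [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row]
            for row in rgb_img]
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    std = math.sqrt(sum((p - mean) ** 2 for p in pixels) / len(pixels)) or 1.0
    return [[(p - mean) / std for p in row] for row in gray]

# Toy 2x2 RGB image with made-up colors.
img = [[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
       [(0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]]
white = gray_whiten(img)
```

After this step every source image has the same first- and second-order statistics, so the discriminator sees structure rather than visible-spectrum style.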
    45  Multi-scale Object Detection Based on Non-local Feature Fusion
    MA Qian ZENG Kai WU Jiawen SHEN Tao
    2023, 38(2):364-374. DOI: 10.16337/j.1004-9037.2023.02.012
    [Abstract](735) [HTML](495) [PDF 3.56 M](1748)
    Abstract:
    The fusion methods used by existing multi-scale object detection models are insufficient in the face of scale variation and occlusion, and fail to capture long-distance dependencies. To address this, a channel feature fusion aggregation module and a non-local feature interaction module are designed to learn the correlation between different channel features and capture the long-distance dependence between feature maps. In addition, current detection architectures are based on a single-pyramid detection structure, which suffers from information loss. In this paper, a double-pyramid structure is designed, and the proposed fusion method is combined with it to supplement fused feature information while preserving the original feature information. Experimental results on the public datasets KITTI and PASCAL VOC show that the proposed method achieves higher detection accuracy than other advanced work, proving its effectiveness in the object detection task.
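The non-local interaction idea — every position attends to every other position in one step — can be sketched in its simplest form. This is a stripped-down version (raw dot-product affinities with softmax weights; the learned embedding projections of a full non-local block are omitted), on made-up features:

```python
import math

def non_local(feats):
    """Non-local interaction: update each position with a softmax-
    weighted sum over ALL positions, so long-distance dependencies
    are captured in a single step."""
    def affinity(a, b):
        return sum(x * y for x, y in zip(a, b))
    out = []
    for xi in feats:
        weights = [math.exp(affinity(xi, xj)) for xj in feats]
        z = sum(weights)
        out.append([sum(w / z * xj[k] for w, xj in zip(weights, feats))
                    for k in range(len(xi))])
    return out

# Three positions; each output mixes information from all of them.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out_feats = non_local(feats)
```

Unlike a convolution, whose receptive field grows only with depth, one such layer already links the farthest-apart positions, which is what helps under occlusion and large scale variation.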
    46  SiamBM: Siamese Object Tracking Network for Better Matching
    Hu Zhaohua Liu Haonan Lin Xiao
    2023, 38(5):1079-1091. DOI: 10.16337/j.1004-9037.2023.05.007
    [Abstract](523) [HTML](847) [PDF 4.57 M](1077)
    Abstract:
    Object tracking algorithms based on Siamese networks usually adopt simple cross-correlation matching, but this matching introduces much irrelevant information and weakens the response of the target region. Although anchor-free Siamese tracking networks avoid tuning anchor parameters, they cannot adapt well to target scale changes due to the loss of prior information. Aiming at the above problems, this paper proposes SiamBM, a matching-enhanced object tracking algorithm based on Siamese networks. By encoding the bounding box coordinate information of the target, effective guidance information is provided for the tracking model. The discriminative ability of the tracking model is further improved by means of depthwise separable cross-correlation and cascaded pixel-matching cross-correlation, and multi-scale cross-correlation is adopted to enhance the scale adaptability of the tracking model. On the OTB100 dataset, the success rate and precision of SiamBM reach 0.684 and 0.906, respectively, increases of 5.2% and 4.2% over the benchmark model. The experimental results show that, compared with current mainstream trackers, SiamBM achieves quite competitive results and superior performance on various dataset metrics.
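In depthwise separable cross-correlation, the template feature map slides over the search feature map channel by channel, producing one response map per channel. A minimal single-channel 2-D sketch in pure Python (the "valid" sliding scheme and toy shapes are illustrative assumptions):

```python
def xcorr2d(search, template):
    """'Valid' 2-D cross-correlation of one channel: slide the template
    over the search region and record the dot product at each offset."""
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    out = []
    for i in range(sh - th + 1):
        row = []
        for j in range(sw - tw + 1):
            row.append(sum(search[i + di][j + dj] * template[di][dj]
                           for di in range(th) for dj in range(tw)))
        out.append(row)
    return out

def depthwise_xcorr(search_chs, template_chs):
    """Depthwise variant: correlate each channel independently,
    yielding one response map per channel."""
    return [xcorr2d(s, t) for s, t in zip(search_chs, template_chs)]

search = [[[0, 0, 0], [0, 1, 0], [0, 0, 0]]]   # 1 channel, 3x3
template = [[[1, 0], [0, 0]]]                  # 1 channel, 2x2
print(depthwise_xcorr(search, template))       # -> [[[0, 0], [0, 1]]]
```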
    47  Vortex Detection Based on Improved Anchor-Free Object Detection Algorithm
    Xuan Yang Lyu Hongqiang An Wei Liu Xuejun
    2023, 38(1):150-161. DOI: 10.16337/j.1004-9037.2023.01.013
    [Abstract](994) [HTML](617) [PDF 2.73 M](2004)
    Abstract:
    Vortices play a crucial role in the formation and maintenance of various flow structures in fluid motion, and their identification and detection help to understand flow laws. Traditional vortex detection methods have many shortcomings, such as inexact definitions, heavy dependence on empirical thresholds and poor generalization, which make vortex detection challenging. In this paper, a vortex detection model based on an object detection algorithm is proposed from the perspective of computer vision. Aiming at the problem that the original object detection model yields unsatisfactory accuracy on slender vortices with extreme aspect ratios, this paper analyzes the data characteristics of the two different types of vortices, and proposes a feature adaptive module based on deformable convolutional networks (DCNs) and a slender-sample mining method based on an improved loss function. The cylindrical wake vortex and submarine tail vortex datasets are used to verify the proposed model. Experimental results show that the improved model increases detection accuracy significantly, especially for slender vortices, effectively balancing performance across vortex types.
    48  Improved Face Detection Algorithm Based on YOLOv3
    HU Yifan QIN Ling YANG Xiaojian
    2023, 38(5):1092-1103. DOI: 10.16337/j.1004-9037.2023.05.008
    [Abstract](614) [HTML](687) [PDF 2.76 M](1092)
    Abstract:
    Aiming at the low accuracy of face detection caused by high similarity between background and face and the small scale of face targets, an improved face detection algorithm based on YOLOv3 is proposed. Firstly, a K-means clustering algorithm based on a genetic algorithm is used to mitigate the influence of random initialization in the original algorithm and to generate anchor boxes better matched to the target sizes. Secondly, a lightweight network replaces the original feature extraction network to improve face detection speed. Finally, a bounding box regression loss replaces the YOLOv3 coordinate loss function, and the confidence loss function is improved to speed up training convergence and improve result accuracy. Both the accuracy and speed of the proposed face detection algorithm are improved on the WIDER FACE dataset.
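Anchor clustering for YOLO-style detectors typically runs k-means on box shapes with distance 1 − IoU, computed from widths and heights only. A minimal sketch of the assignment/update loop (pure Python; the paper's genetic-algorithm initialization is not reproduced, a fixed initial choice stands in for it):

```python
def iou_wh(a, b):
    """IoU of two boxes given only (width, height), both anchored at the origin."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union

def kmeans_anchors(boxes, centers, rounds=10):
    """k-means on box shapes with distance = 1 - IoU.
    boxes, centers: lists of (w, h) tuples."""
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for b in boxes:
            # assign each box to the anchor it overlaps most
            best = max(range(len(centers)), key=lambda i: iou_wh(b, centers[i]))
            clusters[best].append(b)
        # update each anchor to the mean shape of its cluster
        centers = [(sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c))
                   if c else centers[i] for i, c in enumerate(clusters)]
    return centers

boxes = [(10, 12), (11, 11), (50, 60), (52, 58)]
print(kmeans_anchors(boxes, centers=[(10, 10), (50, 50)]))
# -> [(10.5, 11.5), (51.0, 59.0)]
```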
    49  Unsupervised Person Re-identification Based on Camera-Aware Distance Matrix
    BAI Menglin ZHOU Fei SHU Haofeng
    2023, 38(5):1069-1078. DOI: 10.16337/j.1004-9037.2023.05.006
    [Abstract](544) [HTML](405) [PDF 1.53 M](970)
    Abstract:
    Cross-scene and cross-device shooting greatly increases the amount of pedestrian data. However, due to varying postures and partial occlusion of pedestrians, it is difficult to avoid introducing sample noise. During clustering, false pseudo-labels are easily generated, resulting in label noise and affecting model optimization. In order to reduce the influence of noise, a camera-aware distance matrix is applied to combat the sample noise caused by camera offset, and a noise-robust dynamic symmetric contrastive loss is used to reduce label noise. Specifically, the distance matrix that measures the similarity of pedestrian features is modified before clustering: the camera-aware distance matrix enhances the accuracy of intra-class distance measurement, reducing the negative impact of different viewpoints on the clustering results. Combined with noise-label learning, a robust loss is designed, a dynamic symmetric contrastive loss function is proposed, and joint-loss training is used to continuously refine the pseudo-labels. Experiments on the DukeMTMC-reID and Market-1501 datasets verify the effectiveness of the proposed method.
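One common way to build a camera-aware distance matrix (the abstract does not give the paper's exact formula, so this is an illustrative assumption) is to subtract the mean distance of each camera pair, removing the systematic offset that different cameras introduce before clustering:

```python
def camera_aware_distance(dist, cams):
    """Subtract the per-camera-pair mean from a pairwise distance matrix,
    removing systematic cross-camera offsets before clustering.
    dist: N x N distance matrix; cams: camera id of each sample."""
    n = len(dist)
    # accumulate the mean distance for every (camera, camera) pair
    sums, counts = {}, {}
    for i in range(n):
        for j in range(n):
            key = (cams[i], cams[j])
            sums[key] = sums.get(key, 0.0) + dist[i][j]
            counts[key] = counts.get(key, 0) + 1
    means = {k: sums[k] / counts[k] for k in sums}
    return [[dist[i][j] - means[(cams[i], cams[j])] for j in range(n)]
            for i in range(n)]

# Samples 0 and 1 come from camera 0; sample 2 from camera 1.
dist = [[0.0, 0.2, 1.0],
        [0.2, 0.0, 1.1],
        [1.0, 1.1, 0.0]]
print(camera_aware_distance(dist, cams=[0, 0, 1]))
```

After correction, cross-camera pairs of the same identity are no longer penalized by the constant offset between their cameras.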
    50  Unsupervised Learning Pedestrian Re-identification Based on Localized Instance Matching
    WU Haili Zhang Yueqin PANG Junqi
    2023, 38(4):947-958. DOI: 10.16337/j.1004-9037.2023.04.017
    [Abstract](611) [HTML](694) [PDF 2.44 M](1036)
    Abstract:
    Unsupervised domain adaptation (UDA) methods leverage global feature distribution matching to transfer knowledge from a source domain to a target domain, while ignoring fine-grained local instance information. An unsupervised person re-identification method based on two-tiered domain adaptation (TTDA) is proposed, in which the omni-scale network (OSNet) is selected as the backbone, and global feature distribution matching and localized instance matching between source and target domains are performed jointly in an end-to-end deep learning framework. In order to effectively mine transferable knowledge from associations of different pedestrian IDs between source and target domains, cross-domain adaptability is improved with a knowledge selection mechanism. Experimental results on multiple large-scale public datasets show that, compared with other state-of-the-art methods, the proposed method achieves significant improvements in mean average precision (mAP) and top-k hit rate for unsupervised cross-domain person re-identification.
    51  Person Re-identification Method Based on Improved Transformer Encoder and Feature Fusion
    ZHAO Qian XUE Chaochen ZHAO Yan
    2023, 38(2):375-385. DOI: 10.16337/j.1004-9037.2023.02.013
    [Abstract](849) [HTML](909) [PDF 2.69 M](1899)
    Abstract:
    In order to solve the low accuracy of Transformer encoders in person re-identification caused by the loss of image-patch position information and insufficient expression of local person features, an improved Transformer encoder and feature fusion algorithm for person re-identification is proposed. The algorithm uses relative position encoding to address the loss of relative position information of image patches during the attention operation, so that the network can focus on the semantic features of the patches, enhancing its ability to extract pedestrian features. Secondly, a local patch attention module is embedded into the Transformer network to apply weighted enhancement to key local features and highlight the salient features of the person region. Finally, global and local features are fused to achieve complementary advantages and improve the recognition ability of the model. In the training stage, Softmax loss and triplet loss are used to jointly optimize the network. The proposed algorithm is evaluated on the mainstream Market1501 and DukeMTMC-reID datasets: Rank-1 accuracy reaches 97.5% and 93.5%, and mean average precision (mAP) reaches 92.3% and 83.1%, respectively. The experimental results show that the improved Transformer encoder and feature fusion algorithm can effectively improve person re-identification accuracy.
    52  Video-Based Person Re-identification Algorithm Based on Feature Block Reconstruction
    WANG Jinhua ZHOU Fei BAI Menglin SHU Haofeng
    2023, 38(3):565-573. DOI: 10.16337/j.1004-9037.2023.03.006
    [Abstract](529) [HTML](480) [PDF 1.48 M](1192)
    Abstract:
    Video-based person re-identification (Re-ID) matches a video tracklet with clipped video frames, so as to recognize the same pedestrian under different cameras. However, due to the complexity of real scenes, collected pedestrian trajectories suffer severe appearance loss and misalignment, and traditional 3D convolution is no longer suitable for the video person re-identification task. Therefore, a 3D feature block reconstruction model (3D-FBRM) is proposed, which uses the first feature map to align subsequent feature maps at the level of horizontal blocks. In order to fully mine the temporal information of the trajectory while ensuring feature quality, a 3D convolution kernel is added after the FBRM and combined with existing 3D ConvNets. In addition, a coarse-to-fine feature block reconstruction network (CF-FBRNet) is introduced, which not only enables feature reconstruction at two different spatial scales, but also further reduces computational overhead. Experiments show that CF-FBRNet achieves state-of-the-art results on the MARS and DukeMTMC-VideoReID datasets.
    53  Facial Expression Recognition Under Complex Scenes Based on Multi-region Detection Network
    PAN Xinchen QIN Ling YANG Xiaojian
    2023, 38(6):1422-1433. DOI: 10.16337/j.1004-9037.2023.06.016
    [Abstract](748) [HTML](317) [PDF 1.86 M](1061)
    Abstract:
    Facial expressions are the most intuitive representation of human emotional states, and convolutional neural networks have shown excellent performance in facial expression recognition. However, occlusion and pose changes in complex scenes remain two major problems in automatic facial expression recognition, as they significantly change the appearance of faces and affect the final recognition results. Aiming at these problems, a facial expression recognition method based on dual attention and a multi-region detection network is proposed. Dual attention improves the feature extraction capability of the overall network, enabling it to focus on more detailed feature information. Multi-region detection adaptively captures the local regions important for expression recognition under occlusion and pose changes, and suppresses their negative effects. Finally, the effectiveness of the proposed method is verified on three public natural-scene facial expression datasets: AffectNet, RAF-DB and SFEW.
    54  Person Re-identification Based on Feature Pyramid Branch and Non-local Attention
    Sun Minghao Wang Hongyuan Wu Linyu Zhang Ji Zhou Qunying
    2023, 38(1):121-131. DOI: 10.16337/j.1004-9037.2023.01.010
    [Abstract](933) [HTML](801) [PDF 1.58 M](1923)
    Abstract:
    Attending to both the global contour and the local details of a person is very important for person re-identification. In order to extract more representative features, a person re-identification method based on feature pyramid branches and non-local attention modules is proposed to extract both global and local characterization features. Firstly, the method introduces a lightweight feature pyramid branch structure, extracts features from different network layers, and aggregates them into a two-way pyramid structure. Secondly, to further improve re-identification accuracy, non-local attention modules are used to extract global features, which not only capture global person information but also attend to local details, so that the final fused features are more representative. Finally, the features of different layers are fused, and a joint loss function strategy is used to train the network model, significantly improving the performance of the backbone network. Extensive experiments on four public person re-identification datasets, MSMT17, Market1501, DukeMTMC-reID and PersonX, show that the proposed method is competitive with advanced person re-identification methods.
    55  Dangerous Behavior Recognition Based on CNN-LSTM Dual-Stream Fusion Network
    GAO Zhijun GU Qiaoyu CHEN Ping HAN Zhonghua
    2023, 38(1):132-140. DOI: 10.16337/j.1004-9037.2023.01.011
    [Abstract](1449) [HTML](928) [PDF 1.25 M](1733)
    Abstract:
    To solve the problem of insufficient spatial and temporal features in dangerous behavior recognition, this paper improves the traditional dual-stream convolution model and proposes a new CNN-LSTM dual-stream dangerous behavior recognition model. In this model, a CNN network and an LSTM network are connected in parallel. The CNN network serves as the spatial stream: the spatial motion posture information of the human skeleton is divided into static and dynamic parts, and these features are fused as the output of the spatial stream. To increase the ability to extract temporal features of the human skeleton, an improved temporal sliding LSTM network is used in the temporal stream. Finally, the two branches are fused in time and space, and dangerous actions are classified by Softmax. Experimental results on the NTU RGB+D and Kinetics datasets show that the average cross-view (CV) accuracy of the improved model is 92.5% and the average cross-subject (CS) accuracy is 87.9%, superior to the model before improvement and to other methods. The model can effectively recognize dangerous human actions and discriminates well between ambiguous actions.
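Late fusion of two streams is often implemented by blending the per-class scores of each branch before the final Softmax decision. A minimal sketch (pure Python; equal stream weights are an illustrative assumption, not the paper's setting):

```python
import math

def softmax(scores):
    """Numerically stable Softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def fuse_streams(spatial_scores, temporal_scores, w=0.5):
    """Weighted late fusion of the spatial and temporal streams,
    followed by Softmax classification."""
    fused = [w * a + (1 - w) * b
             for a, b in zip(spatial_scores, temporal_scores)]
    return softmax(fused)

probs = fuse_streams([2.0, 0.5, 0.1], [1.5, 1.8, 0.2])
print(probs, probs.index(max(probs)))
```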
    56  GPU-Based Real-Time Imaging Algorithm for Long-Track SAR
    TAN Yunxin HUANG Haifeng LAI Tao DAN Qihong OU Pengfei
    2023, 38(6):1380-1391. DOI: 10.16337/j.1004-9037.2023.06.013
    [Abstract](850) [HTML](558) [PDF 2.67 M](1295)
    Abstract:
    To meet the fast imaging requirements of long-orbit ultra-high-resolution W-band synthetic aperture radar (SAR), this paper proposes a graphics processing unit (GPU)-based ω-K real-time imaging algorithm that adopts a parallel architecture with dual-stream multithreaded processing. The default stream processes data in the order dictated by the imaging physics: it first parallelizes range compensation, error correction, zero filling and other operations, then adopts a one-layer nested interpolation method; by maintaining the dependencies and synchronization between stages, this stream achieves a speedup of about 30. The blocking stream starts at the same time as the default stream, generates the parameters and functions required by the default stream, and stores them in video memory before they are needed, greatly reducing the running time of the algorithm. Meanwhile, by setting events on the default stream, the two streams execute synchronously in parallel. Experimental results show that the total speedup of the algorithm reaches about 13, and the relative errors of amplitude and phase are close to 0; the algorithm achieves good real-time and focusing performance while maintaining good imaging quality.
    57  Recent Advances in Visual Question Answering and Reasoning
    ZHANG Feifei ZHANG Jianqing QU Sijia ZHOU Wanting
    2023, 38(1):1-20. DOI: 10.16337/j.1004-9037.2023.01.001
    [Abstract](1744) [HTML](1362) [PDF 1.95 M](3325)
    Abstract:
    With the rapid development of social media and human-computer interaction, the volume of multimedia data such as video, images and text has grown tremendously, and researchers have focused their attention on multi-modal intelligence. As an essential and fundamental research topic in multi-modal intelligence and artificial intelligence, visual question answering and reasoning has produced scientific results successfully applied in human-computer interaction, intelligent medical care and autonomous driving. This paper provides a comprehensive overview of visual question answering and reasoning algorithms, classifying and analyzing the existing methods. Firstly, we introduce the definition of the visual question answering and reasoning task and briefly describe its main challenges. Then, we summarize existing methods built on attention mechanisms, graph networks, model pretraining, external knowledge and explainable reasoning mechanisms. After that, we introduce the common visual question answering and reasoning benchmarks and discuss the existing methods on these benchmarks in detail. Finally, we discuss future directions of the visual question answering and reasoning task.
    58  Recent Advances in Cross Modal Image Text Retrieval
    Zhang Feifei Ma Zewei Zhou Ling Meng Lingtao
    2023, 38(3):479-505. DOI: 10.16337/j.1004-9037.2023.03.001
    [Abstract](2281) [HTML](1760) [PDF 3.48 M](4116)
    Abstract:
    With the rapid development of Internet technology, the volume of different types of data, such as texts and images, has grown tremendously. Obtaining valuable information from such heterogeneous but semantically related multimodal data is particularly important. Cross-modal retrieval is an essential way to meet users' requirements for obtaining different kinds of information on the Internet, and can effectively handle multimodal data. In recent years, cross-modal retrieval has become a hot issue in both academia and industry. In this paper, we provide a comprehensive overview of the image-text cross-modal retrieval task, including definitions, challenges, and detailed discussions of existing methods. Specifically, we first divide the existing methods into three main categories: (1) traditional methods; (2) deep learning based methods; and (3) hashing based representation methods. Then, we introduce the commonly used cross-modal retrieval benchmarks and discuss the existing methods on these benchmarks in detail. Finally, future directions of the image-text cross-modal retrieval task are discussed.
    59  Generate Adversarial Depth Repair Under Structural Constraints
    Lu Qi Gong Xun
    2023, 38(5):1048-1057. DOI: 10.16337/j.1004-9037.2023.05.004
    [Abstract](605) [HTML](459) [PDF 2.89 M](973)
    Abstract:
    Unlike RGB images, pixels in depth images represent the distance from the acquisition device to points in the scene, and directly applying inpainting methods designed for natural images cannot effectively restore the scene structure of missing areas in depth images. This paper proposes a two-stage encoding-structure generative adversarial network to solve the depth image inpainting problem. Unlike standard generative adversarial network (GAN) models, the generator network here includes a depth generation module G1 and a depth repair module G2. G1 obtains a predicted depth map from the RGB image to replace the missing area of the depth image to be repaired, ensuring local structure consistency of the repaired area. G2 introduces the edge structure of the RGB image to ensure global structure consistency. The consistency of missing areas, which existing image inpainting methods do not consider, is handled by a structure consistency attention (SCA) module embedded in G2. The proposed depth image inpainting model is verified on several mainstream datasets, showing that the effect of the structural constraints and of the generator-discriminator combination is evident.
    60  Improved Lightweight Traffic Sign Detection Algorithm of YOLOv5
    Jia Zihao Wang Wenqing Liu Guangcan
    2023, 38(6):1434-1444. DOI: 10.16337/j.1004-9037.2023.06.017
    [Abstract](691) [HTML](437) [PDF 3.82 M](1144)
    Abstract:
    With the rapid development of science, technology and artificial intelligence, driverless technology is receiving increasing attention. Considering safety, and aiming at real-time detection of traffic signs during driving, the algorithm is improved on the basis of the YOLOv5 model, and a lightweight traffic sign detection algorithm is proposed. An attention mechanism is added to the feature fusion part of the model, helping the model highlight target features. Then a lightweight sub-pixel convolution layer is added in front of the detection layer to effectively improve the resolution of the detection feature map without increasing the amount of computation. Finally, the complete intersection over union (CIoU) loss function is improved, which speeds up network convergence and yields a better convergence result than before the improvement. The experimental results show that the accuracy of this model reaches 90.6%, 14.5% higher than the base network, and the detection speed reaches 70 frames/s, basically meeting the requirements of real-time, accurate traffic sign detection.
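The standard CIoU loss that the paper starts from combines IoU with a center-distance penalty and an aspect-ratio consistency term. A minimal sketch of that standard formulation (pure Python, axis-aligned boxes as (x1, y1, x2, y2); the paper's specific modification is not reproduced):

```python
import math

def ciou_loss(a, b):
    """Standard CIoU loss for boxes (x1, y1, x2, y2):
    loss = 1 - IoU + d^2/c^2 + alpha * v."""
    # IoU term
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area_a + area_b - inter)
    # squared center distance over squared enclosing-box diagonal
    d2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 \
       + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 \
       + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    # aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (math.atan((a[2] - a[0]) / (a[3] - a[1]))
                              - math.atan((b[2] - b[0]) / (b[3] - b[1]))) ** 2
    alpha = v / (1 - iou + v + 1e-9)
    return 1 - iou + d2 / c2 + alpha * v

print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes -> 0.0
print(ciou_loss((0, 0, 2, 2), (1, 1, 3, 3)))  # offset boxes -> positive loss
```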
    61  White Matter Fiber Tract Segmentation Method Based on T1-Weighted Imaging
    JIAO Ruike ZHANG Xiaofeng YE Chuyang
    2024, 39(4):863-873. DOI: 10.16337/j.1004-9037.2024.04.007
    [Abstract](738) [HTML](1131) [PDF 2.69 M](763)
    Abstract:
    White matter fiber tract segmentation methods provide crucial neural pathway reference information for brain connectivity analysis by identifying white matter tracts connecting distinct brain regions. Traditional segmentation methods predominantly depend on diffusion magnetic resonance imaging (dMRI), but the lengthy acquisition time of dMRI severely restricts its clinical applicability. To address this limitation, this paper introduces a white matter fiber tract segmentation approach based on T1-weighted imaging. The method leverages the structure tensor of T1-weighted images to infer potential fiber orientations, thereby enhancing the segmentation accuracy of white matter tracts. Moreover, the proposed method incorporates privileged information from dMRI during model training to guide the learning process, improving the performance of the segmentation model; the segmentation of challenging tracts improves significantly, with a 5% gain in Dice score for the left fornix (FX_left) and a 6% gain for the right fornix (FX_right). This approach mitigates the limitations of conducting neural pathway analysis in the absence of dMRI, broadening the application scope of neural pathway analysis.
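A structure tensor summarizes local gradient orientation, which is how fiber directions can be inferred from T1-weighted intensities. A minimal 2-D sketch (pure Python; central differences and no smoothing window are simplifying assumptions — in practice 3-D tensors with Gaussian smoothing would be used):

```python
def structure_tensor_2d(img, y, x):
    """2x2 structure tensor [[Ix*Ix, Ix*Iy], [Ix*Iy, Iy*Iy]] at (y, x),
    using central-difference image gradients."""
    ix = (img[y][x + 1] - img[y][x - 1]) / 2.0
    iy = (img[y + 1][x] - img[y - 1][x]) / 2.0
    return [[ix * ix, ix * iy], [ix * iy, iy * iy]]

# A vertical edge: intensity changes only along x, so all the tensor
# energy sits in the Ix*Ix entry; the implied structure runs along y.
img = [[0, 0, 1],
       [0, 0, 1],
       [0, 0, 1]]
print(structure_tensor_2d(img, 1, 1))  # -> [[0.25, 0.0], [0.0, 0.0]]
```

The eigenvector of the smallest eigenvalue points along the local structure, which is the orientation cue the segmentation model can exploit.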
    62  Multi-scale SAR Image Detection Algorithm for Ships Based on Improved YOLOv5
    Li Shenghui Li Xiaofei Song Zhanghan Wang Bixiang
    2024, 39(1):120-131. DOI: 10.16337/j.1004-9037.2024.01.011
    [Abstract](959) [HTML](650) [PDF 2.38 M](1243)
    Abstract:
    A multi-scale synthetic aperture radar (SAR) image ship detection algorithm based on improved YOLOv5 is proposed to address the large pixel-scale differences of ship targets in complex scenes and the missed detections caused by densely arrayed ships. In the neck network of YOLOv5, a bi-directional feature pyramid network (BiFPN) is adopted to enhance the multi-scale feature fusion ability of the network, and an enhanced channel-MLP (EC-MLP) module is constructed from depthwise separable convolution (DSC) and channel MLP in the bottom-up feature fusion branch to enrich semantic information and provide more sufficient ship-target context features. A global attention mechanism (GAM) is introduced so that the network extracts input features selectively and reduces information loss. In addition, the SIoU loss function is used to further improve the training convergence speed and detection accuracy of the network. Comparative experiments with eight other methods (Faster R-CNN, Libra R-CNN, FCOS, YOLOv5s, PP-YOLOv2, YOLOX-s, PP-YOLOE-s and YOLOv7-tiny) are conducted on the SSDD and HRSID datasets. The experimental results show that the AP50 of the improved algorithm reaches 96.7% on SSDD and 95.6% on HRSID, superior to the comparison methods.
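BiFPN combines features from different levels with learnable, fast-normalized weights. A minimal sketch of the fusion step (pure Python; two inputs and scalar "features" stand in for resized feature maps):

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """BiFPN-style weighted fusion: w_i' = relu(w_i) / (sum_j relu(w_j) + eps),
    output = sum_i w_i' * feature_i. Scalars stand in for feature maps."""
    relu = [max(0.0, w) for w in weights]
    z = sum(relu) + eps
    return sum((w / z) * f for w, f in zip(relu, features))

# Two feature levels with learned weights 2.0 and 1.0: the fused value
# lies between the inputs, pulled toward the higher-weighted one.
print(fast_normalized_fusion([0.8, 0.2], [2.0, 1.0]))
```

The ReLU keeps the weights non-negative and the epsilon avoids division by zero, which is why this normalization is cheaper and more stable than a softmax over levels.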
    63  Adaptive Transmissivity Correction Algorithm for Defogging Combining Image Texture Information
    SUN Jingrong CHEN Zhezhe WANG Jiankai SONG Shibin ZHAO Jing
    2024, 39(2):395-405. DOI: 10.16337/j.1004-9037.2024.02.012
    [Abstract](505) [HTML](540) [PDF 5.71 M](1030)
    Abstract:
    Image defogging algorithms are widely used in outdoor intelligent monitoring and traffic navigation; after defogging, image clarity improves, enhancing target recognition. The dark channel prior and its improved variants misestimate transmittance in bright gray areas such as the sky, and are prone to distortion and blurred image details, which affects image recognition in intelligent transportation. An adaptive transmittance defogging method is proposed to compensate the transmittance. Logarithmic transformation is used to obtain a logarithmic compensation operator that adjusts the transmittance in far depth-of-field areas. The confidence of the dark channel is calculated according to the richness of image information, and a texture compensation operator is constructed by combining image texture information, effectively reducing image distortion after defogging. Compared with other defogging algorithms, the proposed algorithm improves the average gradient, signal-to-noise ratio (SNR), information entropy and other objective indicators. Image quality is effectively improved, with good transmittance compensation in bright gray areas, clear and natural image details and moderate brightness.
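The baseline the paper corrects is the dark channel prior transmittance estimate, t(x) = 1 − ω · min over a patch and over channels of I_c/A_c. A minimal sketch of that baseline (pure Python; the paper's logarithmic and texture compensation operators are not reproduced):

```python
def dark_channel(img, y, x, r=1):
    """Minimum over a (2r+1)x(2r+1) patch and over RGB channels."""
    vals = []
    for i in range(max(0, y - r), min(len(img), y + r + 1)):
        for j in range(max(0, x - r), min(len(img[0]), x + r + 1)):
            vals.append(min(img[i][j]))
    return min(vals)

def transmission(img, atmo, y, x, omega=0.95):
    """Dark channel prior estimate: t = 1 - omega * dark(I / A)."""
    norm = [[[c / a for c, a in zip(px, atmo)] for px in row] for row in img]
    return 1.0 - omega * dark_channel(norm, y, x)

# 2x2 toy image, RGB in [0, 1]; atmospheric light near white.
img = [[(0.9, 0.9, 0.9), (0.8, 0.8, 0.8)],
       [(0.2, 0.3, 0.4), (0.9, 0.9, 0.9)]]
print(transmission(img, atmo=(1.0, 1.0, 1.0), y=1, x=0))
```

In bright gray regions the dark channel stays high everywhere, so this estimate collapses toward 1 − ω; that systematic error is what the logarithmic and texture compensation operators are designed to correct.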
    64  Image Captioning Method for Fusing Multi-temporal Dimensional Visual and Semantic Information
    CHEN Shanxue WANG Cheng
    2024, 39(4):922-932. DOI: 10.16337/j.1004-9037.2024.04.012
    [Abstract](635) [HTML](600) [PDF 1.01 M](693)
    Abstract:
    Traditional image captioning methods use only the visual and semantic information of the current time step to predict words, without considering the visual and semantic information of past time steps, which makes the model's output relatively homogeneous along the temporal dimension and the generated captions less accurate. To address this problem, an image captioning method fusing multi-temporal-dimension visual and semantic information is proposed, which effectively fuses visual and semantic information from past time steps and designs a gating mechanism to dynamically select between the two kinds of information. Experimental validation on the MSCOCO dataset shows that the method generates captions more accurately, with considerable improvements on all evaluation metrics compared with current state-of-the-art image captioning methods.
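A gating mechanism for dynamically selecting between two information sources typically computes a sigmoid gate from both inputs and blends them. A minimal sketch (pure Python, toy dimensions; the weights and the exact gate form are illustrative assumptions, not the paper's design):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_fusion(visual, semantic, w_v, w_s, bias):
    """Per-dimension gate g = sigmoid(w_v * v + w_s * s + b);
    output = g * visual + (1 - g) * semantic."""
    out = []
    for v, s in zip(visual, semantic):
        g = sigmoid(w_v * v + w_s * s + bias)
        out.append(g * v + (1 - g) * s)
    return out

fused = gated_fusion([1.0, -1.0], [0.0, 2.0], w_v=1.0, w_s=0.5, bias=0.0)
print(fused)  # each output lies between its visual and semantic inputs
```

Because the gate is input-dependent, the model can lean on visual evidence for some words and on semantic (language) context for others.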
    65  Image Inpainting Based on Perceptual Inference and External Spatial Prior Features
    WU Peng ZHANG Sunjie WANG Yongxiong CHEN Yuanfeng QIN Haiwang
    2024, 39(4):933-943. DOI: 10.16337/j.1004-9037.2024.04.013
    [Abstract](707) [HTML](749) [PDF 4.41 M](840)
    Abstract:
    Image inpainting based on deep learning has made remarkable progress. However, with large-area masks, the lack of reasonable prior guidance means repair results often exhibit artifacts and blurred textures. Therefore, we propose an image inpainting algorithm that combines prior features with image predictive filtering. It consists of two branches: an image filtering kernel prediction branch, and a feature inference and image filtering branch. Features are extracted from the decoder of the kernel prediction branch, and multi-scale external spatial feature fusion is used to reconstruct the mask-region features, which are passed to the other branch during decoding as prior features, providing richer semantic information for inpainting. Then, a spatial feature-aware inference block is introduced in the feature inference and image filtering branch, which filters out distracting features and captures informative long-distance image context for inference. Finally, the predicted filter kernels are applied to filter the image and eliminate artifacts. Comparisons with other inpainting networks on the CelebA and Places2 datasets demonstrate the superiority of the method in repair quality.
    66  Medical Image Segmentation Method with Integrated Self-attention
    ZHAO Fan ZHANG Xuedian
    2024, 39(5):1240-1250. DOI: 10.16337/j.1004-9037.2024.05.015
    [Abstract](1064) [HTML](889) [PDF 2.15 M](972)
    Abstract:
    Aiming at the limitations of the UNet architecture in capturing local features and preserving edge details in medical image segmentation, this paper presents an improved UNet algorithm integrating a self-attention mechanism. The proposed algorithm is based on the traditional encoder-decoder structure, incorporating a multi-scale convolution (MSC) block for multi-granularity feature extraction and a convolution mixer attention (CMA) block, which combines local-feature modeling by convolutional layers with global contextual modeling by self-attention layers. In segmentation tasks on the BUSI and DDTI datasets, extensive experiments verify the excellent segmentation ability of the model compared with classical network architectures. Additionally, statistical analysis and ablation studies further confirm the effectiveness of the MSC and CMA modules. This research provides an innovative approach to high-precision medical image segmentation, with theoretical and practical significance for improving the accuracy and efficiency of medical diagnosis.
    67  Blind Face Restoration Algorithm Based on Feature Fusion and Embedding
    HUO Zhiyong HU Shanlin
    2024, 39(3):609-616. DOI: 10.16337/j.1004-9037.2024.03.009
    [Abstract](724) [HTML](726) [PDF 2.70 M](946)
    Abstract:
    Blind face restoration aims to recover high-quality faces from unknown degradations; the ill-posedness of the problem often leaves restored images with missing local textures or mismatched facial components. Therefore, a blind face restoration algorithm based on feature fusion and embedding optimization is proposed. By extracting face prior features from the degraded inputs, using multi-head cross-attention for feature interaction fusion and global context modeling, embedding facial priors into the latent space of a pre-trained generative network, and optimizing against the loss functions, local textures lost or damaged by degradation are repaired, achieving a balance between realism and fidelity. Experiments on three real degraded datasets show that the method outperforms existing approaches in objective metrics and subjective quality, and ablation experiments validate the effectiveness of the algorithm.
    68  A Two-Step Adversarial Sample Detection Technique for SAR Image Classification
    WANG Jian ZHANG Sainan CHEN Fang
    2024, 39(1):106-119. DOI: 10.16337/j.1004-9037.2024.01.010
    [Abstract](552) [HTML](723) [PDF 5.51 M](1032)
    Abstract:
    Deep learning techniques have greatly improved the classification accuracy of synthetic aperture radar (SAR) image targets, but the security of SAR image classification systems is threatened by the inherent vulnerability of neural networks. In this paper, we analyze the aggressiveness of SAR adversarial samples and the difference between SAR adversarial samples and original samples in the frequency domain. Based on the analysis results, a two-step SAR adversarial sample detection technique is proposed to improve the security of SAR classification models. In the first step, adversarial sample detection is performed on the input image based on frequency-domain analysis to separate out adversarial samples. Then, the remaining images are fed into an adversarially trained model and a model without adversarial training to complete the second detection step. With this two-step method, adversarial samples can be detected with a success rate of no less than 95.73%, effectively improving the security of the SAR classification model.
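The two-step gating logic described above can be sketched as a pipeline in which the frequency-domain scorer and the two classifiers are passed in as callables. This is a schematic of the control flow only, with a disagreement test standing in for the paper's second-step criterion:

```python
def two_step_detect(samples, freq_score, threshold, robust_model, plain_model):
    """Step 1: screen out inputs whose frequency-domain statistic exceeds a
    threshold. Step 2: flag remaining inputs on which an adversarially trained
    model and a conventionally trained model disagree (illustrative criterion)."""
    adversarial, clean = [], []
    for s in samples:
        if freq_score(s) > threshold:            # step 1: frequency-domain screen
            adversarial.append(s)
        elif robust_model(s) != plain_model(s):  # step 2: model disagreement
            adversarial.append(s)
        else:
            clean.append(s)
    return adversarial, clean
```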
    69  Two Image Rectification Networks for Distorted and Warped Documents
    FENG Jin CHI Yue ZHOU Yatong HE Jingfei
    2024, 39(1):167-180. DOI: 10.16337/j.1004-9037.2024.01.015
    [Abstract](869) [HTML](806) [PDF 6.42 M](1205)
    Abstract:
    Due to the geometric distortion of the document paper, interference from the shooting scene, and perspective distortion caused by unfavorable shooting angles, the optical character recognition (OCR) quality of document photos taken by mobile devices is severely hampered. Two networks based on the auto-encoder are created to perform adaptive image correction and increase the accuracy of text recognition, serving as pre-processing for document images with folds and warping. First, we propose two types of residual blocks, dilated residual blocks and asymmetric convolutional residual blocks, and combine them with the auto-encoder to create an asymmetric dilated auto-encoder. In the meantime, we create a spatial pyramid auto-encoder by using spatial pyramid pooling instead of fully connected layers and implementing feature extraction with asymmetric convolutional residual blocks. Experimental results show that, compared with distorted images, the images corrected by the asymmetric dilated auto-encoder improve by 26.3%, 20.4% and 12.3% in OCR precision, OCR recall, and text similarity, respectively, while those corrected by the spatial pyramid auto-encoder improve by 27.7%, 22.0% and 15.5% on the same metrics. Compared with other image rectification networks such as RectiNet, the images corrected by these two auto-encoders perform much better on optical character recognition, and the two networks also show relatively clear advantages over existing networks in robustness and generalizability.
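Of the metrics reported above, text similarity is the simplest to reproduce: Python's standard library offers a matching-subsequence ratio that is commonly used for this kind of OCR-output-to-reference comparison (a generic stand-in, not necessarily the exact metric the paper uses):

```python
import difflib

def text_similarity(recognized, ground_truth):
    """Similarity in [0, 1] between OCR output and reference text,
    based on the total length of matching subsequences."""
    return difflib.SequenceMatcher(None, recognized, ground_truth).ratio()
```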
    70  Small Target Detection in UAV Aerial Images Based on High Resolution Feature Enhancement
    ZHOU Xuan GE Qi SHAO Wenze
    2024, 39(4):908-921. DOI: 10.16337/j.1004-9037.2024.04.011
    [Abstract](1090) [HTML](819) [PDF 5.34 M](896)
    Abstract:
    Aiming at the low detection accuracy caused by complex backgrounds and the dense distribution of small targets in unmanned aerial vehicle (UAV) aerial images, this paper proposes a small target detection algorithm based on high-resolution feature enhancement. Firstly, a high-resolution feature enhancement network is proposed, which enlarges the output feature map by reducing the number of down-sampling operations in the backbone. At the same time, bilinear interpolation is introduced to reduce the loss of feature information after up-sampling, thereby preserving more semantic and detailed features. Secondly, a spatial pyramid pooling-fast module combined with the cross stage partial structure is embedded in the backbone to enhance the fusion of local and global features and obtain a larger receptive field. Finally, the mosaic-mixup data augmentation method is used to increase the complexity of image backgrounds and improve the generalization ability of the model. Experimental results on the public dataset VisDrone 2019 show that, compared with other mainstream algorithms such as the "you only look once" (YOLO) series, the mean average precision of the proposed algorithm is significantly improved. The advantages of the proposed algorithm are verified in different scenarios, indicating that it is highly practical for dense small target detection in UAV aerial images.
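The bilinear interpolation used above for up-sampling can be sketched in plain Python. This toy version assumes an input grid of at least 2×2 and an output larger than 1×1, and uses the align-corners convention, which may differ from the paper's implementation:

```python
def bilinear_upsample(img, new_h, new_w):
    """Up-sample a 2-D grid (list of lists, at least 2x2; new_h, new_w > 1)
    with bilinear interpolation, aligning input and output corner pixels."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(new_h):
        y = i * (h - 1) / (new_h - 1)
        y0 = min(int(y), h - 2)
        dy = y - y0
        row = []
        for j in range(new_w):
            x = j * (w - 1) / (new_w - 1)
            x0 = min(int(x), w - 2)
            dx = x - x0
            # interpolate along x on the two bracketing rows, then along y
            top = img[y0][x0] * (1 - dx) + img[y0][x0 + 1] * dx
            bot = img[y0 + 1][x0] * (1 - dx) + img[y0 + 1][x0 + 1] * dx
            row.append(top * (1 - dy) + bot * dy)
        out.append(row)
    return out
```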
    71  Point Spread Function Engineering in Computational Imaging Technology
    QIAO Minda BAI Linge WANG Shuheng WANG Tianyu DONG Xue XIANG Meng LIU Fei LIU Jinpeng SHAO Xiaopeng
    2024, 39(2):271-296. DOI: 10.16337/j.1004-9037.2024.02.003
    [Abstract](2296) [HTML](1958) [PDF 8.93 M](2298)
    Abstract:
    This paper focuses on the new connotation and applications of the point spread function (PSF) of optical imaging in computational imaging. Firstly, the concept of the PSF in traditional optical imaging and its key role in optical system design are introduced, and several algorithms for image restoration using the PSF, together with imaging evaluation indices, are briefly explained. On this basis, the connotation of the PSF is re-examined from the perspective of information transfer under the framework of computational imaging, and relevant research in the field of computational imaging is summarized from the two aspects of narrow and generalized optical systems. Finally, the application prospects and development trends of PSF engineering technology are discussed.
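As a minimal illustration of the PSF's role in image formation: a linear shift-invariant system blurs the scene by convolving it with the PSF, so the image of a point source is the PSF itself. A 1-D toy example (not tied to any particular system in the review):

```python
def blur_with_psf(scene, psf):
    """Discrete 1-D image formation: observed = scene * psf
    (full linear convolution)."""
    n, k = len(scene), len(psf)
    out = [0.0] * (n + k - 1)
    for i, s in enumerate(scene):
        for j, p in enumerate(psf):
            out[i + j] += s * p
    return out
```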
    72  MSDAB-DETR: A Multi-scale Remote Sensing Target Detection Algorithm
    LI Ye ZHOU Shengcui ZHANG Chi
    2024, 39(6):1455-1469. DOI: 10.16337/j.1004-9037.2024.06.014
    [Abstract](918) [HTML](1011) [PDF 2.68 M](579)
    Abstract:
    Because target sizes in remote sensing images differ greatly and the features of targets at different scales are hard to capture, it is difficult to identify targets across scales effectively. Moreover, when dealing with high-resolution images, traditional Transformers may face insufficient computational resources. In addition, the combination of a single loss calculation method with the Hungarian algorithm can increase the fluctuation of the matching cost and affect the convergence speed and accuracy of the algorithm. Therefore, a multi-scale remote sensing target detection algorithm named MSDAB-DETR is proposed. Firstly, the algorithm introduces a new multi-scale attention fusion module that leverages the differences between feature maps of different resolutions to achieve multi-scale prediction for remote sensing images. Secondly, an efficient attention mechanism is adopted to improve the self-attention mechanism in the Transformer model, reducing the memory footprint of the original model. Finally, the SIoU loss function is used as the bounding box regression loss and combined with the Hungarian algorithm to weaken the fluctuation of bipartite graph matching, accelerate convergence, and further improve the regression ability of bounding boxes. Experimental results show that the detection accuracy of this method reaches 95.3% and 71.5% on the NWPU VHR-10 and DIOR datasets, respectively. On the NWPU VHR-10 dataset, the average detection accuracy for small, medium, and large-scale targets is improved by 10.5%, 1.8%, and 2.7%, respectively, compared with the DAB-DETR model, while the memory footprint is reduced by about 9%.
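The Hungarian algorithm mentioned above solves minimum-cost bipartite matching between predictions and ground-truth boxes. For small problems the same optimum can be found by exhaustive search, which makes the role of the cost matrix easy to see (a brute-force sketch, not the paper's implementation; real matchers use a polynomial-time algorithm such as SciPy's `linear_sum_assignment`):

```python
from itertools import permutations

def best_assignment(cost):
    """Minimum-cost one-to-one matching of n predictions to n ground truths.
    cost[i][j] is the matching cost (e.g. classification + box-regression
    loss) of assigning prediction i to ground truth j."""
    n = len(cost)
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        c = sum(cost[i][perm[i]] for i in range(n))
        if c < best_cost:
            best_cost, best_perm = c, perm
    return best_cost, best_perm
```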
    73  Fourier Single-Pixel Imaging Method Based on Adaptive Sampling of Spectral Features
    XIAO Zhenkun ZHANG Yongfeng WEI Wenqing DENG Hu
    2024, 39(2):324-336. DOI: 10.16337/j.1004-9037.2024.02.006
    [Abstract](896) [HTML](646) [PDF 6.03 M](1114)
    Abstract:
    Improvements in the imaging efficiency of Fourier single-pixel imaging (FSI) are mainly achieved through optimized reconstruction algorithms and optimized sampling methods. However, with a limited number of samplings, FSI cannot accurately sample critical frequencies, resulting in poor imaging quality. To solve this problem, a strategy for adaptive sampling of spectral features is proposed. First, the degree of energy concentration in the Fourier domain is investigated to determine the optimal radius of low-frequency equidistant pre-sampling. The corresponding Fourier coefficients are then measured by pre-sampling the low-frequency components to estimate the key spectral positions, which ultimately enables image reconstruction. Compared with the adaptive sampling method based on energy continuity in the high-frequency direction, this method adaptively selects better sampling paths for targets with different spectral features and obtains the key Fourier coefficients, improving the imaging quality with a peak signal-to-noise ratio increase of 2.28 dB and a structural similarity increase of 15.83%. The method therefore acquires spatial information efficiently for FSI of targets with unknown features, and is expected to be applied in single-pixel fast real-time imaging.
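The idea of measuring Fourier coefficients and keeping the spectrally important ones can be sketched with a plain discrete Fourier transform (1-D for brevity; FSI operates in 2-D, and the selection rule here simply ranks by magnitude rather than following the paper's pre-sampling strategy):

```python
import cmath

def dft(signal):
    """Naive discrete Fourier transform of a real-valued sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def key_frequencies(signal, m):
    """Indices of the m Fourier coefficients with the largest magnitude."""
    coeffs = dft(signal)
    return sorted(range(len(coeffs)), key=lambda k: -abs(coeffs[k]))[:m]
```

For a pure cosine, the energy sits in one frequency and its conjugate-symmetric partner, so a magnitude-ranked selection recovers exactly those two bins.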
    74  Few-Shot Learning Method Based on Class Enhancement and Multi-scale Adaptation
    Dong Chijing Zhang Sunjie Ren Han
    2024, 39(3):689-698. DOI: 10.16337/j.1004-9037.2024.03.016
    [Abstract](625) [HTML](602) [PDF 1.55 M](937)
    Abstract:
    To solve the problems of insufficient feature extraction and the difficulty of accurately capturing salient local features in few-shot learning, a method combining class enhancement and multi-scale adaptation is proposed. Firstly, class enhancement is performed on the image at the feature level: rich semantic structures are encoded by associating each activation of the feature map with its neighborhood, making the extracted intra-class features more distinctive and better suited to the classification task. Secondly, low-level representations of image features at different scales are extracted through multi-scale feature generation. Finally, the semantic correlation matrix at each scale is weighted and similarity elements are maximized to compute the semantic similarity between the query image and each support-set category image; after fusing the multi-scale information, the target images are classified. In the 5-way 1-shot and 5-way 5-shot settings, the mean average precision (mAP) of this method on the miniImageNet dataset is 56.83% and 75.76%, respectively, and it achieves 79.33% and 93.92% on the Stanford Cars benchmark and 66.33% and 85.78% on the CUB-200-2011 benchmark, both commonly used fine-grained image datasets, surpassing the best results of existing methods.
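In its simplest form, the final matching step of few-shot classification reduces to nearest-prototype classification by cosine similarity between a query embedding and per-class support embeddings (a generic sketch; the paper's weighted multi-scale matching is more elaborate):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def classify(query, prototypes):
    """Assign the query embedding to the support class whose prototype
    (e.g. the mean support embedding) it is most similar to."""
    return max(prototypes, key=lambda c: cosine(query, prototypes[c]))
```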
    75  Real-Time Semantic Segmentation of Road Scene Based on Multi-level Attention Feature Optimization
    ZHANG Peng PENG Zongju ZHANG Wenrui LUO Yingguo WEI Wei WANG Peirong
    2024, 39(6):1505-1516. DOI: 10.16337/j.1004-9037.2024.06.018
    [Abstract](693) [HTML](594) [PDF 3.81 M](604)
    Abstract:
    Aiming at the problems that targets overlap in complex and changeable road scenes, image edges are difficult to segment, and small-target features are hard to extract, a multi-level attention feature optimization method for real-time semantic segmentation of road scenes is proposed. Firstly, a lightweight residual attention module is designed that accounts for the different feature weights at different levels and optimizes the local features of the image through a compressed attention mechanism, thereby improving the edge effect between pixels. Then, channel attention and a depth aggregation pyramid pooling module are designed to further strengthen the extraction of semantic context information, addressing the loss of small-target information. Finally, an attention fusion module is designed to fuse feature information at different scales from top to bottom, achieving effective interaction of global feature information and enhancing the network's expression of important features. Experiments on the Cityscapes and CamVid road scene datasets give segmentation accuracies of 74.4% and 67.7%, with inference speeds of 138 frames/s and 148 frames/s, respectively. Compared with strong methods of recent years, this method reduces the loss of image edge information and improves the segmentation accuracy of small objects in the image.
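Segmentation accuracy on benchmarks such as Cityscapes is typically reported as mean intersection-over-union (mIoU); assuming that is the metric behind the figures above, it can be computed on flattened label maps as:

```python
def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union over classes that appear in either
    the predicted or the ground-truth label map (flattened sequences)."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)
```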
    76  Polyp Segmentation Network Based on Multiple Attention and Schatten-p Norm
    LI Su LIU Guoqi LIU Dong ZHAO Manqi
    2024, 39(1):223-235. DOI: 10.16337/j.1004-9037.2024.01.020
    [Abstract](701) [HTML](520) [PDF 4.76 M](1066)
    Abstract:
    Automatic and accurate polyp localization and segmentation methods can detect polyps in a timely manner at the early stage of colorectal cancer lesions, greatly reducing the risk of cancerous transformation. The encoder-decoder architecture, the most mainstream network structure for polyp segmentation in recent years, has been greatly improved, for example by strengthening the model's ability to capture global context and local features and by using deep features to guide shallow decoding. However, polyps vary in shape and size, and because of the convolutional nature of these networks, encoding tends to focus too heavily on mining local information and to lose long-range dependencies. Some polyp images also have low contrast and complex spatial characteristics, which makes it easy to confuse the polyp with the background. On this basis, this paper proposes a polyp segmentation network based on multiple attention and the Schatten-p norm (MASNet). The axial multiple attention module uses axial attention to supplement long-range contextual relationships in the image while also attending to boundary and background information to achieve feature complementarity, enhancing the capture of local detail features without neglecting global features. Exploiting the correlation between matrix singular values and the implicit information of a matrix, the Schatten-p norm is introduced as a constraint to analyze the data from a matrix perspective and help the model distinguish foreground from background. Extensive experiments prove the effectiveness of the proposed method, and comparisons with advanced methods on the Kvasir-SEG dataset show that MASNet achieves the best segmentation results.
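The Schatten-p norm used as a constraint above is simply the p-norm of a matrix's singular-value vector; given the singular values it is one line (sketch only; obtaining the singular values themselves requires an SVD):

```python
def schatten_p(singular_values, p):
    """Schatten-p norm: the p-norm of the singular-value vector.
    p=1 gives the nuclear norm, p=2 the Frobenius norm."""
    return sum(abs(s) ** p for s in singular_values) ** (1.0 / p)
```

Smaller p (toward 0 < p < 1) penalizes rank more aggressively than the nuclear norm, which is why Schatten-p constraints are popular as low-rank surrogates.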
    77  Three-Dimensional Reconstruction Method for Single-View Optical Remote Sensing Images Based on Semantic Segmentation and Residual U-Net Fusion
    HUANG Hua ZHU Yuxin ZHANG Li CHEN Zhida ZHANG Yizhi WANG Bo
    2024, 39(2):348-360. DOI: 10.16337/j.1004-9037.2024.02.008
    [Abstract](782) [HTML](813) [PDF 6.12 M](1079)
    Abstract:
    Three-dimensional (3D) reconstruction from single-view remote sensing images is an ill-posed problem that often requires considerable manual experience to supply the missing information needed to construct a complete 3D model. To solve this problem, a 3D reconstruction method for single-view remote sensing images based on semantic segmentation and residual U-Net fusion is proposed. The method includes two stages: semantic segmentation and height estimation of single-view remote sensing images. In the semantic segmentation stage, U-Net is used to determine the class of ground objects. On this basis, U-Net is improved to estimate heights from the remote sensing image, and anchored height regression is combined with semantic features to improve the reconstruction accuracy. Specifically, the feature extraction capability of the encoder is enhanced by embedding residual blocks with different counts and channel widths, and the decoder output layer is modified for the height regression task, achieving pixel-to-pixel prediction of the digital surface model (DSM) height values of remote sensing images. A root mean square error (RMSE) of 2.751 m and a mean absolute error (MAE) of 1.446 m are obtained on the public US3D dataset, and the reconstructed results are superior to those of other networks, confirming that the method can realize 3D estimation from single-view remote sensing images and reconstruct the distribution structure of ground objects.
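The two error metrics reported above are standard; for predicted versus reference DSM heights they are:

```python
import math

def rmse(pred, truth):
    """Root mean square error between predicted and reference heights."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred))

def mae(pred, truth):
    """Mean absolute error between predicted and reference heights."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)
```

RMSE penalizes large height errors quadratically, so RMSE ≥ MAE always, with a large gap indicating a few badly reconstructed pixels.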
    78  High Dynamic Range 3D Reconstruction Based on Event Information and Deep Learning
    Wang Jie Wei Zhendong Wang Qijiang Zhang Qican Wang Yajun
    2024, 39(2):337-347. DOI: 10.16337/j.1004-9037.2024.02.007
    [Abstract](990) [HTML](1049) [PDF 3.90 M](1131)
    Abstract:
    Three-dimensional (3D) measurement of high dynamic range (HDR) surfaces, such as metal parts, black objects, and translucent objects, using optical 3D imaging technology remains a challenging problem. Traditional methods are limited in reconstructing HDR scenes with low-reflection and translucent areas and have difficulty eliminating the internal reflection noise of translucent objects. Existing deep learning-based methods typically use strong laser intensities, which can damage the sample and overexpose the acquired image, necessitating tedious adjustments to the laser power. To address these issues, this paper proposes a 3D measurement method for HDR scenes that combines an event camera with a deep learning algorithm. By asynchronously recording the brightness changes of individual pixels, the event camera achieves a high dynamic range response and can thus fully capture the laser fringes of HDR scenes. In addition, we introduce a deep convolutional neural network (DCNN) to eliminate the noise caused by reflections inside transparent objects and by overexposed, highly reflective areas of metallic objects, while enhancing the weak laser stripes on the surface. Experimental results demonstrate that the proposed method achieves high-quality 3D reconstruction of HDR scenes using low-power line laser scanning.
    79  Fast 3D Imaging of Small Solar System Bodies Based on FFBP Algorithm
    Hu Chaoran Wei Mingchuan
    2024, 39(2):312-323. DOI: 10.16337/j.1004-9037.2024.02.005
    [Abstract](621) [HTML](653) [PDF 3.64 M](995)
    Abstract:
    Radar imaging technology has attracted increasing attention in the field of deep space exploration due to its fast, non-destructive, and high-resolution characteristics. To address the low computational efficiency of synthetic aperture radar (SAR) 3D imaging, a fast factorized back-projection (FFBP) 3D imaging algorithm suited to slow flyby observation modes is proposed, leveraging the weak gravity and rapid spin of small solar system bodies. First, the equivalent motion model under the slow flyby mode is analyzed, extending the imaging domain from a 2D polar coordinate system to a 3D spherical coordinate system. The aperture division and image fusion issues in the 3D FFBP algorithm are then analyzed in depth, deriving rules for 2D sub-aperture division and recursive image fusion methods, along with a detailed implementation process. Finally, the effectiveness of the algorithm is validated through numerical simulations and measured data. Experimental results show that the proposed imaging algorithm significantly enhances computational efficiency: compared with the back-projection (BP) algorithm, it achieves a speedup of 30 to 50 times while delivering imaging performance comparable to the classical BP algorithm.
    80  Detection and Classification of Banded Carbide in Steel Based on Improved Cascade R-CNN
    HAO Liang ZHOU Shiyang MO Yunyang CHEN Yongyong XU Yong SU Jingyong
    2024, 39(5):1228-1239. DOI: 10.16337/j.1004-9037.2024.05.014
    [Abstract](855) [HTML](797) [PDF 4.23 M](984)
    Abstract:
    In the steel industry, carbide is a vital constituent whose distribution in steel materials holds significant reference value for evaluating steel quality. However, current detection methods for carbide in steel bars rely primarily on manual inspection, which is costly and lacks stability. This study introduces advanced deep learning techniques from the domain of artificial intelligence, collecting and annotating 3 192 high-quality images of banded carbides on steel bars, along with 11 complete samples, to create a banded carbide dataset for object detection on steel bars (BCDOD). Common deep learning object detection methods are applied to the dataset and analyzed experimentally. Focusing on the specific characteristics of the application scenario and the data, the cascade R-CNN model is enhanced with rotation data augmentation, improvements to the focal loss function, and negative-sample fine-tuning, improving its performance. The achieved average precision reaches 96%, with 100% recognition accuracy on complete sample data, addressing the existing gap in applying artificial intelligence technology to carbide metallographic detection.
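The focal loss that the study tunes down-weights easy examples relative to ordinary cross-entropy; its standard binary form (Lin et al.; the paper's specific modification is not reproduced here) is:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: p is the predicted probability of the positive
    class, y the label (0 or 1). gamma > 0 shrinks the loss contribution
    of well-classified examples."""
    pt = p if y == 1 else 1.0 - p
    a = alpha if y == 1 else 1.0 - alpha
    return -a * (1.0 - pt) ** gamma * math.log(pt)
```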
    81  A Double-Decoding Model for Polyp Segmentation Based on Feature Fusion
    WU Gang QUAN Haiyan
    2024, 39(4):954-966. DOI: 10.16337/j.1004-9037.2024.04.015
    [Abstract](631) [HTML](713) [PDF 2.84 M](951)
    Abstract:
    In early screening for colorectal cancer, diagnostic efficiency and accuracy can be improved by automated polyp detection and segmentation in colonoscopy images. Due to the complexity of the intestinal environment and limitations of image quality, automated polyp segmentation remains a challenging problem. Aiming at this problem, this paper proposes a dual-decoding model for polyp segmentation that uses a Transformer and dilated convolution to achieve feature fusion (FTDC-Net). ResNet50 is used as the encoder to better extract deep image features. A Transformer coding module, whose self-attention mechanism captures long-distance dependencies between the inputs, is employed, and dilated convolutions with different rates expand the receptive field so the model can capture a larger range of information in the colonoscopy image. The decoding part uses a dual-decoding structure comprising an auto-encoder branch that reconstructs the inputs and a decoding branch that produces the segmentation results. The auto-encoder output is used to generate an attention map that guides the segmentation results. Experimental validation on the Kvasir-SEG and ETIS-LARIBPOLYPDB standard datasets shows that FTDC-Net can effectively segment colon polyps and achieves clear improvements in all evaluation metrics compared with current mainstream polyp segmentation models.
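A dilated convolution enlarges the receptive field without adding parameters by spacing the kernel taps; in 1-D (a toy sketch, not the network's actual layers), the effective kernel span is (k-1)·d + 1 for kernel size k and dilation d:

```python
def dilated_conv1d(x, kernel, dilation):
    """'Valid' 1-D convolution with dilated taps. With kernel size k and
    dilation d, each output sees a span of (k-1)*d + 1 input samples."""
    k = len(kernel)
    span = (k - 1) * dilation + 1
    return [sum(kernel[j] * x[i + j * dilation] for j in range(k))
            for i in range(len(x) - span + 1)]
```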
    82  Unmanned Aerial Vehicle Landing Area Detection Based on Onboard Video
    CAO Yanan LI Minglei LI Jia CHEN Guangyong YE Fangzhou
    2024, 39(6):1445-1454. DOI: 10.16337/j.1004-9037.2024.06.013
    [Abstract](734) [HTML](501) [PDF 2.58 M](537)
    Abstract:
    Improving the autonomous landing capability of unmanned aerial vehicles (UAVs) is important for enhancing their operational efficiency and survivability in the field. This paper presents a novel approach that uses onboard video to automatically detect UAV landing zones, enhancing the UAV's autonomous obstacle avoidance and landing capabilities in the absence of prior scene knowledge. We integrate a deep learning network incorporating multi-view geometric constraints into the simultaneous localization and mapping (SLAM) algorithm to construct a three-dimensional map of the scene while actively identifying potential obstacles. Subsequently, we propose a landing area detection algorithm that accounts for factors such as landing area size and flatness: by performing spatial analysis on a voxel grid map, it identifies suitable landing areas for the UAV. Experimental evaluation in various scenarios demonstrates the accuracy of the proposed approach.
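The area-and-flatness test on the voxel grid can be sketched as a per-cell check (the thresholds, the point-count proxy for area, and the variance proxy for flatness are all illustrative assumptions; the paper's spatial analysis is richer):

```python
def is_landable(cell_heights, min_points, max_variance):
    """A voxel-grid cell qualifies as a landing candidate if it contains
    enough surface points (area proxy) and its height variance is low
    (flatness proxy)."""
    n = len(cell_heights)
    if n < min_points:
        return False
    mean = sum(cell_heights) / n
    variance = sum((h - mean) ** 2 for h in cell_heights) / n
    return variance <= max_variance
```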
    83  Weakly Supervised Video Anomaly Detection Based on Spatio-Temporal Dependence and Feature Fusion
    LIU Deyun LI Ying ZHOU Zhen JI Genlin
    2024, 39(1):204-214. DOI: 10.16337/j.1004-9037.2024.01.018
    [Abstract](743) [HTML](593) [PDF 2.44 M](1086)
    Abstract:
    Weakly supervised video anomaly detection has become a hot spot in video anomaly detection research due to its strong anti-interference ability and low data labeling requirements. Most existing weakly supervised methods assume that the clips in each video are distributed independently and judge whether each clip is abnormal in isolation, ignoring the temporal and spatial information between video clips. To alleviate these problems, this paper proposes a weakly supervised anomaly detection method based on spatio-temporal dependence and feature fusion. While retaining the original characteristics of video clips, the method uses the index distance and feature similarity between clips to fit the temporal and spatial dependencies of the video, building relation features for the clips. By fusing the original features and relation features, the dynamic characteristics and temporal relationships of videos can be better expressed. Extensive experiments on two benchmark datasets, UCF-Crime and ShanghaiTech, demonstrate that the proposed method outperforms other methods, with AUC values reaching 80.1% and 94.6%, respectively.
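The AUC values above come from ranking clip-level anomaly scores; the metric equals the probability that a randomly chosen anomalous clip scores higher than a randomly chosen normal one, computable as a rank statistic:

```python
def auc(scores, labels):
    """ROC AUC as the Mann-Whitney rank statistic: the fraction of
    (anomalous, normal) pairs in which the anomalous clip scores higher
    (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```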
    84  Unsupervised Video Person Re-identification Based on Multiple Kernel Dilated Convolution
    LIU Zhongmin ZHANG Changkai HU Wenjin
    2024, 39(5):1192-1203. DOI: 10.16337/j.1004-9037.2024.05.011
    [Abstract](822) [HTML](685) [PDF 3.15 M](803)
    Abstract:
    Person re-identification aims to identify specific individuals across surveillance cameras, overcoming challenges such as pose variations, occlusions, and background noise that often lead to insufficient feature extraction. This paper proposes a novel unsupervised video-based person re-identification method that utilizes multi-kernel dilated convolution to provide a more comprehensive and accurate representation of individual differences and features. Initially, we employ a pre-trained ResNet50 as an encoder. To further enhance the encoder’s feature extraction capability, we introduce a multiple kernel dilated convolution module. Enlarging the receptive field of convolutional kernels allows the network to more effectively capture both local and global feature information, offering a more comprehensive depiction of a person’s appearance features. Subsequently, a decoder is employed to restore high-level semantic information to a more fundamental feature representation, thereby strengthening feature representation and improving system performance under complex imaging conditions. Finally, a multi-scale feature fusion module is introduced in the decoder output to merge features from adjacent layers, reducing semantic gaps between different feature channel layers and generating more robust feature representations. Offline experiments are conducted on three mainstream datasets, and results show that the proposed method achieves significant improvements in both accuracy and robustness.
    85  Research Progress on Application of Computational Imaging in Holographic Storage Phase Retrieval
    HAO Jianying LIN Yongkun LIU Hongjie CHEN Ruixian SONG Haiyang LIN Dakui LIN Xiao TAN Xiaodi
    2024, 39(2):297-311. DOI: 10.16337/j.1004-9037.2024.02.004
    [Abstract](970) [HTML](858) [PDF 5.25 M](5020)
    Abstract:
    Holographic storage technology, a data storage technology with three-dimensional volume storage and two-dimensional data transmission, is characterized by high storage density and fast data transfer, making it one of the most promising solutions for long-term storage of massive data. Traditional holographic storage is limited by the fact that photodetectors respond only to intensity, so it is usually modulated by pure amplitude coding. However, amplitude information alone cannot fully exploit the advantages of holography, and decoding phase information in a simple, fast, stable, and accurate way is a practical problem facing holographic storage technology. Computational imaging opens a new way to solve the phase retrieval problem for holographic storage because of its algorithmic versatility and high perceptual dimensionality. This paper reviews recent work on solving the phase retrieval problem of holographic storage with computational imaging technology, from the perspectives of iterative computational phase retrieval and deep learning phase retrieval, and analyzes this work in terms of improving storage density, data reading speed, and data reading stability. Finally, an outlook on future development in this direction is given.
    86  Research Progress of Computational Enhanced Optical Coherence Tomography
    Qiao Zhengyu Huang Yong Hao Qun
    2024, 39(2):248-270. DOI: 10.16337/j.1004-9037.2024.02.002
    [Abstract](1303) [HTML](887) [PDF 8.05 M](1891)
    Abstract:
    Optical coherence tomography (OCT) has become an important non-invasive three-dimensional imaging technology with a wide range of applications. New demands on OCT have emerged as application scenarios develop, such as resolution improvement, depth-of-focus decoupling, aberration correction, and anisotropic resolution correction. Over the past decades, computational imaging methods have proven effective in improving these performance parameters. This paper focuses on the four performance-improvement demands and reviews several representative computational methods, comparing the strengths and weaknesses of the respective solutions and looking ahead to future trends of computation-enhanced OCT technology, with the aim of providing references for further study and applications.
    87  Artificial Intelligence-Assisted Magnetic Resonance Imaging in Assessment of Neoadjuvant Chemotherapy for Breast Cancer: A Review
    LIU Kaiwen JIN Yingying WANG Shouju
    2024, 39(4):794-812. DOI: 10.16337/j.1004-9037.2024.04.003
    [Abstract](1422) [HTML](1308) [PDF 2.75 M](1366)
    Abstract:
    Neoadjuvant chemotherapy has become a standard treatment strategy for breast cancer, and magnetic resonance imaging (MRI) is the preferred imaging method for assessing the response of breast cancer to neoadjuvant chemotherapy. Although MRI can provide detailed information on tumors, including location, size, and microenvironment, the precise assessment of neoadjuvant chemotherapy for breast cancer is complicated by the diverse tumor changes visible in MRI images. Artificial intelligence methods based on machine learning and deep learning have demonstrated the ability to recognize complex patterns in MRI data. Through clinical radiologic feature analysis, radiomics analysis, and habitat analysis, artificial intelligence technology has significantly enhanced the performance and efficiency of assessments of breast cancer neoadjuvant chemotherapy, aiding the realization of personalized treatment strategies. This paper introduces the MRI data and performance indicators used in assessing breast cancer neoadjuvant chemotherapy, summarizes the progress of artificial intelligence applications in this field, and discusses the current challenges and potential future research directions for artificial intelligence technology in practical applications.
    88  Opportunities and Challenges of Diffusion MRI in Traditional Chinese Medicine
    WU Ye HE Lanxiang ZHANG Xinyuan FU Yunhe LIU Xiaoming HE Jianzhong
    2024, 39(4):776-793. DOI: 10.16337/j.1004-9037.2024.04.002
    [Abstract](1209) [HTML](1445) [PDF 937.74 K](1139)
    Abstract:
    Diffusion magnetic resonance imaging (dMRI) is an advanced medical imaging modality that yields intricate insights into tissue microstructure by assessing the diffusion of water molecules within biological tissues, and it is progressively being integrated into clinical diagnosis and treatment. Notably, within traditional Chinese medicine (TCM), dMRI has demonstrated unique potential and significance, providing an empirical foundation for TCM’s “differentiation and treatment”. Its utility extends beyond precise disease diagnosis to encompass disease progression monitoring and treatment efficacy evaluation, aligning with TCM’s principles of “preventive treatment” and “individualized treatment”. Nonetheless, the assimilation of dMRI into TCM encounters notable challenges. This review delves into recent applications of dMRI within TCM, scrutinizing its prospects and constraints. By fostering interdisciplinary partnerships between medical and engineering disciplines, particularly in the realm of TCM-intelligent imaging technology, this study aims to propel the application and evolution of dMRI within TCM’s diagnostic and therapeutic domains.
    89  Graph Structure Learning Method for Multi-site Autism Diagnosis Based on Multi-view Low-Rank Subspace
    HUANG Jianhui MA Di ZHANG Li
    2024, 39(4):984-995. DOI: 10.16337/j.1004-9037.2024.04.017
    [Abstract](464) [HTML](490) [PDF 2.19 M](683)
    Abstract:
    Autism spectrum disorder (ASD) is one of the most prevalent and most heritable neurodevelopmental disorders, characterized by a multitude of clinical symptoms, most notably social communication deficits. Effective identification of biomarkers holds paramount significance in facilitating early interventions for ASD. Many current methods leverage multi-site imaging data to augment the sample size and thereby enhance diagnostic accuracy. However, the heterogeneity of data across multiple sites, resulting from variations in imaging devices, imaging parameters, and data processing workflows, is frequently overlooked. To overcome this problem, this paper proposes a graph structure learning method for multi-site autism diagnosis based on multi-view low-rank subspace (MVLL-GSL). Firstly, multiple views of the brain network, encompassing diverse topological information, are constructed for each sample. Subsequently, samples from different classes are projected into their respective low-rank subspaces to mitigate the impact of data heterogeneity. Finally, graph structure learning is integrated with multi-task graph embedding learning, incorporating prior subnetworks and multi-view consistency regularization constraints, to extract more discriminative and coherent features from the multi-view low-rank subspaces. The public ABIDE (Autism brain imaging data exchange) database is used to verify the proposed method. Experimental results show that the MVLL-GSL method improves the performance of ASD diagnosis and explains the association of different prior sub-networks with ASD pathogenesis.
    90  Target Position Detection Based on Bidirectional Fusion of Texture and Depth Information
    ZHANG Yawei FU Dongxiang
    2024, 39(5):1214-1227. DOI: 10.16337/j.1004-9037.2024.05.013
    [Abstract](673) [HTML](656) [PDF 4.29 M](826)
    Abstract:
    Aiming at the problem of how to obtain accurate positional information of objects in unstructured scenes with depth cameras on hardware with limited resources, a target position detection method based on bidirectional fusion of texture and depth information is proposed. In the learning phase, the two networks adopt the full-flow bidirectional fusion (FFB6D) module; the texture information extraction part introduces the lightweight Ghost module to reduce the computation of the network and adds the attention mechanism CBAM to enhance useful features, while the depth information extraction part extends local features and multi-level feature fusion to obtain more comprehensive features. In the output stage, to improve efficiency, the instance semantic segmentation results are utilized to filter background points, then 3D keypoint detection is performed, and finally the position information is obtained by a least-squares fitting algorithm. Validations are carried out on the LINEMOD, Occlusion LINEMOD and YCB-Video public datasets, on which accuracies reach 99.8%, 66.3% and 94%, respectively, and the number of parameters is reduced by 31%, showing that the improved position estimation method can reduce the number of parameters while guaranteeing accuracy.
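    The final pose-recovery step above fits a rigid transform to the detected 3D keypoints by least squares. As an illustration only (not the paper's code), a common closed-form solution is the SVD-based Kabsch method; the function and variable names below are ours:

```python
import numpy as np

def fit_rigid_pose(model_kps, pred_kps):
    """Least-squares rigid transform (R, t) mapping model keypoints onto
    predicted keypoints, via the SVD-based Kabsch method (illustrative)."""
    mu_m = model_kps.mean(axis=0)
    mu_p = pred_kps.mean(axis=0)
    H = (model_kps - mu_m).T @ (pred_kps - mu_p)   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_p - R @ mu_m
    return R, t
```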
    91  Dynamic SLAM Based on Background Restoration
    LI Jiahui FAN Xinyue ZHANG Gan ZHANG Kuo
    2024, 39(5):1204-1213. DOI: 10.16337/j.1004-9037.2024.05.012
    [Abstract](790) [HTML](499) [PDF 3.14 M](826)
    Abstract:
    In the context of simultaneous localization and mapping (SLAM), positioning accuracy is significantly degraded by interference from dynamic objects. This paper addresses the challenges of SLAM in dynamic environments through the removal of dynamic objects and the restoration of the resulting empty regions. Semantic information is obtained using Mask-RCNN, while an epipolar geometry approach is employed to eliminate dynamic objects. Keyframe pixel-weighted mapping enables precise, pixel-by-pixel recovery of void regions in both RGB and depth maps. Experimental results on the TUM dataset demonstrate an average improvement of 85.26% in pose estimation accuracy compared to ORB-SLAM2, as well as a 28.54% enhancement over DynaSLAM. The proposed method exhibits robust performance even in real-world scenarios.
    92  Anti-missing Mechanism Based on SOT and Rematching in Multiple Object Tracking
    ZHANG Yifeng ZHANG Jiacheng LI Yuanhao
    2024, 39(6):1479-1492. DOI: 10.16337/j.1004-9037.2024.06.016
    [Abstract](750) [HTML](608) [PDF 3.57 M](569)
    Abstract:
    Data association is an important step in multiple object tracking (MOT), which generally requires identity matching between objects and detections based on feature similarity. Some objects or detections may remain isolated after matching is completed; this missing phenomenon may lead to track interruption or identity confusion. Therefore, to improve the accuracy and stability of MOT and suppress the missing phenomenon in data association, this paper proposes an anti-missing mechanism based on a high-performance single object tracker (SOT) and rematching. The mechanism uses a Transformer and a diffusion model to design an SOT that meets the requirements of MOT, tracking missing objects and rematching missing detections by remembering object information. The effect of the SOT and rematching methods in the anti-missing mechanism is verified by ablation experiments, and the effect of the mechanism on the tracking performance of MOT algorithms is tested on standard datasets. The results show that the performance of all algorithms improves comprehensively with the addition of this mechanism, which can effectively suppress the missing phenomenon in MOT.
    93  An Expression Recognition Model Based on Pyramid Split Attention and Joint Loss
    GU Rui GU Jiale SONG Cuiling
    2024, 39(6):1493-1504. DOI: 10.16337/j.1004-9037.2024.06.017
    [Abstract](564) [HTML](695) [PDF 2.10 M](643)
    Abstract:
    How to extract multi-scale features and model semantic dependencies between remote channels remains a challenge for expression recognition networks. This paper proposes a residual network based on pyramid split attention (PSA-ResNet), which replaces the 3 × 3 convolution in the ResNet50 residual module with PSA to effectively extract multi-scale features and enhance the correlation of cross-channel information. To reduce the differences between similar expressions and expand the distance between different types of expressions, a joint loss function combining Softmax loss and Center loss is introduced to optimize the parameters during training. The proposed model is evaluated on two publicly available datasets, Fer2013 and CK+, and achieves accuracies of 74.26% and 98.35%, respectively, confirming that this method delivers better recognition results than cutting-edge algorithms.
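    The joint objective above combines Softmax cross-entropy with Center loss, which pulls each feature toward its class centre. A minimal NumPy sketch of this combination (the weighting `lam` is an illustrative value, not the paper's setting):

```python
import numpy as np

def joint_loss(features, logits, labels, centers, lam=0.003):
    """Softmax cross-entropy plus Center loss; `lam` weights the center term
    (illustrative value, not from the paper)."""
    # Softmax cross-entropy over the classifier logits
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_p[np.arange(len(labels)), labels].mean()
    # Center loss: squared distance of each feature to its class centre
    diff = features - centers[labels]
    center = 0.5 * (diff ** 2).sum(axis=1).mean()
    return ce + lam * center
```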
    94  Single Target Tracking of Ships Based on Adaptive Smoothing KF-PDA Algorithm
    REN Mingliang JIA Zhiqiang SHENG Qinghong SUN Zhulei
    2024, 39(6):1470-1478. DOI: 10.16337/j.1004-9037.2024.06.015
    [Abstract](666) [HTML](494) [PDF 1.48 M](512)
    Abstract:
    In view of the high computational complexity of the probability data association (PDA) algorithm in cluttered environments, a data association method based on the PDA algorithm is designed. When the number of measurement points in the gate exceeds a certain threshold, the PDA algorithm is employed to update the target state. When the number of measurement points falls below or equals the threshold, a nearest-neighbor approach is used to filter the target measurement points. Subsequently, the Kalman filter (KF) algorithm is utilized to achieve fast filtering updates in cluttered environments. Additionally, the paper proposes an adaptive interval smoothing method that dynamically corrects the smoothing interval to achieve reverse smoothing of the overall state estimation, aiming to improve the algorithm’s accuracy. Experimental results in various clutter environments demonstrate that the proposed method effectively enhances the estimation accuracy of the system state while ensuring tracking efficiency. Moreover, the results validate the robustness and effectiveness of the method compared to the PDA algorithm and the KF-PDA algorithm.
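    The switching rule described above can be sketched as a single gated update step: PDA-style probability weighting when the gate contains many measurements, nearest-neighbour selection otherwise, followed by a standard Kalman update. This is an illustrative simplification with example thresholds, not the paper's implementation:

```python
import numpy as np

def gated_update(x_pred, P_pred, H, R, measurements,
                 gate_thresh=9.21, count_thresh=3):
    """One update step switching between PDA-style weighting and nearest
    neighbour, depending on how many measurements fall inside the gate.
    Thresholds are example values, not the paper's."""
    S = H @ P_pred @ H.T + R                       # innovation covariance
    S_inv = np.linalg.inv(S)
    z_pred = H @ x_pred
    # keep only measurements inside the Mahalanobis gate
    nus = [z - z_pred for z in measurements]
    gated = [nu for nu in nus if nu @ S_inv @ nu <= gate_thresh]
    if not gated:
        return x_pred, P_pred                      # no valid measurement
    if len(gated) > count_thresh:
        # PDA branch: likelihood-weighted combined innovation
        w = np.array([np.exp(-0.5 * nu @ S_inv @ nu) for nu in gated])
        w /= w.sum()
        nu = sum(wi * nui for wi, nui in zip(w, gated))
    else:
        # nearest-neighbour branch: smallest Mahalanobis distance
        nu = min(gated, key=lambda v: v @ S_inv @ v)
    K = P_pred @ H.T @ S_inv                       # Kalman gain
    x = x_pred + K @ nu
    P = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x, P
```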
    95  Review of Very Low Bitrate Image Compression Techniques
    YUE Shuang CHEN Zhe YIN Fuliang
    2025, 40(1):102-116. DOI: 10.16337/j.1004-9037.2025.01.008
    [Abstract](915) [HTML](1114) [PDF 2.70 M](598)
    Abstract:
    Images are one of the important ways to obtain information. With the increasing demand for image transmission and storage, especially in bandwidth-limited or cloud storage situations, compressing images at extremely low bitrates is of great significance for improving transmission efficiency and saving storage space. On this basis, this paper presents a systematic review of very low bitrate lossy image compression techniques. Firstly, starting from the problems of image compression algorithms derived from the generative adversarial network (GAN), namely difficulty with high-resolution images, blur in generated images, and neglect of semantic and texture information, the latest very low bitrate image compression methods are introduced. Then, the paper elaborates on image compression methods that achieve very low bitrates using non-GAN models, such as layered compression, object-based, and region-of-interest methods. After that, the commonly used datasets and image quality evaluation methods under lossy compression conditions are described. Finally, a summary of very low bitrate lossy image compression techniques is made, and an outlook on their subsequent development is given.
    96  Context-Aware Image Restoration Based on Fused Semantic Information
    ZU Yi ZHANG Sunjie WU Peng MA Yueheng
    2025, 40(2):401-416. DOI: 10.16337/j.1004-9037.2025.02.010
    [Abstract](371) [HTML](584) [PDF 4.22 M](481)
    Abstract:
    In recent years, generative adversarial networks have been widely used in the field of image restoration and have achieved good results. However, current methods do not address the blurred structures and textures that arise in high-resolution (512×512) images, which mainly stem from a lack of effective feature information. To address this problem, this paper proposes a generative adversarial network that combines image features with semantic information. A context-aware image restoration model centred on semantic information is proposed: it adaptively fuses semantic information with image features, replaces traditional convolution with an adaptive convolution, and adds a multi-scale context aggregation module after the decoder to capture long-distance information for contextual inference. Experiments conducted on the Places2, CelebA-HQ, Paris Street View, and Openlogo datasets show that the proposed method improves on existing methods in terms of L1 loss, peak signal-to-noise ratio (PSNR), and structural similarity (SSIM).
    97  Multi-scale Crossed Algorithm for Ultrasound Medical Image Segmentation Based on MSC-LSAM
    WANG Zhaoxin YANG Wenwen RONG Ze LI Zhengyu WANG Xing MA Lei
    2025, 40(2):469-484. DOI: 10.16337/j.1004-9037.2025.02.015
    [Abstract](545) [HTML](814) [PDF 3.91 M](458)
    Abstract:
    Stroke is one of the leading causes of death and disability around the world. Carotid artery stenosis (CAS) and cardiac lesions are important contributing factors to ischemic stroke, and ultrasound imaging has shown great potential in diagnosing ischemic stroke caused by CAS and cardiac lesions. However, ultrasound images present significant segmentation challenges due to noise and blurred boundaries. To address this issue, the MSC-LSAM algorithm, a multi-scale crossed dual-encoder network for ultrasound image segmentation, is proposed. It aims to achieve rapid and accurate segmentation of the carotid artery and cardiac cavities, assisting physicians in disease diagnosis. In MSC-LSAM, the encoder part parallels a segment anything model (SAM) vision encoder and a UNet encoder, while the decoder part utilizes a UNet decoder. In the SAM image encoder, we freeze the pretrained SAM image encoder and introduce efficient adapter blocks in the Transformer layers, referred to as learnable SAM (LSAM). LSAM maintains learning capability and high generalization ability while having a low number of parameters. In the global UNet network, we incorporate multi-scale cross-axial attention (MCA) blocks to achieve cross-fusion of multi-scale features between different axes, effectively enhancing edge segmentation capability and suppressing model overfitting. Following the parallel encoders, an efficient channel attention (ECA) block is added to integrate the multi-scale features from the dual encoders, reducing incorrect segmentation caused by feature-level mismatches. MSC-LSAM achieves good performance on both the publicly available cardiac ultrasound dataset CAMUS and the self-constructed carotid artery ultrasound dataset CAUS. Average dice similarity coefficients (DSCs) for the segmentation of the two-chamber (2CH) and four-chamber (4CH) datasets in CAMUS reach 0.927 and 0.934, respectively, while the average DSC for the CAUS dataset reaches 0.917. MSC-LSAM achieves good segmentation accuracy in carotid lumen and cardiac chamber ultrasound image segmentation, surpassing mainstream segmentation algorithms, and shows promising application prospects.
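    The dice similarity coefficient (DSC) reported above is the standard overlap metric for binary segmentation masks; a minimal reference implementation:

```python
import numpy as np

def dice(pred, target, eps=1e-7):
    """Dice similarity coefficient (DSC) between two binary masks:
    2*|A∩B| / (|A|+|B|), with eps guarding the empty-mask case."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```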
    98  Text-Guided Image Editing Method Based on Diffusion Model with Mapping-Fusion Embedding
    WU Fei MA Yongheng DENG Zheying WANG Yinjie JI Yimu JING Xiaoyuan
    2025, 40(4):1035-1045. DOI: 10.16337/j.1004-9037.2025.04.016
    [Abstract](281) [HTML](229) [PDF 3.60 M](481)
    Abstract:
    Text-guided editing of real images with only images and target text prompts as input is an extremely challenging problem. Previous approaches based on fine-tuning large pre-trained diffusion models often simply interpolate and combine source and target text features to guide the image generation process, which limits their editing capabilities, while fine-tuning large diffusion models is highly susceptible to overfitting and is time-consuming. In this paper, we propose a text-guided image editing method based on a diffusion model with mapping-fusion embedding (MFE-Diffusion). The method consists of two components: (1) A joint learning framework for the large pre-trained diffusion model and source text feature vectors, which enables the model to quickly learn to reconstruct the original image. (2) A feature mapping-fusion module, which deeply fuses the feature information of the target text and the original image to generate the conditional embedding used to guide the image editing process. Experimental validation on the challenging text-guided image editing benchmark TEdBench shows that the proposed method has advantages in image editing performance.
    99  Degradation Information-Guided Underwater Light Field Image Enhancement and Angular Reconstruction
    LIU Deyang LI Shizheng ZHU Yuhang LIU Hui
    2025, 40(2):374-383. DOI: 10.16337/j.1004-9037.2025.02.008
    [Abstract](434) [HTML](449) [PDF 3.43 M](453)
    Abstract:
    Unlike traditional 2D RGB imaging, 4D light field imaging captures the scene from multiple angular views and thus carries its own geometric information, a feature that is expected to help solve the problem of underwater imaging. We propose a degradation information-guided underwater 4D light field image enhancement and angular reconstruction network based on the angular properties of 4D light field images. The network learns the degradation information of underwater images from different angular views after downsampling, and converts the degradation information into a convolution kernel that is passed to the original-size underwater light field image, realizing efficient exchange of degradation information between underwater images of different angular views. By fully using the degradation information and spatial-angular information of the underwater light field image, the proposed network can better complete image enhancement and angular reconstruction of the underwater light field. Meanwhile, this paper proposes a spatial-angular aggregation convolution tailored to light field characteristics, which efficiently learns the correlation of texture information between different views by calculating the gradient difference between the centre pixel and other view pixels. The effectiveness of the network design is fully verified through quantitative as well as qualitative experiments.
    100  A Few-Shot Learning Algorithm for Defect Image Generation and Data Augmentation Based on DID-AugGAN
    HUANG Lve DENG Yafeng YAN Huabiao XIAO Wenxiang
    2025, 40(5):1306-1321. DOI: 10.16337/j.1004-9037.2025.05.016
    [Abstract](293) [HTML](235) [PDF 36.59 K](831)
    Abstract:
    To address the issues of low quality, lack of realism, and poor diversity in defect images generated by generative adversarial network (GAN) under small-sample conditions, this paper proposes a defect image generation algorithm, named defect image data augmentation GAN (DID-AugGAN), aiming at enhancing defect image data under limited sample conditions. First, to overcome the difficulty of traditional convolutional networks in effectively learning non-rigid features in images from limited datasets, we design a learnable offset convolution to improve the model’s capability in capturing semantic information. Second, to prevent the loss of critical defect features and enhance the correlation among local features, we introduce a multi-scale coordinate attention module, which focuses on defect location information. Third, to enhance the discriminator’s ability to distinguish local details in input images, we redesign its architecture, transforming it from a conventional feedforward network into a UNet-like structure with symmetric encoding and decoding pathways. Finally, we conduct comparative experiments between DID-AugGAN and the baseline algorithm on the Rail-4c track fastener defect dataset, and validate the generated images using the MobileNetV3 classification network. Experimental results demonstrate that the proposed method significantly improves inception score (IS) while effectively reducing Fréchet inception distance (FID) and learned perceptual image patch similarity (LPIPS). Moreover, the classification accuracy and F1-score of MobileNetV3 are also improved. The proposed DID-AugGAN can stably generate high-quality defect images, effectively augment defect data samples, and meet the requirements of downstream tasks.
    101  Fine-Grained Image Recognition Method Based on Attention and Multi-scale Ensemble Learning
    JI Shengyu JIANG Zhikang MA Xiang YANG Lvxi
    2025, 40(2):384-400. DOI: 10.16337/j.1004-9037.2025.02.009
    [Abstract](551) [HTML](852) [PDF 4.54 M](495)
    Abstract:
    Fine-grained image recognition (FGIR) is an important research topic in the field of computer vision. Its main goal is to distinguish subclasses with highly similar appearance under the same category. This paper focuses on weakly-supervised fine-grained image recognition technology. Given the problems of insufficient use of fine-grained image features and the difficulty of mining discriminative regions in FGIR research, the attention and multi-scale ensemble-learning based network (AMEN) is proposed. The method introduces a progressive learning network that uses an ensemble learning strategy to construct multi-scale base-classifiers in parallel from three levels of output features of the deep neural network, and applies label smoothing to train the multi-scale base-classifiers progressively, greatly improving the utilization of low-level features. At the same time, efficient dual-channel attention is used to impose channel weights on features, so that the network can independently select features at the channel level and improve the utilization of highly informative channels. The method also introduces a self-attention region proposal network, which guides the model to gradually locate more discriminative regions by constructing a circular feedback mechanism, and fuses the feature information of the complete image and the discriminative region in the final classification module. Experimental results show that the recognition accuracy of AMEN on the three fine-grained image datasets CUB-200-2011, FGVC Aircraft and Stanford Cars reaches the advanced level of the field.
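    The label smoothing used to train the base-classifiers replaces hard one-hot targets with softened ones. A minimal sketch of the standard formulation (the smoothing factor `eps` is illustrative, not the paper's value):

```python
import numpy as np

def smooth_labels(labels, num_classes, eps=0.1):
    """Standard label smoothing: keep 1-eps of the mass on the true class
    and spread eps uniformly over all classes."""
    one_hot = np.eye(num_classes)[labels]
    return one_hot * (1.0 - eps) + eps / num_classes
```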
    102  Panoramic Image Recognition of Rock Borehole Based on Deep Learning
    XIAN Yongli CHEN Xuejian PENG Zhenming WANG Jie PENG Bo
    2025, 40(3):675-685. DOI: 10.16337/j.1004-9037.2025.03.009
    [Abstract](464) [HTML](356) [PDF 3.98 M](533)
    Abstract:
    Geotechnical borehole monitoring, as one of the most common tunneling advanced detection techniques, can truly reflect the material properties, characteristics, and groundwater conditions of geomaterials, which is vital to ensuring construction safety. Based on the characteristics of geotechnical borehole monitoring objectives, a smart visual system based on panoramic cameras is developed. The system is suitable for close-range, dynamic, high-resolution imaging of the inner walls of long geotechnical boreholes. Based on an improved EfficientNetV2 network and a sliding-window prediction strategy, rapid intelligent recognition of eight types of rock borehole images is realized. Experimental results show that the visual system can meet the requirements for close-range high-resolution panoramic imaging of long boreholes and achieve intelligent state assessment of rock materials. The recognition success rate reaches 91.49% on the test set, and the system preliminarily possesses a comprehensive intelligent evaluation capability for geotechnical borehole status.
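    The sliding-window prediction mentioned above can be sketched as scanning a fixed-size window along the panoramic strip and averaging the per-window class probabilities. Window and stride sizes here are illustrative, and `classify` stands in for the EfficientNetV2 classifier:

```python
import numpy as np

def sliding_window_predict(image, classify, win=224, stride=112):
    """Slide a fixed-size window along a long panoramic strip and average
    the per-window class probabilities (sizes are illustrative)."""
    w = image.shape[1]
    probs = []
    for x in range(0, max(w - win, 0) + 1, stride):
        patch = image[:, x:x + win]
        probs.append(classify(patch))   # per-window class probabilities
    return np.mean(probs, axis=0)       # aggregate over all windows
```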
    103  Robust Detection Method for AI-Generated Images Based on CNN-Transformer Hybrid Architecture
    KANG Xinyuan LI Fan ZHAO Hui WANG Baodong LI Xin
    2025, 40(5):1283-1293. DOI: 10.16337/j.1004-9037.2025.05.014
    [Abstract](324) [HTML](226) [PDF 31.83 K](745)
    Abstract:
    With the rapid development of deep generative models, the realism of synthetic images has been continuously improving, and generative technologies, from image generation to face manipulation, have been deeply integrated into daily life, drawing attention to the authenticity of images. In addition, mainstream image classification models are mainly pre-trained on natural scene datasets with rich and varied styles, whereas a single prompt can generate a large amount of data with an obvious homogeneity problem; this exacerbates the imbalance in learning difficulty and leaves the traditional binary classification training approach with insufficient generalization ability in the generated-image detection task. To address these issues, we propose a detection method for the hard/easy sample imbalance that requires no modification of the existing classification model: it establishes an effective data augmentation paradigm through generated-data self-enhancement to expand the diversity of generated data and thereby balance the learning difficulty of the model. At the same time, we use a corrected class cross-entropy loss to apply sensitivity-aware penalties to hard and easy samples. The proposed method achieved the best results in the real and fake image recognition competition of the computer vision application challenge held by the Artificial Intelligence Society of Shandong Province in November 2023.
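    The corrected class cross-entropy for hard and easy samples is not spelled out in the abstract; one common instance of such a correction is a focal-style factor that down-weights easy samples, sketched here as an assumption rather than the paper's exact loss:

```python
import numpy as np

def focal_ce(probs, labels, gamma=2.0):
    """Cross-entropy with a focal-style factor (1 - p_t)^gamma that
    down-weights easy (high-confidence) samples. This is a common
    instance of hard/easy reweighting, not the paper's exact loss."""
    p_t = probs[np.arange(len(labels)), labels]    # prob of true class
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))
```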
    104  Emotional Video Captioning Based on Fine-Grained Visual and Audio-Visual Dual-Branch Fusion
    GONG Yuxuan HAN Tingting
    2025, 40(5):1165-1176. DOI: 10.16337/j.1004-9037.2025.05.005
    [Abstract](198) [HTML](803) [PDF 35.75 K](617)
    Abstract:
    Emotional video captioning, as a cross-modal task integrating visual semantic parsing and emotional perception, faces the core challenge of accurately capturing the emotional cues embedded in visual content. Existing methods have two notable limitations: First, they insufficiently explore the fine-grained semantic correlations between video subjects (such as humans and objects) and their appearance and motion features, leading to a lack of refined support for visual content understanding; second, they neglect the auxiliary value of the audio modality in emotional discrimination and content semantic alignment, which restricts the comprehensive utilization of cross-modal information. To address these issues, this paper proposes a framework based on fine-grained visual and audio-visual dual-branch fusion. Specifically, the fine-grained visual feature fusion module effectively models the fine-grained semantic associations between video entities and visual contexts through pairwise interactions and deep integration of visual, object, and motion features, thereby achieving refined parsing of video content. The audio-visual dual-branch global fusion module constructs a cross-modal interaction channel to deeply fuse the integrated visual features with audio features, fully leveraging the supplementary role of audio information in emotional cue transmission and semantic constraint. Validation experiments on public benchmark datasets show that the proposed method outperforms comparative methods such as CANet and EPAN across evaluation metrics, achieving an average improvement of 4% over EPAN in emotional metrics, an average increase of 0.5 in semantic metrics, and an average boost of 0.7 in comprehensive metrics. Experimental results demonstrate that the proposed method can effectively enhance the quality of emotional video captioning.
    105  End-to-End Video Compression Technology and Its Application in Unmanned Aerial Vehicles
    YE Feng DONG Fanke JIA Chuanmin
    2025, 40(2):303-319. DOI: 10.16337/j.1004-9037.2025.02.003
    [Abstract](654) [HTML](990) [PDF 6.19 M](526)
    Abstract:
    The field of multimedia visual representation and transmission is undergoing profound transformation, with end-to-end optimized intelligent video coding technologies serving as the driving force. The compression of emerging video content represented by unmanned aerial vehicle (UAV) videos has further stimulated the development of core technologies and innovation in application scenarios. Focusing on end-to-end video coding technology and its initial exploration in UAV video coding, this study proposes a hierarchical bi-directional reference structure-based video coding method that addresses the shortcomings of existing models in motion representation efficiency and predictive coding accuracy. The targeted design introduces a parameter-shared motion codec, a bi-directional scaled motion representation method, and credible motion modeling technology, significantly improving the rate-distortion performance of UAV video compression and outperforming traditional video coding standards such as H.266/VVC. This work provides novel insights for the advancement of key intelligent video coding technologies and their practical applications, demonstrating promising potential for future deployment in UAV visual perception and related domains.
    106  Low Bit Rate Generative Drone Video Compression
    LIU Meiqin CHEN Hongyu ZHOU Yiming NI Wenhao
    2025, 40(2):320-333. DOI: 10.16337/j.1004-9037.2025.02.004
    [Abstract](541) [HTML](850) [PDF 4.34 M](497)
    Abstract:
    In complex environments across air, space, land, and sea, the massive volume of video data exerts tremendous pressure on limited transmission bandwidth and storage devices. Improving the coding efficiency of video compression technologies under low bit rate conditions therefore becomes crucial. In recent years, deep learning-based video compression algorithms have made significant progress, yet due to issues such as model design flaws, mismatches between optimization objectives and perceptual quality, and biases in training data distributions, visual perceptual quality at extremely low bit rates has been compromised. Generative coding effectively improves texture and structure restoration at low bit rates by learning the data distribution, alleviating the blur artifacts of deep video compression. However, two major bottlenecks remain in existing research: First, temporal correlation modeling is insufficient and inter-frame feature correlation is missing; second, the lack of a dynamic bit allocation mechanism makes adaptive extraction of key information difficult. Therefore, this article proposes a conditional guided diffusion model based video compression (CGDM-VC) algorithm, aiming to improve the perceptual quality of videos under low bit rate conditions while enhancing inter-frame feature modeling capability and preserving key information. Specifically, the algorithm designs an implicit inter-frame alignment strategy, utilizing a diffusion model to capture latent inter-frame features and reduce the computational complexity of estimating explicit motion information. Meanwhile, the designed adaptive spatio-temporal importance-aware coder can dynamically allocate bit rates to optimize the generation quality of key regions. Furthermore, a perceptual loss function is introduced, combined with the learned perceptual image patch similarity (LPIPS) constraint, to improve the visual fidelity of the reconstructed frames. Experimental results demonstrate that, compared to algorithms such as deep contextual video compression (DCVC), the proposed method achieves an average LPIPS reduction of 36.49% under low bit rate conditions (<0.1 BPP), showing richer texture details and more natural visual effects.
    107  Scene Classification Method of High-Resolution Remote Sensing Images Based on FACNNCN
    ZHANG Jing YANG Yuhao CAO Feng ZHANG Chao LI Deyu
    2025, 40(6):1637-1649. DOI: 10.16337/j.1004-9037.2025.06.020
    [Abstract](192) [HTML](125) [PDF 5.81 M](429)
    Abstract:
    High-resolution remote sensing image scene classification aims to accurately perceive complex surface scenes, which is significant for understanding and extracting information from high-resolution remote sensing images. A new scene classification method based on a feature aggregated convolutional neural network (FACNN) and a capsule network (CapsNet), named FACNNCN, is proposed in this paper. The method enhances the discriminative ability and robustness of convolutional features for scene classification by adding aggregated features, and represents the spatial relationship between geographic entities and scenes with the CapsNet. It thus overcomes drawbacks common to existing CNN-based scene classification methods for high-resolution remote sensing images, namely that the extracted representative features of scene images are insufficient and the spatial features of geographical objects are insufficiently considered. The proposed method is tested on two public high-resolution remote sensing image scene classification datasets (UC Merced Land-Use and NWPU-RESISC45). Experimental results show that the classification accuracy of FACNNCN is better than those of the comparison methods.
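The CapsNet branch mentioned above relies on the standard capsule "squash" non-linearity, which shrinks short vectors toward zero and caps long vectors near unit length so that a capsule's magnitude can encode existence probability while its direction encodes pose. A minimal sketch of that activation (generic CapsNet machinery, not the authors' code):

```python
import numpy as np

def squash(s, eps=1e-9):
    # Capsule activation: v = (|s|^2 / (1 + |s|^2)) * (s / |s|).
    # Output magnitude lies in [0, 1) and preserves the direction of s.
    norm_sq = np.sum(s ** 2)
    norm = np.sqrt(norm_sq) + eps
    return (norm_sq / (1.0 + norm_sq)) * (s / norm)
```

For example, an input of length 5 is squashed to length 25/26 ≈ 0.96, while a near-zero input stays near zero.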
    108  Object Detection Algorithm for UAV Maritime Rescue Based on Dynamic Progressive Fusion
    HUANG Lve YU Xiaowei YAN Huabiao MAO Yuting
    2025, 40(2):334-348. DOI: 10.16337/j.1004-9037.2025.02.005
    [Abstract](655) [HTML](1064) [PDF 3.32 M](521)
    Abstract:
    Unmanned aerial vehicle (UAV) object detection plays a crucial role in maritime rescue missions. However, the varying perspectives and altitudes inherent in UAV aerial photography lead to multi-scale variations in targets such as individuals and vessels, and the glare from sunlight reflecting off the sea surface can cause false detections. To address these challenges and meet the lightweight requirements of real-time object detection on UAVs, this paper proposes a lightweight UAV maritime rescue object detection algorithm based on dynamic progressive fusion (DPF-YOLO), using YOLOv8n as the baseline network. Firstly, we introduce a lightweight redundant information extraction module (RIEM) that reduces redundant information in feature maps and highlights key features, mitigating false detections caused by glare. Secondly, we propose a dynamic multi-scale feature extraction module (DMFEM) that dynamically adjusts the receptive field to accommodate objects of varying scales, enhancing multi-scale feature representation. Finally, integrating the DMFEM module, we develop a dynamic progressive fusion network (DPFNet), which employs a progressive fusion structure to reduce the semantic differences between non-adjacent layers carrying objects of different scales, thereby improving multi-scale feature fusion. DPF-YOLO adopts a P2, P3 and P4 detection-layer structure to match the object sizes in maritime rescue scenarios, enrich multi-scale information, and strengthen feature extraction for small objects. Experimental results on the SeaDronesSee v2 dataset demonstrate that DPF-YOLO achieves a detection accuracy of mAP0.5 = 72.2% with only 1.19 M parameters. Compared with the baseline network YOLOv8n, DPF-YOLO reduces the number of parameters by 60.5%, increases the recall rate by 12.4%, and improves precision by 8.2%. Generalization experiments on the VisDrone dataset further demonstrate that DPF-YOLO possesses excellent generalization capability.
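The progressive fusion idea, as described, combines only adjacent pyramid levels at each step, so features from non-adjacent layers never merge directly. A toy sketch under that reading, for a P2/P3/P4 pyramid; nearest-neighbour upsampling and element-wise addition are simplifying assumptions, not the paper's fusion operator:

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def progressive_fuse(pyramid):
    """Fuse a fine-to-coarse pyramid [P2, P3, P4] one adjacent pair at a time.

    pyramid: list of (C, H, W) arrays, finest (P2) first, each level half the
    resolution of the previous one.
    """
    fused = pyramid[-1]                      # start at the coarsest level (P4)
    for finer in reversed(pyramid[:-1]):     # fuse into P3, then into P2
        fused = finer + upsample2x(fused)    # adjacent-level fusion only
    return fused
```

Because each fusion step bridges only one octave of scale, the semantic gap crossed at each step stays small, which is the motivation stated in the abstract.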
    109  Infrared Small Target Detection Based on Low-Rank Tensor Subspace Learning
    WANG Yan HU Hongbo PENG Zhenming
    2025, 40(2):349-364. DOI: 10.16337/j.1004-9037.2025.02.006
    [Abstract](526) [HTML](447) [PDF 5.23 M](449)
    Abstract:
    Infrared target detection systems are one of the effective technical means for reliably detecting and identifying high-value targets under background radiation and other interferences, and they are widely used in various fields. Infrared weak target detection, as an important part of such systems, remains a challenging core technology. This paper proposes a method based on low-rank tensor subspace learning that preserves the structural integrity of the infrared image while considering the consistency of sequences in the spatio-temporal domain. A spatio-temporal tensor block model is obtained through a spatio-temporal sliding window, and an infrared tensor dictionary model is constructed for different scenes using a multi-subspace learning strategy. Finally, an optimization algorithm solves the proposed infrared tensor objective function to obtain the low-rank background and sparse target tensors, and the infrared weak targets of interest are detected by reconstructing the image. Experimental results show that the method outperforms other existing detection algorithms in complex-background environments with high-reflection-induced false alarms and combined strong interference.
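The low-rank background / sparse target separation at the heart of this family of methods can be illustrated with a matrix (rather than tensor) analogue: alternating singular value thresholding for the background and soft thresholding for the targets. This is a generic RPCA-style sketch with illustrative parameter values, not the paper's tensor dictionary model:

```python
import numpy as np

def svt(M, tau):
    # Singular value thresholding: the proximal operator of the nuclear norm.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft(M, tau):
    # Soft thresholding: the proximal operator of the l1 norm.
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def separate(D, tau=1.0, lam=0.5, n_iter=30):
    """Split image matrix D into low-rank background L and sparse targets S."""
    L = np.zeros_like(D)
    S = np.zeros_like(D)
    for _ in range(n_iter):
        L = svt(D - S, tau)   # update background given current targets
        S = soft(D - L, lam)  # update targets given current background
    return L, S
```

After convergence, small bright targets that break the low-rank structure of the background end up in the sparse component S, which is where detection is performed.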
    110  Completely Unsupervised Person Re-identification Based on Camera Cluster Contrast Learning
    TIAN Qing ZHOU Zixiao
    2025, 40(1):207-216. DOI: 10.16337/j.1004-9037.2025.01.016
    [Abstract](430) [HTML](402) [PDF 1.08 M](485)
    Abstract:
    Recent unsupervised person re-identification studies have used clustering and memory dictionaries to generate pseudo labels for training. However, these studies ignore that person re-identification datasets are collected by different cameras, so the distribution difference between cameras is large, and a larger camera variance leads to a decrease in model accuracy. Therefore, camera cluster contrast learning is proposed, comprising a cluster contrast loss and a camera contrast loss. The cluster contrast loss realizes consistent updates of the memory dictionary and reduces the influence of noisy labels on the model. The camera contrast loss reduces camera variance by building a camera cluster center for each cluster under each camera, narrowing the distance between camera cluster centers of the same cluster and pushing camera cluster centers of different clusters farther apart. Through camera cluster contrast learning, the impact of camera variance and noisy labels on the model is reduced, and person re-identification performance is improved. Camera cluster contrast learning shows excellent results on four public datasets, effectively alleviating the impact of camera variance on the model.
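The camera contrast loss described above can be sketched as an InfoNCE-style objective over camera-specific cluster centers: for a query feature, centers of the same cluster under other cameras act as positives, and the softmax denominator over all centers pushes different-cluster centers away. The function name and the temperature value are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def camera_contrast_loss(query, centers, pos_idx, temp=0.5):
    """query: (d,) L2-normalized feature.
    centers: (n, d) L2-normalized camera cluster centers.
    pos_idx: rows of `centers` from the same cluster under other cameras.
    """
    sims = centers @ query / temp
    # Log-softmax over all camera cluster centers.
    log_prob = sims - np.log(np.sum(np.exp(sims)))
    # Minimizing this pulls the query toward same-cluster centers while the
    # shared denominator pushes other-cluster centers farther away.
    return -np.mean(log_prob[pos_idx])
```

The loss is lower when the query already aligns with its same-cluster centers than when the designated positives point elsewhere, which is the pull/push behaviour the abstract describes.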
    111  Learnable Mask and Position Encoding Based Occluded Pedestrian Re-identification
    YANG Zhenzhen CHEN Yanan YANG Yongpeng WU Xinyi
    2025, 40(1):217-229. DOI: 10.16337/j.1004-9037.2025.01.017
    [Abstract](677) [HTML](576) [PDF 3.33 M](540)
    Abstract:
    Although the pedestrian re-identification task has made significant progress, occlusion caused by various obstacles remains a challenge in practical application scenarios. To extract more effective features from occluded pedestrians, a learnable mask and position encoding (LMPE) method is proposed. Firstly, a learnable dual attention mask generator (LDAMG) is introduced to adapt to different occlusion patterns, significantly improving the re-identification accuracy of occluded pedestrians. It makes the network more flexible and better adapted to diverse occlusion situations, and the network learns contextual information through the mask, further improving scene understanding. In addition, we introduce an occlusion-aware position encoding fusion (OAPEF) module to address the loss of position information in the Transformer. The module fuses position encodings of different regions, giving the network stronger expressive ability; integrating position encodings in all directions enables the network to understand the spatial correlation between pedestrians more accurately and improves its ability to adapt to occlusions. Finally, simulation results demonstrate that LMPE performs well on the Occluded-Duke and Occluded-ReID occluded datasets and the Market-1501 and DukeMTMC-ReID unoccluded datasets, confirming the effectiveness and superiority of the proposed method.
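The role of a learnable occlusion mask can be illustrated with a minimal sketch: a per-patch visibility weight (here a sigmoid over learned logits, an assumed form) down-weights occluded patch tokens before they are pooled into the pedestrian descriptor. This illustrates the idea only, not the LDAMG module itself:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def masked_pool(patch_feats, mask_logits):
    """patch_feats: (n_patches, d) token features from the backbone.
    mask_logits: (n_patches,) learned visibility scores.
    """
    w = sigmoid(mask_logits)          # soft visibility mask in (0, 1)
    # Weighted average: occluded patches (low w) barely affect the descriptor.
    return (w[:, None] * patch_feats).sum(axis=0) / (w.sum() + 1e-6)
```

When one patch's logit is strongly positive and another's strongly negative, the pooled descriptor is dominated by the visible patch, which is the behaviour a learned occlusion mask is meant to produce.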
    112  Research Review on Low-Altitude Visual Datasets for Unmanned Aerial Vehicles
    SUN Yiming ZHAO Kejia WANG Shuo CHEN Zhenguo RUAN Yuan YE Zifan CHEN Xingrui LI Xin CHU Ruilin SONG Shengmin HU Yitian GUO Zhoupeng WANG Sen HU Qinghua ZHU Pengfei
    2025, 40(2):274-302. DOI: 10.16337/j.1004-9037.2025.02.002
    [Abstract](1453) [HTML](3313) [PDF 8.13 M](1251)
    Abstract:
    Driven by the cross-domain synergy of unmanned aerial vehicle (UAV) technology and artificial intelligence, and supported by national low-altitude economic policies and pilot reforms for airspace opening, low-altitude visual perception has come to play a significant role in smart cities, inspection, rescue, and other applications. High-quality low-altitude visual data serve as the crucial foundational resource in the field of low-altitude intelligent perception, and the release and application of public datasets have been pivotal in advancing low-altitude perception technologies. Although numerous datasets for low-altitude visual perception have been proposed, systematic organization and analysis of these datasets remain inadequate. To address this gap, this paper conducts a comprehensive survey of publicly released low-altitude UAV vision-related datasets over the past 11 years, categorizes and explores them according to data characteristics and application scenarios, and selects representative datasets for detailed analysis. The review covers multiple domains, including single-UAV perception, multi-UAV cooperative perception, multi-task perception, multi-source perception, complex environmental characteristics, and UAV embodied intelligence. To facilitate researchers' understanding and use, the paper summarizes the basic information of all datasets in graphical form and systematically analyzes their development trends along two main dimensions: (1) metadata analysis, including dataset size distribution, scenario distribution, and supported task types; and (2) basic information analysis, involving total image and video counts, target category distribution, and annotation instance numbers. The analysis demonstrates the significant progress in the quality of low-altitude visual perception datasets. Meanwhile, it points out that, despite the initial formation of a systematic framework for low-altitude data, issues remain, including the imbalance between cost and efficiency in low-altitude data annotation, insufficient reusability of multi-source data, inadequate coverage of extreme environments, and fragmented embodied intelligence data. Finally, the paper proposes outlooks for the future development of low-altitude datasets.
    113  Composite-Cost-Based Fast Light-Field 3D Imaging Method for Handling Spatial Occlusions
    LI Anhu GONG Zhenyu ZHAO Xin
    2025, 40(2):365-373. DOI: 10.16337/j.1004-9037.2025.02.007
    [Abstract](396) [HTML](578) [PDF 2.04 M](456)
    Abstract:
    Light field cameras, with their multi-dimensional imaging capabilities and minimal resource requirements, expand the boundaries of imaging applications in unstructured air-ground-sea environments. Light field imaging, however, is susceptible to occlusion and noise, which can produce unreliable depth estimates. This paper proposes a fast light-field depth estimation method oriented to spatial occlusions, provides an in-depth analysis of the main factors affecting depth estimation accuracy, and establishes an optimal fast light-field filtering architecture for different spatial occlusion modes. A highly integrated composite cost is then constructed from single-bit features of pixel points to refine the depth image and optimize occluded regions. Experiments demonstrate that the computational efficiency of this method is significantly better than that of Markov random field based methods while reducing the mean squared error (MSE) by 51.3%; the reliability of depth estimation is improved at a lower operational cost, and the method is expected to provide strong support for applying light-field imaging technology in complex scenes.
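The "single-bit features of pixel points" mentioned above are in the spirit of a census-style transform: each pixel in a window is encoded as one bit (brighter or darker than the window centre), and windows are compared with a Hamming distance, which is robust to radiometric changes. The following is a generic census sketch under that assumption, not the paper's composite cost:

```python
import numpy as np

def census_bits(window):
    # One bit per pixel: 1 if brighter than the window centre, else 0.
    h, w = window.shape
    centre = window[h // 2, w // 2]
    return (window.ravel() > centre).astype(np.uint8)

def hamming_cost(bits_a, bits_b):
    # Matching cost: number of differing bits between two windows.
    return int(np.sum(bits_a != bits_b))
```

Because each comparison yields a single bit, such descriptors are cheap to store and compare, which fits the method's emphasis on low operational cost.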
    114  Multi-view 3D Reconstruction Network Based on Dilated Attention and Depth Optimal Correction
    XU Lei LEI Youyuan ZHU Jun ZHOU Jie SHAO Genfu ZHANG Jiaming
    2025, 40(4):1023-1034. DOI: 10.16337/j.1004-9037.2025.04.015
    [Abstract](257) [HTML](293) [PDF 2.20 M](465)
    Abstract:
    To address the memory consumption issue of the MVSNet reconstruction network, CVP-MVSNet and CasMVSNet reduce memory usage when processing high-resolution images and improve the accuracy of reconstructed point clouds. However, both networks still exhibit significant errors in point cloud completeness. To address this issue, this paper proposes DA-MVSNet, a multi-view 3D reconstruction network based on dilated attention and depth optimal correction. DA-MVSNet uses CasMVSNet as the baseline network and adds a feature enhancement network that integrates a parallel dilated convolution and attention module and incorporates depth-wise separable convolutions. This enhancement strengthens the network's ability to capture global features of the input views, improving point cloud completeness. To further improve the accuracy of the output depth maps and prevent the feature enhancement network from extracting irrelevant background information, which would degrade the accuracy of the reconstructed point cloud, an optimization correction mechanism based on nonlinear least squares is introduced at the output stage of the network. Results show that DA-MVSNet reduces the accuracy and completeness errors of the reconstructed point cloud by 2.5% and 4.7%, respectively, on the indoor DTU dataset, achieving better overall performance. Despite the additional feature enhancement network and correction mechanism, the memory and time consumption of DA-MVSNet are not much higher than those of CVP-MVSNet and CasMVSNet.
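The depth-wise separable convolutions mentioned above keep the added feature enhancement network cheap by splitting a k×k convolution into a per-channel spatial filter plus a 1×1 pointwise mix. A small sketch of the parameter arithmetic (bias terms omitted):

```python
def standard_conv_params(c_in, c_out, k):
    # Full convolution: every output channel filters every input channel.
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    # Depthwise k x k (one filter per input channel) + 1 x 1 pointwise mix.
    return c_in * k * k + c_in * c_out
```

For a 3×3 layer with 64 input and 64 output channels this is 36 864 versus 4 672 parameters, roughly an 8× reduction, which is why the extra module adds little memory and time cost.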