ZHANG Feifei , ZHANG Jianqing , QU Sijia , ZHOU Wanting
2023, 38(1):1-20. DOI: 10.16337/j.1004-9037.2023.01.001
Abstract: With the rapid development of social media and human-computer interaction, the volume of multimedia data such as video, images and text has grown tremendously, and researchers have turned their attention to multi-modal intelligence. Visual question answering and reasoning is an essential and fundamental research topic in multi-modal and artificial intelligence, and some of its research results have been successfully applied in human-computer interaction, intelligent medical care, and unmanned driving. This paper gives a comprehensive overview of algorithms for visual question answering and reasoning, and classifies and analyzes the existing methods. First, we define the visual question answering and reasoning task and briefly describe its main challenges. Then, we summarize existing methods that focus on attention mechanisms, graph networks, model pretraining, external knowledge, and explainable reasoning mechanisms. After that, we introduce the common visual question answering and reasoning benchmarks and discuss the existing methods on these benchmarks in detail. Finally, we discuss future directions for the task.
SUN Han , LIU Yishan , LIN Yuhan
2023, 38(1):21-50. DOI: 10.16337/j.1004-9037.2023.01.002
Abstract: Salient object detection simulates the human visual system to find the targets that most attract visual attention, and it has been widely used in computer vision tasks such as image understanding, semantic segmentation, and object tracking. With the rapid development of deep learning, salient object detection research has made great breakthroughs. This paper presents a comprehensive and systematic survey of the past five years of salient object detection based on RGB images, RGB-D/T (Depth/Thermal) images, and light field images. First, the task characteristics and research difficulties of the three branches are analyzed. Then, the technical route of each branch is described and its advantages and disadvantages are analyzed. The mainstream datasets and common performance evaluation metrics of the three branches are also introduced. Finally, possible future research trends are discussed.
LIU Zitong , WANG Wei , DING Guoru , WU Qihui
2023, 38(1):51-62. DOI: 10.16337/j.1004-9037.2023.01.003
Abstract: Quickly and accurately identifying the key nodes of a complex communication network with a known topology has become a research focus in recent years. In this paper, we first establish a system model of weighted networks for key node identification. Then, a key node identification method based on weighted collective influence is proposed. In this method, collective influence is used to quantify the information transmission ability of nodes, and edge weighting is incorporated to represent how critical each node of a weighted network is. Finally, five typical complex network models are simulated with random and non-random weights, respectively. Simulation results show that the proposed method outperforms the original collective influence algorithm and is not sensitive to the sphere radius parameter.
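As an illustration of the quantity this abstract builds on, the sketch below computes the standard unweighted collective influence, CI_l(i) = (k_i - 1) * sum of (k_j - 1) over the nodes j at hop distance exactly l from i. The abstract does not give the paper's weighted formula, so `weighted_collective_influence` is only a hypothetical variant that substitutes node strength (sum of incident edge weights) for degree. The function names and the toy graph are assumptions made for this example.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected weighted adjacency map: adj[u][v] = weight."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj


def ball_frontier(adj, source, radius):
    """Return the nodes whose hop distance from `source` is exactly `radius`."""
    dist = {source: 0}
    queue = deque([source])
    frontier = []
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            frontier.append(u)
            continue  # do not expand past the surface of the ball
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return frontier


def collective_influence(adj, node, radius=2):
    """Unweighted CI: (k_i - 1) * sum over frontier nodes j of (k_j - 1)."""
    degree = lambda n: len(adj[n])
    frontier = ball_frontier(adj, node, radius)
    return (degree(node) - 1) * sum(degree(j) - 1 for j in frontier)


def weighted_collective_influence(adj, node, radius=2):
    """Hypothetical weighted variant (NOT taken from the paper):
    node strength replaces node degree in the CI formula."""
    strength = lambda n: sum(adj[n].values())
    frontier = ball_frontier(adj, node, radius)
    return (strength(node) - 1) * sum(strength(j) - 1 for j in frontier)


# Toy weighted network, invented for illustration.
toy = build_graph([("a", "b", 1), ("b", "c", 2), ("c", "d", 1),
                   ("c", "e", 3), ("e", "f", 1)])
ranking = sorted(toy, key=lambda n: weighted_collective_influence(toy, n, 1),
                 reverse=True)
```

Ranking nodes by this score and taking the top entries gives candidate key nodes; the `radius` argument corresponds to the sphere radius parameter mentioned in the abstract.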
GENG Pinyong , CAO Yewen , ZHAO Xiaolei , LI Zhenxing , ZHANG Xinbin
2023, 38(1):63-73. DOI: 10.16337/j.1004-9037.2023.01.004
Abstract: A shortwave wideband specific-signal detection algorithm based on frequency-sensitive attention is proposed to improve the accuracy of specific signal detection and recognition in complex shortwave electromagnetic environments. Shortwave specific signals in a spectrogram are correlated along the time direction and local along the frequency direction, so a frequency-sensitive attention mechanism with a long, narrow receptive field is designed. On this basis, an end-to-end detector, the frequency-sensitive signal detector (FSSDet), is constructed: the feature map is segmented into strip blocks along the time axis and self-attention is computed within each strip block, which captures long-distance dependence along the time axis while limiting the sensing range along the frequency axis. Given the spectrogram of a shortwave wideband signal as input, FSSDet directly outputs the modulation types of the specific signals it contains, together with key parameters such as start and end time, center frequency, and bandwidth. Experiments are carried out on a simulation dataset of 47 880 samples from eight classes. The mean average precision (mAP) of the proposed method reaches 98.5 at signal-to-noise ratios (SNRs) above 0 dB and remains above 72.5 when the SNR is as low as -10 dB. The results show that the proposed method detects and recognizes shortwave specific signals with high accuracy and is robust at low SNR.
Yue Heng , Zhang Xiaofei , Shi Sha
2023, 38(1):74-84. DOI: 10.16337/j.1004-9037.2023.01.005
Abstract: Power quality has long attracted attention. As the amount of power electronic equipment in power systems grows, so do the harmonics it generates, making harmonics a persistent concern. This paper proposes a frequency estimation algorithm for power system harmonics and inter-harmonics that combines compressed sensing theory with the parallel factor model. First, the data at the signal receiving end are obtained, Euler’s formula is used to convert the sinusoidal signal into a spatial signal, and the multi-delay outputs are arranged into a parallel factor model. Second, the three slices of the model are compressed and decomposed by parallel factor (PARAFAC) analysis using the trilinear alternating least squares algorithm. Finally, the resulting data are sparsely reconstructed to obtain automatically paired frequency estimates. Compared with the traditional parallel factor algorithm, the added compression step reduces computation and storage requirements. The frequency estimation performance of the proposed algorithm is very close to that of the traditional PARAFAC method and better than that of the estimation of signal parameters via rotational invariance techniques (ESPRIT) method.
2023, 38(1):85-92. DOI: 10.16337/j.1004-9037.2023.01.006
Abstract: The self-organizing map (SOM) network is a classic unsupervised learning method with self-organizing and online learning functions. Because it is simple and practical, many SOM variants have emerged to address various problems. However, these works generally build networks from deterministic neurons and ignore the uncertainty information implicit in the data. As a result, the outputs of these models lack the interpretability that confidence estimates would provide, which indicates that SOM neurons characterize uncertainty poorly. This article proposes a new SOM variant, the Gaussian neuron SOM network (GNSOM). Its neuron nodes are no longer deterministic but are modeled as Gaussian neurons that follow a Gaussian distribution, which equips SOM with the ability to express the uncertainty of the data. In the implementation, the input data are also Gaussianized, and the Jensen-Shannon (JS) divergence replaces the Euclidean distance as the similarity matching metric in GNSOM learning, thereby yielding the uncertainty representation. The experimental results show that GNSOM trains better and can reflect the uncertainty of the data through the covariance matrix of each neuron node. Since this Gaussianization of neurons is independent of SOM itself, it can be extended to other neuron models.
Wang Haoyu , Jeon Eunah , Zhang Weiqiang , Li Ke , Huang Yukai
2023, 38(1):93-100. DOI: 10.16337/j.1004-9037.2023.01.007
Abstract: An accurate speech recognition system usually relies on a large amount of training data with manual transcriptions, which is a barrier to recognizing many low-resource languages. Acoustic model sharing, which exploits the similarity between a rich-resource language and a low-resource language, offers a new way to solve this problem and helps to build an automatic speech recognition (ASR) system without any training data in the given low-resource language. This paper extends the method to Korean speech recognition. Specifically, we train an acoustic model on Mandarin data and define a set of mapping rules between Mandarin and Korean phonemes. A character error rate (CER) of 27.33% is achieved on the Zeroth Korean test set without using any Korean speech data. Moreover, we compare source-to-target and target-to-source phoneme mapping rules and show that the latter is more appropriate for acoustic model sharing.
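For readers unfamiliar with the metric reported above, the following sketch shows how a character error rate is conventionally computed: the Levenshtein edit distance between the reference and hypothesis character sequences, divided by the reference length. Whether whitespace counts as a character varies between toolkits; dropping it here is an assumption, and the abstract does not state which convention the paper uses.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance over characters / number of reference characters.

    Whitespace is removed before scoring (an assumption of this sketch).
    """
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# One substituted syllable out of five reference syllables.
example_cer = character_error_rate("안녕하세요", "안녕하세오")
```

Because insertions are counted, the CER can exceed 100% when the hypothesis is much longer than the reference.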
2023, 38(1):101-110. DOI: 10.16337/j.1004-9037.2023.01.008
Abstract: Built on unsupervised pre-training, wav2vec 2.0 has become a research hotspot because of its state-of-the-art performance in many low-resource languages. In this paper, Vietnamese continuous speech recognition is carried out on the basis of the pre-trained model. Phonetic information is integrated into acoustic modeling based on the connectionist temporal classification (CTC) loss function, and phones and position-dependent phones are selected as the basic modeling units. To balance the number of modeling units against the granularity of the model, a byte-pair encoding (BPE) algorithm is used to generate phone-based subwords, which integrates contextual information into the acoustic modeling process. Experiments are carried out on the low-resource Vietnamese development set of NIST’s BABEL task, and the proposed algorithm significantly improves on the wav2vec 2.0 baseline system, reducing the word error rate from 37.3% to 29.4%.
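The idea of generating phone-based subwords with BPE can be sketched as follows: starting from phone sequences, the most frequent adjacent pair of units is merged repeatedly, so that frequent phone contexts become single modeling units. This is a generic BPE learner written for illustration, not the paper's implementation. The phone symbols are toy placeholders rather than real Vietnamese phones, each pronunciation is counted once (no word frequencies), and ties are broken lexicographically, all of which are assumptions.

```python
from collections import Counter


def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` in `seq` by one unit."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + "+" + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_phone_bpe(pronunciations, num_merges):
    """Learn up to `num_merges` BPE merges over phone sequences.

    Returns (merges, segmented), where `merges` is the ordered list of merged
    pairs and `segmented` holds the input sequences rewritten as subwords.
    """
    seqs = [list(p) for p in pronunciations]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in seqs:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break  # every sequence is already a single unit
        # Most frequent pair; ties broken lexicographically for determinism.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        seqs = [merge_pair(seq, best) for seq in seqs]
    return merges, seqs


# Toy pronunciations, invented for illustration.
toy_lexicon = [["t", "a", "n"], ["t", "a", "m"], ["b", "a", "n"]]
merges, segmented = learn_phone_bpe(toy_lexicon, num_merges=2)
```

The number of merges controls the trade-off mentioned in the abstract: more merges give larger, more context-specific units but a bigger output vocabulary for the CTC model.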
YANG Zixiu , JIN Yun , MA Yong , DAI Yanyan , YU Jiajia , GU Yu
2023, 38(1):111-120. DOI: 10.16337/j.1004-9037.2023.01.009
Abstract: The training and testing data for speech emotion recognition often come from different corpora. In this case, the recognition performance of the model decreases greatly because of the domain mismatch. To address this problem, we present a new composition method that uses a graph convolutional network to represent the topological structure between the source and target databases for cross-corpus speech emotion recognition. In addition, to address the low accuracy of a single feature in emotion recognition, a novel feature fusion method is proposed. First, we extract acoustic features with OpenSMILE and then extract deep features with the graph convolutional neural network. As the convolutional layers proceed, nodes transmit feature information to other nodes, so that the deep features contain clearer feature information and more detailed semantic information. Finally, we fuse the shallow and deep features. Two classification experiments are carried out. With the eNTERFACE corpus for training and the Berlin corpus for testing, the recognition rate is 59.375%. With the Berlin corpus for training and the eNTERFACE corpus for testing, the recognition rate is 36.111%. These results are higher than the best results of the baseline system and of the methods in the references, which demonstrates the effectiveness of the proposed method.
Sun Minghao , Wang Hongyuan , Wu Linyu , Zhang Ji , Zhou Qunying
2023, 38(1):121-131. DOI: 10.16337/j.1004-9037.2023.01.010
Abstract: Attending to both the global contour and the local details of a person is very important for person re-identification. To extract these more representative features, a person re-identification network based on feature pyramid branches and non-local attention modules is proposed to extract global and local representations of a person. First, the method introduces a lightweight feature pyramid branch structure that extracts features from different network layers and aggregates them into a two-way pyramid structure. Second, to further improve re-identification accuracy, a non-local attention module is used to extract global features; it captures the global information of a person while also attending to local details, so that the final fused features are more representative. Finally, the features of different layers are fused, and a joint loss function strategy is used to train the network, which significantly improves the performance of the backbone network. Extensive experiments on four public person re-identification datasets, MSMT17, Market1501, DukeMTMC-ReID and PersonX, show that the proposed method is competitive with some advanced person re-identification methods.
GAO Zhijun , GU Qiaoyu , CHEN Ping , HAN Zhonghua
2023, 38(1):132-140. DOI: 10.16337/j.1004-9037.2023.01.011
Abstract: To address insufficient spatial and temporal features in dangerous behavior recognition, this paper improves the traditional dual-stream convolution model and proposes a new dual-stream dangerous behavior recognition model based on CNN-LSTM. In this model, a CNN and an LSTM network are connected in parallel. The CNN serves as the spatial stream: the spatial motion and posture information of the human skeleton is divided into static and dynamic features, which are fused as the output of the spatial stream. To strengthen the extraction of temporal features of the human skeleton, an improved temporal sliding LSTM network is used in the temporal stream. Finally, the two branches are fused in time and space, and dangerous actions are classified and identified by Softmax. Experimental results on the NTU RGB+D and Kinetics datasets show that the average cross-view (CV) accuracy of the improved model is 92.5% and the average cross-subject (CS) accuracy is 87.9%. The proposed method outperforms the model before improvement and other methods. It can effectively recognize dangerous human actions and discriminates well between ambiguous actions.
CAO Siying , ZHANG Xuan , PU Tian , PENG Zhenming
2023, 38(1):141-149. DOI: 10.16337/j.1004-9037.2023.01.012
Abstract: Low-quality images captured under harsh atmospheric conditions such as colored fog, smoke and dust are characterized by low visibility and color cast, which hinder human observation and computer vision applications. Current enhancement algorithms for such images usually ignore the influence of the scene-to-camera distance on the color cast. To better restore color while enhancing visibility, a model relating visibility reduction, color cast and distance is proposed together with a method for solving it. First, the distance is estimated from the local brightness of the image, and the color cast matrix of the image is estimated from the distance. Then, an image with restored visibility and color is obtained by solving the degradation model. Finally, the restored image is fused with a contrast limited adaptive histogram equalization (CLAHE)-enhanced image by distance weighting for further detail enhancement. Experiments show that, compared with similar methods, the proposed method achieves high scores on image quality evaluation metrics and significantly better color recovery.
Xuan Yang , Lyu Hongqiang , An Wei , Liu Xuejun
2023, 38(1):150-161. DOI: 10.16337/j.1004-9037.2023.01.013
Abstract: Vortices play a crucial role in the formation and maintenance of various flow structures in fluid motion, and identifying and detecting them helps in understanding flow laws. Traditional vortex detection methods have many shortcomings, such as imprecise definitions, heavy dependence on empirical thresholds and poor generalization, which make vortex detection challenging. In this paper, a vortex detection model based on an object detection algorithm is proposed from the perspective of computer vision. Because the original object detection model has unsatisfactory accuracy on slender vortices with extreme aspect ratios, this paper analyzes the data characteristics of two different types of vortices and proposes a feature adaptive module based on the deformable convolutional network (DCN) and a slender sample mining method based on an improved loss function. Cylinder wake vortex and submarine tail vortex datasets are used to verify the proposed model. Experimental results show that the improved model raises detection accuracy significantly, especially for slender vortices, and effectively balances detection performance across vortex types.
SHA Mengzhou , SHEN Tao , ZENG Kai , MA Qian , ZENG Wenjian
2023, 38(1):162-173. DOI: 10.16337/j.1004-9037.2023.01.014
Abstract: In unmanned driving scenarios, pedestrians appear at multiple scales and often at small scales, which increases the missed detection rate and lowers detection accuracy. To address this problem, this paper proposes a pedestrian detection method that fuses deep and shallow features and cascades a dynamic selection mechanism. First, on the basis of YOLO v3-tiny, we improve the feature extraction part with a densely connected convolutional neural network and fuse the deep and shallow features of pedestrians to enhance the network’s ability to recognize pedestrians. Second, we cascade an attention module with a dynamic selection mechanism onto the improved backbone network to make the detection network more adaptable to dynamic changes in pedestrian scale. Finally, we conduct experiments on the BDD 100K dataset and the Caltech pedestrian dataset. While maintaining real-time performance (25 ms per image), the method reduces the pedestrian missed detection rate by 11.4% and improves the average detection accuracy by 11.7% on the BDD 100K dataset, and reduces the missed detection rate by 10.1% and improves the average detection accuracy by 6.7% on the Caltech dataset, which makes it suitable for pedestrian detection in unmanned driving.
XIE Conghua , LUO Defeng , FANG Yujie
2023, 38(1):174-185. DOI: 10.16337/j.1004-9037.2023.01.015
Abstract: Shot boundary detection (SBD) of lecture videos is of great significance to teaching evaluation (TE). This paper proposes a new SBD method to address three problems: visual changes in lecture videos are subtle, boundary information alone is insufficient, and the detection results of current methods are of little use for TE. The proposed method is based on visual and textual representation learning with an attention mechanism. First, a hierarchical vision transformer (HViT) model is proposed to learn visual features from regions of interest (ROI) such as the screen projection, the teacher and the students. Second, a hierarchical text transformer (HTT) model is proposed to learn the features relevant to teaching evaluation from the speech and screen text. Finally, the loss function is constructed jointly from the binary cross entropies of shot classification and boundary detection. Experimental results on the CLShots dataset show that the average precision, recall, F1-score and mean intersection over union of our method are higher by 23.3%, 22.4%, 22% and 35.7% than those of the state-of-the-art SBLV method, and higher by 13.8%, 14.5%, 14.3% and 21.3% than those of TransNet V2.
ZHANG Qinming , HUANG Danfei , LIU Zhiying , ZHONG Aiqi
2023, 38(1):186-192. DOI: 10.16337/j.1004-9037.2023.01.016
Abstract: Accurate noise estimation is difficult in texture-rich hyperspectral images. This paper describes a spectral decorrelation method based on the spatial regularity and spectral correlation of hyperspectral images. Homogeneous region division is a key step in many noise estimation methods, and a precise division can effectively improve the accuracy of noise estimation. To this end, the simple linear iterative clustering algorithm is combined with spectral-spatial similarity to segment hyperspectral images into image blocks with similar local structure, which preserves homogeneous features. Spectral information divergence and spectral angle are combined as the spectral distance measure to improve discrimination between spectra. Spectral correlations are removed within homogeneous regions by multiple linear regression, and the noise levels are obtained from the residual images. Various levels of noise are added to simulated images of varying ground complexity, and the effectiveness and stability of the method are verified by comparison with a variety of methods. Finally, the proposed method is successfully applied to noise level estimation on the Urban dataset, where it accurately identifies the bands heavily polluted by noise.
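The two spectral measures named in this abstract have standard definitions, sketched below. The spectral angle is the angle between two spectra viewed as vectors, and the spectral information divergence is the symmetric Kullback-Leibler divergence between the spectra after each is normalized to sum to one. The abstract does not say how the paper combines them; `sid_sam_distance` uses the product SID * tan(SAM), one mixture found in the hyperspectral literature, purely as an assumed example. Spectra are taken to be non-negative lists of equal length, and the small `eps` guard is also an assumption.

```python
import math


def spectral_angle(x, y):
    """Spectral angle (radians) between two spectra."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Clamp to [-1, 1] to absorb floating-point rounding before acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.acos(cos_theta)


def spectral_information_divergence(x, y, eps=1e-12):
    """Symmetric KL divergence between the sum-normalized spectra."""
    sum_x, sum_y = sum(x), sum(y)
    p = [a / sum_x + eps for a in x]  # eps guards against log(0)
    q = [b / sum_y + eps for b in y]
    return sum(pi * math.log(pi / qi) + qi * math.log(qi / pi)
               for pi, qi in zip(p, q))


def sid_sam_distance(x, y):
    """Assumed combination: SID * tan(SAM). Not confirmed by the abstract."""
    return spectral_information_divergence(x, y) * math.tan(spectral_angle(x, y))


# Two toy three-band spectra, invented for illustration.
spectrum_a = [1.0, 2.0, 3.0]
spectrum_b = [3.0, 2.0, 1.0]
mixed = sid_sam_distance(spectrum_a, spectrum_b)
```

Both measures are invariant to a positive scaling of either spectrum, so the combined distance responds to spectral shape rather than overall brightness. Note that tan(SAM) grows very large as two spectra approach orthogonality.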
2023, 38(1):193-208. DOI: 10.16337/j.1004-9037.2023.01.017
Abstract: Sensitivity encoding (SENSE) is a widely used parallel magnetic resonance imaging (MRI) reconstruction model, and many improved models have been proposed to raise its reconstruction performance. However, images reconstructed by these improved methods still contain many artifacts, and obtaining a clear image is especially difficult at higher acceleration factors. Therefore, this paper proposes an improved SENSE model based on nonlocal low-rank (NLR) constraints, named the NLR-SENSE model, which can effectively improve the quality of parallel MRI reconstructed images. We adopt the weighted nuclear norm as the rank surrogate function and use the alternating direction method of multipliers (ADMM) to solve the NLR-SENSE model. Simulation results show that, compared with several other parallel MRI reconstruction methods, the NLR-SENSE model performs better both in visual comparison and on three objective metrics.
2023, 38(1):209-219. DOI: 10.16337/j.1004-9037.2023.01.018
Abstract: In recent years, deep learning has shown its advantages in image captioning research. In deep learning models, the relationships between objects in an image play an important role in image representation. To better detect the visual relationships in an image, an image caption generation model (YOLOv4-GCN-GRU, YGG) is constructed based on a graph neural network and a guidance vector. The model uses the spatial and semantic information of the objects detected in the image to build a graph, and uses a graph convolutional network (GCN) as the encoder to represent each region of the graph. During decoding, an additional guidance neural network is trained to generate a guidance vector that assists the decoder in automatically generating sentences. Comparative experiments on the MSCOCO image dataset show that the YGG model performs better, with the CIDEr-D score improved from 138.9% to 142.1%.
2023, 38(1):220-230. DOI: 10.16337/j.1004-9037.2023.01.019
Abstract: Pseudorange error is a key factor affecting the positioning accuracy of BeiDou satellite navigation receivers. This paper proposes a two-stage pseudorange error compensation method for BeiDou navigation receivers based on the pseudorange difference and the adaptive cubature Kalman filter (CKF). The pseudorange error is divided into the self error and the common error. First, the self error is compensated with the pseudorange difference method. Second, a measurement-noise-adaptive CKF is designed to estimate the state of the moving receiver system in order to compensate the common error. Experimental results show that the two-stage compensation method is slightly better than single-stage compensation under static conditions and reduces the localization error significantly more than single-stage compensation when the carrier is moving, and that the adaptive CKF algorithm adapts better to noise and interference than the standard CKF algorithm.
Mailing Address: 29 Yudao Street, Nanjing, China
Post Code: 210016 Fax: 025-84892742
Phone: 025-84892742 E-mail: sjcj@nuaa.edu.cn