• Volume 38, Issue 1, 2023 Table of Contents
    • Recent Advances in Visual Question Answering and Reasoning

      2023, 38(1):1-20. DOI: 10.16337/j.1004-9037.2023.01.001

      Abstract: With the rapid development of social media and human-computer interaction, the volume of multimedia data such as video, images and text has grown tremendously, and researchers have turned their attention to multi-modal intelligence. Visual question answering and reasoning is an essential and fundamental research topic in multi-modal intelligence and artificial intelligence, and some of its research results have been successfully applied in human-computer interaction, intelligent medical care, and autonomous driving. This paper presents a comprehensive overview of visual question answering and reasoning algorithms, and classifies and analyzes the existing methods. First, we define the visual question answering and reasoning task and briefly describe its main challenges. Then, we summarize existing methods built on attention mechanisms, graph networks, model pretraining, external knowledge, and explainable reasoning mechanisms. After that, we introduce the common visual question answering and reasoning benchmarks and discuss the performance of existing methods on them in detail. Finally, we discuss future directions for the task.

    • Deep Learning Based Salient Object Detection: A Survey

      2023, 38(1):21-50. DOI: 10.16337/j.1004-9037.2023.01.002

      Abstract: Salient object detection simulates the human visual system to find the targets that most attract visual attention, and it has been widely used in computer vision tasks such as image understanding, semantic segmentation, and object tracking. With the rapid development of deep learning, salient object detection research has made great breakthroughs. This paper presents a comprehensive and systematic survey of salient object detection over the past five years based on RGB images, RGB-D/T (depth/thermal) images, and light field images. First, the task characteristics and research difficulties of the three research branches are analyzed. Then the technical route of each branch is described, and its advantages and disadvantages are analyzed. The mainstream datasets and common performance evaluation metrics of the three branches are also introduced. Finally, possible future research trends are discussed.

    • A Key Node Identification Approach for Weighted Communication Networks

      2023, 38(1):51-62. DOI: 10.16337/j.1004-9037.2023.01.003

      Abstract: Quickly and accurately identifying the key nodes of a complex communication network with known topology has become a hot topic in recent years. In this paper, we first establish a system model of weighted networks for key node identification. Then, a key node identification method based on weighted collective influence is proposed. In this method, collective influence quantifies the information transmission ability of each node, and weighting is incorporated to represent how critical each node of the weighted network is. Finally, five typical complex network models are simulated with random and non-random weights, respectively. Simulation results show that the proposed method outperforms the original collective influence algorithm and is insensitive to the sphere radius parameter.
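
      The collective influence (CI) score can be sketched in a few lines. The sketch below implements the standard unweighted definition, CI_l(i) = (k_i - 1) * sum of (k_j - 1) over the nodes j at exactly l hops from i, where l is the sphere radius. The abstract does not state how the authors add weights, so `weighted_ci` uses a HYPOTHETICAL weighting that scales CI by each node's mean edge weight. It is an assumption for illustration only, not the paper's formula.

```python
from collections import deque


def build_adj(edges):
    """Undirected weighted adjacency: {node: {neighbour: weight}}."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj


def ball_boundary(adj, src, radius):
    """Nodes at hop distance exactly `radius` from src (boundary of Ball(src, radius))."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue  # stop expanding once the sphere radius is reached
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return [v for v, d in dist.items() if d == radius]


def collective_influence(adj, radius=2):
    """Unweighted CI: (k_i - 1) * sum of (k_j - 1) over boundary nodes j."""
    deg = {u: len(nbrs) for u, nbrs in adj.items()}
    return {i: (deg[i] - 1) * sum(deg[j] - 1 for j in ball_boundary(adj, i, radius))
            for i in adj}


def weighted_ci(adj, radius=2):
    """HYPOTHETICAL weighting: scale CI by the mean weight of the node's edges."""
    ci = collective_influence(adj, radius)
    return {i: ci[i] * sum(nbrs.values()) / max(len(nbrs), 1) for i, nbrs in adj.items()}
```

      On a 5-node path graph whose middle edge has weight 3, the radius-1 CI ranks node 2 highest. Leaf nodes always score 0 because of the (k_i - 1) factor.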

    • Shortwave Wideband Specific Signal Detection Based on Frequency-Sensitive Attention

      2023, 38(1):63-73. DOI: 10.16337/j.1004-9037.2023.01.004

      Abstract: A shortwave wideband specific signal detection algorithm based on frequency-sensitive attention is proposed to improve the accuracy of specific signal detection and recognition in complex shortwave electromagnetic environments. In the spectrogram, shortwave specific signals are correlated along the time axis and local along the frequency axis. Exploiting this, a frequency-sensitive attention mechanism with a long, narrow receptive field is designed. On this basis, an end-to-end detector, the frequency-sensitive signal detector (FSSDet), is constructed. It segments the feature map into strip blocks along the time axis and computes self-attention within each strip block, which captures long-range dependencies along the time axis while limiting the receptive range along the frequency axis. Given the spectrogram of a shortwave wideband signal as input, FSSDet directly outputs the modulation type of several specific signals together with key parameters such as start and end time, center frequency, and bandwidth. Experiments on a simulated dataset of 47 880 samples from eight classes show that the proposed method achieves a mean average precision (mAP) of up to 98.5 above 0 dB, and it remains above 72.5 when the signal-to-noise ratio (SNR) is as low as -10 dB. These results show that the proposed method detects and recognizes shortwave specific signals with high accuracy and robustness under low SNR.
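
      Strip-block self-attention can be sketched as follows. The sketch assumes a (frequency, time, channels) feature map cut along the frequency axis into strips a few bins high that span the whole time axis. Plain single-head attention with random projection matrices `wq`, `wk` and `wv` is computed only among positions of the same strip. The strip height, single head and projections are illustrative assumptions, not FSSDet's actual configuration.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def strip_self_attention(feat, strip_height, wq, wk, wv):
    """feat: (F, T, C) map. Each strip of `strip_height` frequency bins spans all T
    time frames; attention is long-range in time but confined in frequency."""
    n_freq, n_time, channels = feat.shape
    assert n_freq % strip_height == 0, "F must be divisible by strip_height"
    out = np.empty((n_freq, n_time, wv.shape[1]))
    for f0 in range(0, n_freq, strip_height):
        tokens = feat[f0:f0 + strip_height].reshape(-1, channels)  # (h*T, C)
        q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
        attn = softmax(q @ k.T / np.sqrt(q.shape[1]))              # (h*T, h*T)
        out[f0:f0 + strip_height] = (attn @ v).reshape(strip_height, n_time, -1)
    return out
```

      In this sketch, perturbing the input in one strip changes the output of that strip only. This is the "narrow in frequency, long in time" receptive field the abstract describes.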

    • A Harmonic and Inter-harmonic Frequency Estimation Method of Electric Power Systems via Compressed Sensing PARAFAC Method

      2023, 38(1):74-84. DOI: 10.16337/j.1004-9037.2023.01.005

      Abstract: Power quality has long attracted attention. As the number of power electronic devices in power systems grows, so do the harmonics they generate, and harmonics have become a persistent concern. This paper proposes a frequency estimation algorithm for power system harmonics and inter-harmonics that introduces compressed sensing theory and the parallel factor (PARAFAC) model. First, data are obtained at the signal receiving end, Euler's formula is used to convert the sinusoidal signal into a spatial signal, and the multi-delay outputs are arranged into a parallel factor model. Second, the three slices of the model are compressed, and parallel factor decomposition is performed with the trilinear alternating least squares algorithm. Finally, the obtained data are sparsely reconstructed to obtain automatically paired frequency estimates. Compared with the traditional parallel factor algorithm, this method adds a compression step, requires less computation, and has lower storage requirements. Its frequency estimation performance is very close to that of the traditional PARAFAC method and better than that of the estimation of signal parameters via rotational invariance techniques (ESPRIT) method.
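
      The trilinear alternating least squares (TALS) step can be sketched for a real-valued 3-way tensor x[i, j, k] ≈ sum_r a[i, r] b[j, r] c[k, r]. Each update solves an exact least-squares problem for one factor with the other two fixed. The compression and sparse-reconstruction stages of the paper are not shown, and the random initialisation and iteration count are assumptions.

```python
import numpy as np


def cp_als(x, rank, n_iter=100, seed=0):
    """TALS for a 3-way PARAFAC model. Returns factors and the relative error per sweep."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(x.shape[0], rank))
    b = rng.normal(size=(x.shape[1], rank))
    c = rng.normal(size=(x.shape[2], rank))
    errors = []
    for _ in range(n_iter):
        # Normal equations: A (B^T B * C^T C) = X_(1) (C kr B), written with einsum.
        a = np.einsum("ijk,jr,kr->ir", x, b, c) @ np.linalg.pinv((b.T @ b) * (c.T @ c))
        b = np.einsum("ijk,ir,kr->jr", x, a, c) @ np.linalg.pinv((a.T @ a) * (c.T @ c))
        c = np.einsum("ijk,ir,jr->kr", x, a, b) @ np.linalg.pinv((a.T @ a) * (b.T @ b))
        approx = np.einsum("ir,jr,kr->ijk", a, b, c)
        errors.append(np.linalg.norm(x - approx) / np.linalg.norm(x))
    return a, b, c, errors
```

      Because every sub-step is an exact least-squares solve, the reconstruction error cannot increase from one sweep to the next. For complex harmonic data built with Euler's formula, the transposes would become conjugate transposes, which this real-valued sketch omits.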

    • Research on Self-organizing Map Network Based on Gaussian Neuron

      2023, 38(1):85-92. DOI: 10.16337/j.1004-9037.2023.01.006

      Abstract: The self-organizing map network (SOM) is a classic unsupervised learning method with self-organizing and online learning capabilities. Owing to its simplicity and practicality, many SOM variants have emerged to address various problems. However, these variants basically build networks from deterministic neurons and ignore the uncertainty implicit in the data itself. As a result, their outputs carry no confidence information that would make them interpretable, which indicates that SOM neurons are insufficient for characterizing uncertainty. This article proposes a new SOM variant, the Gaussian neuron SOM network (GNSOM). Its neuron nodes are no longer deterministic but are modeled as Gaussian neurons following Gaussian distributions, which equips SOM with the ability to express the uncertainty of the data. In the implementation, the input data are also Gaussianized, and the Jensen-Shannon (JS) divergence replaces the Euclidean distance as the similarity matching metric in GNSOM learning, thereby yielding an uncertainty representation. Experimental results show that GNSOM trains better and reflects the uncertainty of the data through the covariance matrices of its neuron nodes. Since this Gaussianization of neurons is independent of SOM itself, it can be extended to other neuron models.
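
      JS-divergence matching between Gaussians can be sketched in a simplified univariate form (GNSOM itself uses covariance matrices). Unlike the KL divergence, the JS divergence between two Gaussians has no closed form, because the mixture (P + Q) / 2 is not Gaussian. The sketch therefore integrates numerically on a grid. `best_matching_unit` is a HYPOTHETICAL GNSOM-style winner search over neurons given as (mean, variance) pairs, not the authors' code.

```python
import numpy as np


def gaussian_pdf(x, mean, var):
    return np.exp(-((x - mean) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)


def js_divergence_gaussians(m1, v1, m2, v2, n_grid=20001):
    """JS divergence (nats) between N(m1, v1) and N(m2, v2), via the trapezoid rule."""
    std = np.sqrt(max(v1, v2))
    x = np.linspace(min(m1, m2) - 10 * std, max(m1, m2) + 10 * std, n_grid)
    p, q = gaussian_pdf(x, m1, v1), gaussian_pdf(x, m2, v2)
    m = 0.5 * (p + q)
    tiny = 1e-300  # keeps log() finite where densities underflow to zero
    integrand = (0.5 * p * np.log((p + tiny) / (m + tiny))
                 + 0.5 * q * np.log((q + tiny) / (m + tiny)))
    dx = x[1] - x[0]
    return float(dx * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1])))


def best_matching_unit(sample, neurons):
    """HYPOTHETICAL matching: index of the neuron closest to `sample` in JS divergence."""
    m, v = sample
    return int(np.argmin([js_divergence_gaussians(m, v, mn, vn) for mn, vn in neurons]))
```

      The JS divergence is symmetric and bounded by ln 2 (reached for well-separated distributions). These properties make it a convenient similarity metric compared with the unbounded, asymmetric KL divergence.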

    • Zero Resource Korean ASR Based on Acoustic Model Sharing

      2023, 38(1):93-100. DOI: 10.16337/j.1004-9037.2023.01.007

      Abstract: An accurate speech recognition system usually relies on a large amount of manually transcribed training data, which is a barrier to speech recognition for many low-resource languages. Acoustic model sharing exploits the similarity between a high-resource and a low-resource language. It offers a new way to address this problem and makes it possible to build an automatic speech recognition (ASR) system without any training data in the target low-resource language. This paper extends the method to Korean speech recognition. Specifically, we train an acoustic model on Mandarin data and define a set of mapping rules between Mandarin and Korean phonemes. A character error rate (CER) of 27.33% is achieved on the Zeroth Korean test set without using any Korean speech data. We also compare source-to-target and target-to-source phoneme mapping rules and show that the latter are more suitable for acoustic model sharing.
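
      The character error rate reported above can be sketched as the Levenshtein edit distance (substitutions, deletions and insertions) between hypothesis and reference, summed over the corpus and divided by the total number of reference characters. Counting Hangul syllables as characters and ignoring whitespace handling are assumptions of this sketch. The Zeroth scoring setup may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (dynamic programming, two rows)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1,              # deletion
                          curr[j - 1] + 1,          # insertion
                          prev[j - 1] + (r != h))   # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(refs, hyps):
    """Corpus-level CER: total edit distance / total reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / sum(len(r) for r in refs)
```

      For example, one substituted syllable across two five-syllable reference sentences gives a CER of 1/10 = 10%.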

    • Vietnamese Speech Recognition Based on Pre-training and Phone-Based Byte-Pair Encoding

      2023, 38(1):101-110. DOI: 10.16337/j.1004-9037.2023.01.008

      Abstract: wav2vec 2.0, which is based on unsupervised pre-training, has become a research hotspot because of its state-of-the-art performance on many low-resource languages. In this paper, Vietnamese continuous speech recognition is carried out on the basis of the pre-trained model. Phonetic information is integrated into acoustic modeling based on the connectionist temporal classification (CTC) loss, with phones and position-dependent phones selected as the basic modeling units. To balance the number of modeling units against the granularity of the model, a byte-pair encoding (BPE) algorithm is used to generate phone-based subwords, which integrates contextual information into acoustic modeling. Experiments on the low-resource Vietnamese development set of NIST’s BABEL task show that the proposed algorithm significantly improves on the wav2vec 2.0 baseline system, reducing the word error rate from 37.3% to 29.4%.
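
      Phone-based BPE can be sketched as classic BPE merge learning run over phone sequences instead of letters. At each step, the most frequent adjacent pair of units is merged into a new subword unit. The phone symbols, the "+" join marker and the tiny corpus are HYPOTHETICAL placeholders. A real lexicon would supply the (possibly position-dependent) Vietnamese phones.

```python
from collections import Counter


def merge_pair(word, pair):
    """Replace each adjacent occurrence of `pair` in `word` with one merged unit."""
    merged, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            merged.append(word[i] + "+" + word[i + 1])
            i += 2
        else:
            merged.append(word[i])
            i += 1
    return tuple(merged)


def learn_phone_bpe(corpus, num_merges):
    """corpus: list of words, each a sequence of phone symbols. Returns merges in order."""
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_words = Counter()
        for word, freq in words.items():
            new_words[merge_pair(word, best)] += freq
        words = new_words
    return merges


def apply_bpe(word, merges):
    """Segment a phone sequence with learned merges, applied in learned order."""
    for pair in merges:
        word = merge_pair(tuple(word), pair)
    return tuple(word)
```

      The number of merges controls the trade-off the abstract mentions: more merges yield fewer, longer and more context-rich units, while fewer merges keep the inventory close to the base phone set.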

    • Deep and Shallow Feature Fusion Based on Graph Convolution for Cross-Corpus Emotion Recognition

      2023, 38(1):111-120. DOI: 10.16337/j.1004-9037.2023.01.009

      Abstract: The training and testing data for speech emotion recognition often come from different corpora, and in this case model recognition performance decreases greatly because of domain mismatch. To address this problem, we present a new graph construction method for cross-corpus speech emotion recognition that uses a graph convolutional network to represent the topological structure between the source and target databases. In addition, a novel feature fusion method is proposed to address the low accuracy of a single feature type in emotion recognition. First, we extract acoustic features with OpenSMILE, and then we extract deep features with a graph convolutional neural network. As the convolutional layers proceed, nodes pass feature information to other nodes, so the deep features carry clearer feature information and more detailed semantic information. Finally, we fuse the shallow and deep features. Two classification experiments are carried out. With the eNTERFACE corpus for training and the Berlin corpus for testing, the recognition rate is 59.375%. With the Berlin corpus for training and the eNTERFACE corpus for testing, the recognition rate is 36.111%. Both results are higher than the best results of the baseline system and of the referenced works, which demonstrates the effectiveness of the proposed method.
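
      The shallow-plus-deep pipeline can be sketched with the standard GCN propagation rule H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W) (Kipf and Welling). The graph construction in `knn_graph`, which connects each utterance from either corpus to its k nearest neighbours, and fusion by concatenation are HYPOTHETICAL choices. The abstract does not specify either, and random vectors stand in for OpenSMILE features.

```python
import numpy as np


def knn_graph(features, k):
    """HYPOTHETICAL graph: link each utterance (node) to its k nearest neighbours."""
    dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # exclude self-loops from the neighbour search
    nearest = np.argsort(dists, axis=1)[:, :k]
    adj = np.zeros_like(dists)
    adj[np.repeat(np.arange(len(features)), k), nearest.ravel()] = 1.0
    return np.maximum(adj, adj.T)  # make the graph undirected


def normalized_adjacency(adj):
    """Symmetric normalisation D^{-1/2} (A + I) D^{-1/2}."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def fuse_shallow_deep(shallow, a_norm, weights):
    """Stack GCN layers to get deep features, then concatenate with shallow ones."""
    deep = shallow
    for w in weights:
        deep = np.maximum(a_norm @ deep @ w, 0.0)  # graph convolution + ReLU
    return np.concatenate([shallow, deep], axis=1)
```

      Each layer mixes a node's features with those of its neighbours. This is how information flows between source- and target-corpus utterances that share an edge.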

    • Person Re-identification Based on Feature Pyramid Branch and Non-local Attention

      2023, 38(1):121-131. DOI: 10.16337/j.1004-9037.2023.01.010

      Abstract: Attending to both the global contour and the local details of a person is very important for person re-identification. To extract more representative features, a person re-identification network based on feature pyramid branches and non-local attention modules is proposed to extract global and local features of a person. First, a lightweight feature pyramid branch structure extracts features from different network layers and aggregates them into a two-way pyramid structure. Second, to further improve accuracy, non-local attention modules are used to extract global features, which capture the global information of a person while still attending to local details, making the final fused features more representative. Finally, the features of different layers are fused, and a joint loss function is used to train the network, significantly improving the performance of the backbone network. Extensive experiments on four public person re-identification datasets, MSMT17, Market1501, DukeMTMC-ReID and PersonX, show that the proposed method is competitive with several advanced person re-identification methods.
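
      A non-local attention module can be sketched in the embedded-Gaussian form of Wang et al. (2018). Every spatial position attends to every other position through a softmax over pairwise affinities, and the aggregated context is added back through a residual connection. The projection matrices stand in for 1x1 convolutions and are assumptions. How the paper places and sizes these modules is not shown.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def non_local_block(x, w_theta, w_phi, w_g, w_out):
    """x: (H, W, C) feature map; w_theta, w_phi, w_g: (C, C'); w_out: (C', C)."""
    h, w, c = x.shape
    flat = x.reshape(h * w, c)
    theta, phi, g = flat @ w_theta, flat @ w_phi, flat @ w_g
    attn = softmax(theta @ phi.T, axis=-1)       # (HW, HW) pairwise affinities
    y = attn @ g                                 # global context for every position
    return (flat + y @ w_out).reshape(h, w, c)   # residual path preserves local detail
```

      With `w_out` set to zeros the block is an identity mapping, a common initialisation that lets it be inserted into a pretrained backbone without disturbing it at the start of training.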

    • Dangerous Behavior Recognition Based on CNN-LSTM Dual-Stream Fusion Network

      2023, 38(1):132-140. DOI: 10.16337/j.1004-9037.2023.01.011

      Abstract: To address insufficient spatial and temporal features in dangerous behavior recognition, this paper improves the traditional dual-stream convolution model and proposes a new CNN-LSTM-based dual-stream dangerous behavior recognition model. In this model, a CNN and an LSTM network are connected in parallel. The CNN serves as the spatial stream: the spatial motion posture information of the human skeleton is divided into static and dynamic components, and these features are fused as the output of the spatial stream. To strengthen the extraction of temporal features of the human skeleton, an improved temporal sliding LSTM network is used in the temporal stream. Finally, the two branches are fused in time and space, and dangerous actions are classified by Softmax. Experimental results on the NTU RGB+D and Kinetics datasets show that the improved model achieves an average cross-view (CV) accuracy of 92.5% and an average cross-subject (CS) accuracy of 87.9%. The proposed method outperforms both the original model and other methods, effectively recognizes dangerous human actions, and discriminates well between ambiguous actions.

    • Low-Quality Image Enhancement Based on Distance Weighted Color Cast Estimation

      2023, 38(1):141-149. DOI: 10.16337/j.1004-9037.2023.01.012

      Abstract: Low-quality images captured under harsh atmospheric conditions such as colored fog, smoke and dust suffer from low visibility and color cast, which hinder both human observation and computer vision applications. Current enhancement algorithms for such images usually ignore the influence of the scene-to-camera distance on the color cast. To restore color while enhancing visibility, this paper proposes a model relating visibility reduction and color cast to distance, together with a method for solving it. First, the distance is estimated from the local brightness of the image, and the color cast matrix of the image is estimated from the distance. Then, an image with restored visibility and color is obtained by solving the degradation model. Finally, for further detail enhancement, the restored image is fused by distance weighting with an image enhanced by contrast limited adaptive histogram equalization (CLAHE). Experiments show that, compared with similar methods, the proposed method achieves high scores on image quality evaluation metrics and recovers color significantly better.
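
      The three stages can be sketched under the standard atmospheric scattering model I = J t + A (1 - t), with t = exp(-beta d) and a per-channel airlight A whose unequal channels act as the color cast. This is an ASSUMED stand-in for the paper's degradation model. The brightness-to-distance proxy and the distance-weighted blend are also HYPOTHETICAL. The CLAHE image is taken as an input (it could come from any CLAHE implementation).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def estimate_distance(img, win=7):
    """HYPOTHETICAL proxy: brighter neighbourhoods are treated as farther away.
    img: (H, W, 3) in [0, 1]; returns a relative distance map in [0, 1]."""
    gray = img.mean(axis=2)
    padded = np.pad(gray, win // 2, mode="edge")
    local = sliding_window_view(padded, (win, win)).mean(axis=(-1, -2))
    return (local - local.min()) / (local.max() - local.min() + 1e-12)


def restore(img, airlight, dist, beta=1.0, t_min=0.1):
    """Invert I = J*t + A*(1 - t) with t = exp(-beta*d); A has one value per channel."""
    t = np.maximum(np.exp(-beta * dist), t_min)[..., None]
    return np.clip((img - airlight * (1.0 - t)) / t, 0.0, 1.0)


def distance_weighted_fusion(restored, enhanced, dist):
    """HYPOTHETICAL blend: lean on the CLAHE-enhanced image where the scene is farther."""
    w = dist[..., None]
    return (1.0 - w) * restored + w * enhanced
```

      When the airlight and distance are known exactly, `restore` recovers the clear image. In practice both must be estimated, which is where the paper's distance-dependent color cast estimation comes in.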

    • Vortex Detection Based on Improved Anchor-Free Object Detection Algorithm

      2023, 38(1):150-161. DOI: 10.16337/j.1004-9037.2023.01.013

      Abstract: Vortices play a crucial role in the formation and maintenance of various flow structures in fluid motion, and identifying and detecting them helps in understanding flow laws. Traditional vortex detection methods have many shortcomings, such as imprecise definitions, heavy dependence on empirical thresholds, and poor generalization, which make vortex detection challenging. In this paper, a vortex detection model based on an object detection algorithm is proposed from a computer vision perspective. The original object detection model achieves unsatisfactory accuracy on slender vortices with extreme aspect ratios. To address this, the paper analyzes the data characteristics of two different types of vortices and proposes a feature adaptive module based on the deformable convolutional network (DCN) and a slender sample mining method based on an improved loss function. The proposed model is verified on cylinder wake vortex and submarine tail vortex datasets. Experimental results show that the improved model significantly improves detection accuracy, especially for slender vortices, and effectively balances detection performance across vortex types.

    • Pedestrian Detection Incorporating Deep and Shallow Features and Dynamic Selection Mechanisms

      2023, 38(1):162-173. DOI: 10.16337/j.1004-9037.2023.01.014

      Abstract: Multi-scale and small-scale pedestrians in autonomous driving scenarios increase the missed detection rate and reduce detection accuracy. To address this, this paper proposes a pedestrian detection method that fuses deep and shallow features with a cascaded dynamic selection mechanism. First, on the basis of YOLO v3-tiny, the feature extraction part is improved with a densely connected convolutional neural network, and the deep and shallow features of pedestrians are fused to enhance the network's ability to recognize pedestrians. Second, an attention module with a dynamic selection mechanism is cascaded onto the improved backbone network so that the detection network adapts better to dynamic changes in pedestrian scale. Finally, experiments are conducted on the BDD 100K dataset and the Caltech pedestrian dataset. While maintaining real-time performance (25 ms per image), the method reduces the pedestrian missed detection rate by 11.4% and improves the average detection accuracy by 11.7% on BDD 100K. On Caltech, it reduces the missed detection rate by 10.1% and improves the average detection accuracy by 6.7%, making it suitable for pedestrian detection in autonomous driving.

    • A New Shot Boundary Detection Method of Lecture Video for Teaching Evaluation

      2023, 38(1):174-185. DOI: 10.16337/j.1004-9037.2023.01.015

      Abstract: Shot boundary detection (SBD) of lecture video is of great significance to teaching evaluation (TE). This paper proposes a new SBD method to address three problems: visual changes in lecture videos are subtle, boundary information alone is insufficient, and the detection results of current methods are of limited use for TE. The proposed method learns visual and textual representations with attention mechanisms. First, a hierarchical vision transformer (HViT) model is proposed to learn visual features from regions of interest (ROI) such as the screen projection, the teacher and the students. Second, a hierarchical text transformer (HTT) model is proposed to learn features relevant to teaching evaluation from the speech and screen text. Finally, the loss function is constructed jointly from the binary cross entropies of shot classification and boundary detection. Experimental results on the CLShots dataset show that the average precision, recall, F1-score and mean intersection over union of our method are higher by 23.3%, 22.4%, 22% and 35.7%, respectively, than those of the state-of-the-art method SBLV. They are also higher by 13.8%, 14.5%, 14.3% and 21.3% than those of TransNet V2.
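
      The joint objective can be sketched as a weighted sum of two per-frame binary cross entropies, one for shot classification and one for boundary detection. The mixing weight `alpha` and the use of plain means are HYPOTHETICAL. The abstract says only that the two cross entropies are combined jointly.

```python
import numpy as np


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean BCE; probabilities are clipped away from 0 and 1 to keep log() finite."""
    p = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p)))


def joint_sbd_loss(shot_probs, shot_labels, boundary_probs, boundary_labels, alpha=0.5):
    """HYPOTHETICAL weighting: alpha * BCE(shot class) + (1 - alpha) * BCE(boundary)."""
    return (alpha * binary_cross_entropy(shot_probs, shot_labels)
            + (1 - alpha) * binary_cross_entropy(boundary_probs, boundary_labels))
```

      An uninformative predictor that outputs 0.5 everywhere scores ln 2 ≈ 0.693 on both terms, so the joint loss starts near ln 2 regardless of `alpha`.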

    • Noise Estimation Based on Combined Spatial and Spectral Information for Hyperspectral Image

      2023, 38(1):186-192. DOI: 10.16337/j.1004-9037.2023.01.016

      Abstract: Obtaining accurate noise estimates for texture-rich hyperspectral images is a difficult problem in noise estimation. This paper describes a spectral decorrelation method based on the spatial regularity and spectral correlation of hyperspectral images. Homogeneous region division is a key step in many noise estimation methods, and precise division can effectively improve the accuracy of noise estimation. To this end, the simple linear iterative clustering (SLIC) algorithm is combined with spectral-spatial similarity to segment hyperspectral images into image blocks with similar local structure, preserving their homogeneity. Spectral information divergence and spectral angle are combined into a spectral distance measure to improve discrimination between spectra. Spectral correlations are removed within homogeneous regions by multiple linear regression, and the noise levels are obtained from the residual images. Noise of various levels is added to simulated images of varying ground complexity, and comparison with a variety of methods verifies the effectiveness and stability of the proposed method. Finally, the method is successfully applied to noise level estimation on the Urban dataset, where it accurately identifies bands heavily polluted by noise.
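
      The two spectral measures and the regression step can be sketched as follows. The spectral angle (SAM) and spectral information divergence (SID) are standard. Combining them as SID * tan(SAM) is one known choice from the literature and is ASSUMED here, not taken from the paper. The noise estimate regresses a band on its two spectral neighbours (plus an intercept) within one homogeneous block. It omits any spatial-neighbour terms the authors may use.

```python
import numpy as np


def spectral_angle(x, y):
    """SAM in radians between two spectra."""
    cos = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


def sid(x, y, eps=1e-12):
    """Spectral information divergence: symmetric KL between normalised spectra."""
    p = (x + eps) / np.sum(x + eps)
    q = (y + eps) / np.sum(y + eps)
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))


def sid_sa(x, y):
    """ASSUMED combination of the two measures: SID * tan(SAM)."""
    return sid(x, y) * np.tan(spectral_angle(x, y))


def block_noise_std(block, band):
    """block: (n_pixels, n_bands) of one homogeneous region. Regress `band` on bands
    band-1 and band+1; the residual standard deviation estimates that band's noise."""
    y = block[:, band]
    x = np.column_stack([np.ones(len(y)), block[:, band - 1], block[:, band + 1]])
    coef, *_ = np.linalg.lstsq(x, y, rcond=None)
    resid = y - x @ coef
    return float(np.sqrt(resid @ resid / (len(y) - x.shape[1])))
```

      The residual also carries some noise from the neighbouring bands, so this simple estimator tends to read somewhat above the true noise level. Accurate homogeneous regions keep that bias from being compounded by texture.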

    • An Improved Sensitivity Encoding Reconstruction Algorithm Based on Nonlocal Low-Rank Constraints

      2023, 38(1):193-208. DOI: 10.16337/j.1004-9037.2023.01.017

      Abstract: Sensitivity encoding (SENSE) is a widely used parallel magnetic resonance imaging (MRI) reconstruction model, and many improved models have been proposed to enhance its reconstruction performance. However, images reconstructed by these improved methods still contain many artifacts, and it is particularly difficult to reconstruct clear images at high acceleration factors. Therefore, this paper proposes an improved SENSE model based on nonlocal low-rank (NLR) constraints, named NLR-SENSE, which effectively improves the quality of parallel MRI reconstructions. The weighted nuclear norm is adopted as the rank surrogate function, and the alternating direction method of multipliers (ADMM) is used to solve the NLR-SENSE model. Simulation results show that, compared with several other parallel MRI reconstruction methods, NLR-SENSE performs better both in visual comparison and on three objective metrics, and effectively improves the quality of the reconstructed image.
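
      Within an ADMM solver, the low-rank subproblem on each group of similar patches reduces to weighted singular value thresholding. The group is decomposed by SVD, each singular value is shrunk by its own weight, and the group is reassembled. The reweighting rule w_i = tau / (sigma_i + eps), which shrinks small singular values (noise and aliasing) more than large ones (structure), is a common choice ASSUMED here. Patch grouping, coil sensitivities and the other ADMM updates are not shown.

```python
import numpy as np


def weighted_svt(patch_group, tau, eps=1e-8):
    """Weighted singular value thresholding on a (patch_size, n_patches) matrix."""
    u, s, vt = np.linalg.svd(patch_group, full_matrices=False)
    weights = tau / (s + eps)               # ASSUMED reweighting: small sigma, big weight
    s_shrunk = np.maximum(s - weights, 0.0)  # soft-threshold each singular value
    return (u * s_shrunk) @ vt
```

      Applied to a noisy rank-1 matrix, the step zeroes the small noise-dominated singular values and returns a rank-1 estimate close to the clean matrix.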

    • Image Caption Generation Model Based on Graph Neural Network and Guidance Vector

      2023, 38(1):209-219. DOI: 10.16337/j.1004-9037.2023.01.018

      Abstract: In recent years, deep learning has shown its advantages in image captioning research. In deep learning models, the relationships between objects in an image play an important role in image representation. To better detect the visual relationships in an image, an image caption generation model (YOLOv4-GCN-GRU, YGG) is constructed based on a graph neural network and a guidance vector. The model builds a graph from the spatial and semantic information of the objects detected in the image and uses a graph convolutional network (GCN) as the encoder to represent each region of the graph. During decoding, an additional guidance neural network is trained to generate a guidance vector that assists the decoder in generating sentences automatically. Comparative experiments on the MSCOCO image dataset show that the YGG model performs better, improving the CIDEr-D score from 138.9% to 142.1%.

    • A Two-Stage Pseudorange Error Compensation Method of BeiDou Navigation Receiver

      2023, 38(1):220-230. DOI: 10.16337/j.1004-9037.2023.01.019

      Abstract: Pseudorange error is a key factor affecting the positioning accuracy of BeiDou satellite navigation receivers. This paper proposes a two-stage pseudorange error compensation method for BeiDou navigation receivers based on pseudorange differencing and an adaptive cubature Kalman filter (CKF). The pseudorange error is divided into a self error and a common error. First, the self error is compensated with the pseudorange difference method. Second, a measurement-noise-adaptive CKF is designed to estimate the state of the receiver motion system in order to compensate the common error. Experimental results show that the two-stage compensation method is slightly better than single-stage compensation under static conditions. When the carrier is dynamic, the two-stage compensation reduces the localization error significantly more than single-stage compensation. The adaptive CKF algorithm also adapts better to noise and interference than the standard CKF algorithm.
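
      The core of any CKF is the third-degree spherical-radial cubature rule: 2n points mean ± sqrt(n) S e_i, where S S^T is the covariance, each with weight 1/(2n). These points are pushed through the (possibly nonlinear) model to obtain the predicted mean and covariance. The sketch covers only this transform. The measurement-noise adaptation and the BeiDou state and pseudorange models are not reproduced, and the example matrices are arbitrary.

```python
import numpy as np


def cubature_points(mean, cov):
    """Return the 2n cubature points (rows) and their equal weights 1/(2n)."""
    n = mean.size
    s = np.linalg.cholesky(cov)                             # cov = S S^T
    offsets = np.sqrt(n) * np.concatenate([s.T, -s.T])     # rows: +/- sqrt(n) * columns of S
    return mean + offsets, np.full(2 * n, 1.0 / (2 * n))


def cubature_transform(mean, cov, f, noise_cov):
    """Propagate N(mean, cov) through f; used for both prediction and measurement steps."""
    pts, w = cubature_points(mean, cov)
    y = np.array([f(p) for p in pts])
    y_mean = w @ y
    diff = y - y_mean
    y_cov = (w[:, None] * diff).T @ diff + noise_cov
    return y_mean, y_cov
```

      For a linear f, the transform reproduces the Kalman filter moments A m and A P A^T + R exactly. For a nonlinear pseudorange function such as the range from receiver to satellite, it approximates them without computing Jacobians.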
