Abstract: Multimodal aspect-level sentiment analysis aims to integrate image and text modalities to accurately predict the sentiment polarity of aspect terms. However, existing methods still have significant limitations in accurately locating text-relevant image region features and in effectively handling cross-modal information interaction; moreover, a biased understanding of intra-modal context introduces additional noise. To address these problems, a multimodal aspect-level sentiment analysis model based on GCN and Target Visual Feature Enhancement (GCN-TVFE) is proposed. First, the CLIP model is used to process the text, aspect terms, and image data: by computing the similarity between text and image and between aspect terms and image, and combining the two similarities, the matching degree of text to image and of aspect terms to image is quantitatively evaluated. The Faster R-CNN model is then used to quickly and accurately detect and localize target regions in the image, further strengthening the model's ability to extract text-relevant image features. Second, GCNs are used to mine intra-modal feature information in depth: a text graph is constructed from the dependency syntactic relations within the text, and an image graph is generated with the KNN algorithm. Finally, a multi-layer multimodal interactive attention mechanism captures the correlations between aspect terms and text, and between target visual features and image-generated caption features, which significantly reduces noise interference and strengthens cross-modal feature interaction. Experimental results on the public Twitter-2015 and Twitter-2017 datasets show that the proposed model achieves superior overall performance, verifying its effectiveness for multimodal sentiment analysis.
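
The CLIP-based matching step described above can be illustrated with a minimal sketch. It computes the sentence-image and aspect-image cosine similarities with a pretrained CLIP encoder and fuses them into a single matching score. The "openai/clip-vit-base-patch32" checkpoint, the `clip_match_score` helper, and the weighted-sum fusion with parameter `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of the CLIP-based text-image / aspect-image matching step.
# Assumptions: HuggingFace "openai/clip-vit-base-patch32" checkpoint and a
# simple weighted sum of the two cosine similarities; the paper's actual
# combination rule may differ.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_match_score(sentence: str, aspect: str, image: Image.Image,
                     alpha: float = 0.5) -> float:
    """Combine sentence-image and aspect-image CLIP similarities."""
    inputs = processor(text=[sentence, aspect], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"])
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    # Cosine similarity between each text embedding and the image embedding.
    text_emb = torch.nn.functional.normalize(text_emb, dim=-1)
    img_emb = torch.nn.functional.normalize(img_emb, dim=-1)
    sent_sim, aspect_sim = (text_emb @ img_emb.T).squeeze(-1).tolist()
    # Hypothetical fusion: weighted sum of the two similarities.
    return alpha * sent_sim + (1.0 - alpha) * aspect_sim
```

In the full model, this score serves as the quantitative evaluation of text-image and aspect-image matching that precedes the Faster R-CNN region extraction and the graph-based intra-modal encoding described in the abstract.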