In recent years, deep learning has shown clear advantages in image captioning research. In deep-learning captioning models, the relationships between objects in an image play an important role in image representation. To better detect visual relationships in the image, an image caption generation model (YOLOv4-GCN-GRU, YGG) is constructed based on a graph neural network and a guidance vector. The model builds a graph from the spatial and semantic information of the objects detected in the image and uses a graph convolutional network (GCN) as the encoder to represent each region of the graph. During decoding, an additional guidance network is trained to generate a guidance vector that assists the decoder in automatically generating sentences. Comparative experiments on the MSCOCO image dataset show that the YGG model performs better, improving the CIDEr-D score from 138.9% to 142.1%.
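The pipeline described above can be sketched in a minimal form: encode detected regions with one GCN layer over an object-relation graph, then derive a guidance vector that is concatenated with the word embedding at each decoding step. This is an illustrative NumPy sketch under assumed dimensions, not the paper's actual architecture; the adjacency matrix, layer sizes, and guidance network are all placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all sizes are illustrative assumptions, not the paper's values):
# 4 detected object regions, each with a 5-dim visual feature.
N, F, H = 4, 5, 6
X = rng.standard_normal((N, F))          # region features from the detector

# Adjacency encoding spatial/semantic relations between regions (assumed);
# add self-loops and symmetrically normalize: A_hat = D^-1/2 (A + I) D^-1/2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
A_tilde = A + np.eye(N)
d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# One GCN layer: H1 = ReLU(A_hat @ X @ W), giving relation-aware encodings.
W = rng.standard_normal((F, H)) * 0.1
H1 = np.maximum(A_hat @ X @ W, 0.0)

# Guidance vector: a separate small network (a stand-in for the trained
# guidance network) maps the pooled graph encoding to a vector that is
# concatenated with the word embedding at each GRU decoding step.
Wg = rng.standard_normal((H, H)) * 0.1
g = np.tanh(H1.mean(axis=0) @ Wg)        # guidance vector, shape (H,)

E = 3                                     # word-embedding size (illustrative)
w_t = rng.standard_normal(E)              # current word embedding
decoder_input = np.concatenate([w_t, g])  # input to the GRU decoder per step

print(H1.shape, g.shape, decoder_input.shape)
```

The symmetric normalization keeps the GCN's message passing stable across regions with different numbers of neighbors; the real model would learn `W` and the guidance network jointly with the GRU decoder.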