Emotional video captioning, as a cross-modal task that integrates visual semantic parsing with emotional perception, faces the core challenge of accurately capturing the emotional cues embedded in visual content. Existing methods have two notable limitations: first, they insufficiently explore the fine-grained semantic correlations between video subjects (such as humans and objects) and their appearance and motion features, providing weak support for refined understanding of visual content; second, they neglect the auxiliary value of the audio modality for emotional discrimination and semantic alignment, which restricts the comprehensive use of cross-modal information. To address these issues, this paper proposes a framework based on fine-grained visual fusion and audio-visual dual-branch fusion. Specifically, the fine-grained visual feature fusion module models the fine-grained semantic associations between video entities and their visual context through pairwise interaction and deep integration of visual, object, and motion features, thereby achieving refined parsing of video content. The audio-visual dual-branch global fusion module constructs a cross-modal interaction channel that deeply fuses the integrated visual features with audio features, fully exploiting the supplementary role of audio in conveying emotional cues and constraining semantics. Experiments on public benchmark datasets show that the proposed method outperforms comparative methods such as CANet and EPAN across evaluation metrics: relative to EPAN, it achieves an average improvement of 4% on emotional metrics, an average gain of 0.5 on semantic metrics, and an average gain of 0.7 on comprehensive metrics. These results demonstrate that the proposed method effectively improves the quality of emotional video captioning.
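
The sketch below illustrates one plausible way to organize the two fusion stages described above; it is not the authors' implementation. The module names (PairwiseFusion, DualBranchFusion), the feature dimensions, and the use of standard cross-attention via torch.nn.MultiheadAttention are assumptions introduced purely for illustration.

```python
# Minimal sketch (assumed design, not the paper's code): pairwise cross-attention
# for fine-grained visual fusion, followed by a global audio-visual fusion step.
import torch
import torch.nn as nn


class PairwiseFusion(nn.Module):
    """Fuses a query stream with a context stream (e.g. object vs. appearance,
    object vs. motion, visual vs. audio) using cross-attention."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats: torch.Tensor, context_feats: torch.Tensor) -> torch.Tensor:
        # query_feats: (batch, q_tokens, dim); context_feats: (batch, c_tokens, dim)
        fused, _ = self.attn(query_feats, context_feats, context_feats)
        return self.norm(query_feats + fused)  # residual connection + layer norm


class DualBranchFusion(nn.Module):
    """Fine-grained visual fusion followed by audio-visual global fusion."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.obj_app = PairwiseFusion(dim)   # objects attend to appearance features
        self.obj_mot = PairwiseFusion(dim)   # objects attend to motion features
        self.av_fuse = PairwiseFusion(dim)   # fused visual features attend to audio

    def forward(self, appearance, motion, objects, audio):
        # Pairwise interactions between video entities and their visual context.
        vis = self.obj_app(objects, appearance)
        vis = self.obj_mot(vis, motion)
        # Global audio-visual branch: audio supplements emotional cues.
        return self.av_fuse(vis, audio)


if __name__ == "__main__":
    b, t, d = 2, 16, 512
    model = DualBranchFusion(d)
    out = model(torch.randn(b, t, d), torch.randn(b, t, d),
                torch.randn(b, 8, d), torch.randn(b, t, d))
    print(out.shape)  # torch.Size([2, 8, 512])
```

In this reading, the fused representation would then feed a caption decoder; how the decoder consumes it, and how emotional supervision is applied, are left unspecified here because the abstract does not detail them.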