Abstract: The rapid advancement of AI and image editing technologies has led to a surge in tampered images, posing new challenges for Fake News Detection (FND). Prevailing multimodal FND methods focus primarily on modeling text-image semantic consistency and neglect the possibility that the images themselves have been manipulated, which leaves them insufficiently robust against tampered content. To address this limitation, this paper proposes an FND approach based on an Image Tampering Perception and Multi-view Fusion Network (ITPMFN). The method consists of three components. (1) A multi-view feature extraction and interaction module employs BERT and Swin-T to capture modality-specific features from text and images, respectively, and leverages CLIP to extract cross-modal aligned semantic features, constructing a four-channel multi-view representation. (2) A collaborative reasoning-based image tampering perception and interpretable analysis generation module first uses a lightweight model to extract low-level statistical tampering cues, then designs enhanced prompts based on these cues to guide a multimodal large language model in generating structured, high-level tampering explanations, including the manipulated objects and manipulation types. These explanations provide human-interpretable decision rationales, and their semantic embeddings are encoded as fusion-ready features for downstream fake news classification. (3) A cross-modal interaction and fusion module applies attention mechanisms to interact and fuse the multi-view features at both intra- and inter-modal levels, yielding more discriminative representations, which are combined with the tampering reasoning features for final multimodal fake news detection.
Experiments on two widely used public benchmarks, Weibo and Fakeddit, demonstrate that the proposed method consistently outperforms existing state-of-the-art approaches, and ablation studies further validate the effectiveness of each component.
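
As an illustration of component (2), the step that turns low-level tampering cues into an enhanced prompt might look like the minimal sketch below. It is not the authors' implementation: the `TamperingCues` container, the cue format (a score plus bounding boxes), the 0.5 threshold, the prompt wording, and the JSON keys are all assumptions made for illustration, and the call to the multimodal large language model itself is left out.

```python
import json
from dataclasses import dataclass, field


@dataclass
class TamperingCues:
    # Hypothetical container; the abstract does not specify the cue format.
    tamper_score: float                          # assumed: estimated probability of editing, in [0, 1]
    regions: list = field(default_factory=list)  # assumed: (x, y, width, height) pixel boxes


def build_enhanced_prompt(news_text, cues, threshold=0.5):
    """Fold low-level cues into an instruction for a multimodal LLM (assumed wording)."""
    if cues.tamper_score >= threshold:
        verdict = "likely edited"
    else:
        verdict = "no strong evidence of editing"
    if cues.regions:
        region_text = "; ".join(
            f"x={x}, y={y}, w={w}, h={h}" for x, y, w, h in cues.regions
        )
    else:
        region_text = "none flagged"
    return (
        "You are inspecting the image attached to a news post.\n"
        f"Post text: {news_text}\n"
        f"Detector score: {cues.tamper_score:.2f} ({verdict}).\n"
        f"Suspicious regions: {region_text}.\n"
        "Reply with JSON only, using the keys "
        '"manipulated_objects" (list of strings) and '
        '"manipulation_types" (list of strings).'
    )


def parse_explanation(reply):
    """Validate the structured reply before it is embedded as a feature."""
    data = json.loads(reply)
    return {
        "manipulated_objects": [str(o) for o in data.get("manipulated_objects", [])],
        "manipulation_types": [str(t) for t in data.get("manipulation_types", [])],
    }
```

In this sketch, `build_enhanced_prompt("Flood hits city centre", TamperingCues(0.83, [(40, 60, 120, 90)]))` produces the text sent to the model together with the image, and `parse_explanation` normalizes the reply into the two fields named in the abstract before a text encoder embeds them.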