Abstract: Text-guided editing of real images, with only an image and a target text prompt as input, is an extremely challenging problem. Previous approaches based on fine-tuning large pre-trained diffusion models often simply interpolate and combine source and target text features to guide the image generation process, which limits their editing capability; moreover, fine-tuning such large models is prone to overfitting and time-consuming. In this paper, we propose MFE-Diffusion, a text-guided image editing method based on a diffusion model with mapping-fusion embedding. The method consists of two components: 1) a joint learning framework that optimizes the large pre-trained diffusion model together with the source text feature vectors, enabling the model to quickly learn to reconstruct the original image; and 2) a feature mapping-fusion module that deeply fuses the features of the target text and the original image to produce conditional embeddings, which are then used to guide the image editing process. Experiments on the challenging text-guided image editing benchmark TEdBench demonstrate the advantages of the proposed method in image editing performance.
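
The abstract leaves the internals of the feature mapping-fusion module unspecified. The following is a minimal, hypothetical PyTorch sketch of one way target-text and original-image features could be fused into a conditional embedding; the class name, layer dimensions, and cross-attention design are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: the fusion operator and dimensions below are assumptions.
import torch
import torch.nn as nn


class MappingFusionEmbedding(nn.Module):
    """Fuses target-text features with original-image features into a
    conditional embedding (hypothetical cross-attention + MLP design)."""

    def __init__(self, text_dim=768, image_dim=1024, embed_dim=768, num_heads=8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, embed_dim)    # map text features
        self.image_proj = nn.Linear(image_dim, embed_dim)  # map image features
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.fuse = nn.Sequential(                         # deep fusion MLP
            nn.Linear(embed_dim, embed_dim * 4),
            nn.GELU(),
            nn.Linear(embed_dim * 4, embed_dim),
        )

    def forward(self, target_text_feats, image_feats):
        # target_text_feats: (B, L_text, text_dim); image_feats: (B, L_img, image_dim)
        q = self.text_proj(target_text_feats)
        kv = self.image_proj(image_feats)
        attended, _ = self.cross_attn(q, kv, kv)           # text queries attend to image
        return self.fuse(q + attended)                     # conditional embedding


# Example usage (random tensors stand in for real text/image encoder outputs):
cond = MappingFusionEmbedding()(torch.randn(2, 77, 768), torch.randn(2, 256, 1024))
print(cond.shape)  # torch.Size([2, 77, 768])
```

Under this assumed design, the resulting embedding would replace the plain target-text embedding as the condition supplied to the diffusion model's denoising network during editing.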