To address the limitations of the UNet architecture in capturing local features and preserving edge details in medical image segmentation, this paper presents an improved UNet algorithm that integrates a self-attention mechanism. The proposed algorithm builds on the traditional encoder-decoder structure and incorporates two components: a multi-scale convolution (MSC) block for multi-granularity feature extraction, and a convolution mixer attention (CMA) block, which combines the local feature modeling of convolutional layers with the global contextual modeling of self-attention layers. Extensive experiments on the BUSI and DDTI segmentation datasets, with comparisons against existing classical network architectures, verify the strong segmentation ability of the model. Additionally, statistical analysis and ablation studies further confirm the effectiveness of the MSC and CMA modules. This research provides an innovative approach to high-precision medical image segmentation, with theoretical and practical implications for improving the accuracy and efficiency of medical diagnosis.