Abstract: With the widespread application of visual perception technologies in fields such as intelligent security, behavioral analysis, and urban transportation, viewpoint-induced feature distribution shifts have become a key challenge in person re-identification. While traditional Convolutional Neural Networks (CNNs) excel at capturing local details, they struggle to model cross-view global dependencies and to maintain semantic consistency. Transformers, on the other hand, offer strong global modeling but suffer from computational redundancy and poor generalization in high-dimensional settings. To address these challenges, we propose a multi-view cooperative feature encoding framework that integrates fine-grained local representation with global feature alignment. The framework first uses a CNN backbone to extract detailed features and then employs a Cross-View Neighborhood Transformer for low-rank modeling. By incorporating a mutual-neighborhood sparse attention mechanism, it enhances cross-camera contextual interactions and reduces redundancy in multi-view feature fusion. Additionally, an adaptive metric combination strategy is introduced to improve discriminability and recognition accuracy in complex environments. Experiments on three public benchmarks (Market1501, DukeMTMC-ReID, and MSMT17) show that the proposed method outperforms existing approaches with mAP/Rank-1 scores of 91.7%/96.1%, 85.2%/92.4%, and 63.5%/83.6%, respectively, demonstrating strong generalization and application potential.
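
To make the core mechanism concrete, the sketch below illustrates one plausible form of a mutual-neighborhood sparse attention step in PyTorch: attention weights are kept only for token pairs that appear in each other's top-k neighbor sets, which is one way to suppress redundant cross-view interactions. The function name, the neighbor count, and the exact masking rule are illustrative assumptions and not necessarily the paper's formulation.

```python
import torch
import torch.nn.functional as F

def mutual_neighbor_sparse_attention(q, k, v, num_neighbors=8):
    """Minimal sketch of a mutual-neighborhood sparse attention step.

    q, k, v: (batch, tokens, dim) tensors, e.g. CNN feature-map patches
    from different camera views concatenated along the token axis.
    A pair (i, j) is attended to only if i is among j's top-k neighbors
    AND j is among i's top-k neighbors; all other pairs are masked out.
    """
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5   # (B, N, N)

    # directional k-nearest-neighbor mask (row-wise top-k of the scores)
    topk = scores.topk(num_neighbors, dim=-1).indices
    nn_mask = torch.zeros_like(scores, dtype=torch.bool)
    nn_mask.scatter_(-1, topk, True)

    # mutual neighborhood: keep (i, j) only if both directions agree
    mutual = nn_mask & nn_mask.transpose(-2, -1)

    scores = scores.masked_fill(~mutual, float("-inf"))
    attn = F.softmax(scores, dim=-1)
    attn = torch.nan_to_num(attn)   # guard rows with no mutual neighbor
    return attn @ v

# toy usage: 2 views x 64 patch tokens, each of dimension 256
x = torch.randn(1, 128, 256)
out = mutual_neighbor_sparse_attention(x, x, x, num_neighbors=8)
print(out.shape)  # torch.Size([1, 128, 256])
```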