Abstract: In group behavior recognition, the behavior of the whole group can be inferred by detecting the behavior of each person in the group over a period of time. To solve the group behavior recognition problem, an end-to-end deep learning network is constructed that combines an action vector of locally aggregated descriptors (ActionVLAD) pooling layer with a multi-layer long short-term memory (LSTM) network. In addition to the conventional single-frame RGB (red, green, blue) image input, dense optical flow (Dense_flow) is added to describe the motion between video frames, and the two together form the input of a two-stream network. The features are modeled by the lower-level LSTM, and the behavior of each individual is represented by the fused two-stream features. The ActionVLAD pooling layer fuses features from different times and different positions in the frame, which allows it to integrate the information about individuals more effectively. Finally, the top-level LSTM is connected to a Softmax classifier, which judges the group activity from the merged individual information. Tests on the Collective Activity dataset yield an average recognition accuracy of 82.3%.
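The pooling step described above can be illustrated with a minimal sketch of VLAD-style aggregation. The abstract does not give the formula, so this assumes the commonly used soft-assignment form: each local descriptor is softly assigned to a set of cluster centers, and the weighted residuals are summed over all frames and positions. The function names, the `alpha` parameter, and the fixed `centers` are hypothetical; in a trained network the centers would be learned.

```python
import math

def soft_assign(x, centers, alpha=1.0):
    """Softmax over negative squared distances from x to each center."""
    logits = [-alpha * sum((xi - ci) ** 2 for xi, ci in zip(x, c))
              for c in centers]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def l2_normalize(v, eps=1e-12):
    """Scale v to (approximately) unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / (norm + eps) for a in v]

def vlad_pool(descriptors, centers, alpha=1.0):
    """Aggregate descriptors from all frames and positions into one vector.

    descriptors: list of D-dimensional feature vectors (assumed input layout)
    centers:     list of K cluster centers, each D-dimensional (hypothetical)
    returns:     a flat vector of length K * D
    """
    dim = len(centers[0])
    residual_sums = [[0.0] * dim for _ in centers]
    for x in descriptors:
        weights = soft_assign(x, centers, alpha)
        for k, (w, c) in enumerate(zip(weights, centers)):
            for j in range(dim):
                residual_sums[k][j] += w * (x[j] - c[j])
    # Normalize each cluster's residual sum, then the concatenated vector.
    flat = []
    for r in residual_sums:
        flat.extend(l2_normalize(r))
    return l2_normalize(flat)
```

Because the residuals are summed, the pooled vector does not depend on the order of the descriptors, which is why features from different times and positions can be merged into a single fixed-length representation.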