Region-sequence based six-stream CNN features for general and fine-grained human action recognition in videos

Research output: Contribution to journal › Article › peer-review

38 Citations (Scopus)
394 Downloads (Pure)


This paper addresses the problems of both general and fine-grained human action recognition in video sequences. Compared with general human actions, fine-grained action information is more difficult to detect and occupies relatively small-scale image regions. Our work seeks to improve fine-grained action discrimination while retaining the ability to perform general action recognition. Our method first estimates human pose and the positions of human parts in video sequences by extending our recent work on human pose tracking, and crops patches at different scales to obtain richer action information across a variety of scales of appearance and motion cues. We then use a Convolutional Neural Network (CNN) to process each such image patch. Instead of using the one-dimensional output feature from the fully-connected layer, we use the outputs of the CNN's pooling layer, which contain more spatial information. The high-dimensional pooling features are then reduced by encoding to generate the final human action descriptors for classification. Our method reduces feature dimension while effectively combining appearance and motion information in a unified framework. We have carried out empirical experiments on two publicly available human action datasets, comparing the action recognition results of our algorithm against six recent state-of-the-art methods from the literature. The results suggest comparatively strong performance of our method.
Original language: English
Pages (from-to): 506-521
Journal: Pattern Recognition
Early online date: 21 Nov 2017
Publication status: Published - 1 Apr 2018


  • human pose
  • action recognition
  • video understanding


