Region-sequence based six-stream CNN features for general and fine-grained human action recognition in videos

Miao Ma; Naresh Marturi; Yibin Li; Ales Leonardis; Rustam Stolkin

doi:10.1016/j.patcog.2017.11.026

Research output: Contribution to journal › Article › peer-review

Abstract

This paper addresses both general and fine-grained human action recognition in video sequences. Compared with general human actions, fine-grained action information is more difficult to detect and occupies relatively small image regions. Our work seeks to improve fine-grained action discrimination while retaining the ability to perform general action recognition. Our method first estimates human pose and the positions of human body parts in video sequences by extending our recent work on human pose tracking, and then crops patches at different scales to capture richer appearance and motion cues. We then use a Convolutional Neural Network (CNN) to process each image patch. Instead of using the one-dimensional feature vector output by the fully connected layer, we use the outputs of the CNN's pooling layer, which retain more spatial information. The high-dimensional pooling features are then reduced by encoding to generate the final human action descriptors for classification. Our method reduces feature dimension while effectively combining appearance and motion information in a unified framework. We carried out experiments on two publicly available human action datasets, comparing the recognition results of our algorithm against six recent state-of-the-art methods from the literature. The results suggest comparatively strong performance of our method.
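
As a rough illustration of the step in which high-dimensional pooling-layer features are reduced by encoding, the Python sketch below treats each spatial location of a pooled feature map as a local descriptor and aggregates all descriptors of a clip into one fixed-length vector. The abstract does not name the encoding scheme, so VLAD is used here purely as an assumed stand-in; the function names, array shapes, and the random codebook are hypothetical and are not taken from the paper.

```python
import numpy as np


def pooled_maps_to_descriptors(pooled):
    """Reshape (T, C, H, W) pooling-layer outputs into (T*H*W, C) local descriptors."""
    T, C, H, W = pooled.shape
    # Move channels last so each spatial position yields one C-dimensional vector.
    return pooled.transpose(0, 2, 3, 1).reshape(T * H * W, C)


def vlad_encode(local_descriptors, codebook):
    """Encode (N, D) descriptors against a (K, D) codebook into a K*D vector.

    VLAD is an assumed choice of encoding, not necessarily the paper's.
    """
    # Squared Euclidean distance from every descriptor to every codeword: (N, K).
    diffs = local_descriptors[:, None, :] - codebook[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)

    K, D = codebook.shape
    vlad = np.zeros((K, D))
    for k in range(K):
        members = local_descriptors[nearest == k]
        if len(members):
            # Accumulate residuals of the descriptors assigned to codeword k.
            vlad[k] = (members - codebook[k]).sum(axis=0)

    vlad = vlad.ravel()
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad


rng = np.random.default_rng(0)
# Hypothetical pooled features for one cropped patch sequence:
# 8 frames, 16 channels, 3x3 spatial map.
pooled = rng.standard_normal((8, 16, 3, 3))
# Hypothetical codebook with 4 codewords; random here, normally learned (e.g. by k-means).
codebook = rng.standard_normal((4, 16))

descriptors = pooled_maps_to_descriptors(pooled)
video_descriptor = vlad_encode(descriptors, codebook)
print(pooled.size, descriptors.shape, video_descriptor.shape)
```

In this toy setting the 1152 raw pooled values of the clip are summarised by a 64-dimensional descriptor whose length does not depend on the number of frames, which is the kind of property that makes an encoded descriptor convenient as classifier input.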

Original language	English
Pages (from-to)	506-521
Journal	Pattern Recognition
Volume	76
Early online date	21 Nov 2017
DOIs	https://doi.org/10.1016/j.patcog.2017.11.026
Publication status	Published - 1 Apr 2018

Keywords

human pose
action recognition
video understanding

Access to Document

10.1016/j.patcog.2017.11.026
Licence: None: All rights reserved

Ma_et_al_Region-sequence_based_six_stream_Pattern_Recognition_2017
Checked for eligibility: 29/11/2017
Accepted author manuscript, 3.07 MB
Licence: Creative Commons: Attribution-NonCommercial-NoDerivs (CC BY-NC-ND)

http://www.sciencedirect.com/science/article/pii/S0031320317304788
Licence: None: All rights reserved

KTP with KUKA Robotics UK Ltd - To develop a new toolbox of software and algorithms
Stolkin, R. & Leonardis, A.
KUKA ROBOTICS UK LTD, KNOWLEDGE TRANSFER PARTNERSHIPS
13/01/15 → 12/01/18
Project: Research

Cite this

@article{6679e64f05684858827da6d074c2fbef,

title = "Region-sequence based six-stream CNN features for general and fine-grained human action recognition in videos",

abstract = "This paper addresses the problems of both general and also fine-grained human action recognition in video sequences. Compared with general human actions, fine-grained action information is more difficult to detect and occupies relatively small-scale image regions. Our work seeks to improve fine-grained action discrimination, while also retaining the ability to perform general action recognition. Our method first estimates human pose and human parts positions in video sequences by extending our recent work on human pose tracking, and crops different scaled patches to obtain richer action information in a variety of different scales of appearance and motion cues. We then utilize a Convolutional Neural Network (CNN) to process each such image patch. Instead of using the output one dimension feature from the full-connection layer, we utilize the outputs of the pooling layer of CNN structure, which contains more spatial information. Then the high dimension of the pooling features is reduced by encoding, to generate the final human action descriptors for classification. Our method reduces feature dimension while also effectively combining appearance and motion information in a unified framework. We have carried out empirical experiments using two publicly available human action datasets, comparing the human action recognition result of our algorithm against six recent state-of-the-art methods from the literature. The results suggest comparatively strong performance of our method.",

keywords = "human pose, action recognition, video understanding",

author = "Miao Ma and Naresh Marturi and Yibin Li and Ales Leonardis and Rustam Stolkin",

year = "2018",

month = apr,

day = "1",

doi = "10.1016/j.patcog.2017.11.026",

language = "English",

volume = "76",

pages = "506--521",

journal = "Pattern Recognition",

issn = "0031-3203",

publisher = "Elsevier",

}

TY - JOUR

T1 - Region-sequence based six-stream CNN features for general and fine-grained human action recognition in videos

AU - Ma, Miao

AU - Marturi, Naresh

AU - Li, Yibin

AU - Leonardis, Ales

AU - Stolkin, Rustam

PY - 2018/4/1

Y1 - 2018/4/1

N2 - This paper addresses the problems of both general and also fine-grained human action recognition in video sequences. Compared with general human actions, fine-grained action information is more difficult to detect and occupies relatively small-scale image regions. Our work seeks to improve fine-grained action discrimination, while also retaining the ability to perform general action recognition. Our method first estimates human pose and human parts positions in video sequences by extending our recent work on human pose tracking, and crops different scaled patches to obtain richer action information in a variety of different scales of appearance and motion cues. We then utilize a Convolutional Neural Network (CNN) to process each such image patch. Instead of using the output one dimension feature from the full-connection layer, we utilize the outputs of the pooling layer of CNN structure, which contains more spatial information. Then the high dimension of the pooling features is reduced by encoding, to generate the final human action descriptors for classification. Our method reduces feature dimension while also effectively combining appearance and motion information in a unified framework. We have carried out empirical experiments using two publicly available human action datasets, comparing the human action recognition result of our algorithm against six recent state-of-the-art methods from the literature. The results suggest comparatively strong performance of our method.

AB - This paper addresses the problems of both general and also fine-grained human action recognition in video sequences. Compared with general human actions, fine-grained action information is more difficult to detect and occupies relatively small-scale image regions. Our work seeks to improve fine-grained action discrimination, while also retaining the ability to perform general action recognition. Our method first estimates human pose and human parts positions in video sequences by extending our recent work on human pose tracking, and crops different scaled patches to obtain richer action information in a variety of different scales of appearance and motion cues. We then utilize a Convolutional Neural Network (CNN) to process each such image patch. Instead of using the output one dimension feature from the full-connection layer, we utilize the outputs of the pooling layer of CNN structure, which contains more spatial information. Then the high dimension of the pooling features is reduced by encoding, to generate the final human action descriptors for classification. Our method reduces feature dimension while also effectively combining appearance and motion information in a unified framework. We have carried out empirical experiments using two publicly available human action datasets, comparing the human action recognition result of our algorithm against six recent state-of-the-art methods from the literature. The results suggest comparatively strong performance of our method.

KW - human pose

KW - action recognition

KW - video understanding

U2 - 10.1016/j.patcog.2017.11.026

DO - 10.1016/j.patcog.2017.11.026

M3 - Article

SN - 0031-3203

VL - 76

SP - 506

EP - 521

JO - Pattern Recognition

JF - Pattern Recognition

ER -
