Self-supervised video representation learning by uncovering spatio-temporal statistics

Jiangliu Wang, Jianbo Jiao, Linchao Bao*, Shengfeng He, Wei Liu, Yun Hui Liu

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

6 Citations (Scopus)


This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, and the spatial location and dominant color of the largest color diversity along the temporal axis. A neural network is then built and trained to predict these statistical summaries from the video frames. To alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact Cartesian coordinates. Our approach is inspired by the observation that the human visual system is sensitive to rapidly changing content in the visual field and needs only impressions of rough spatial locations to understand the visual content. To validate the effectiveness of the proposed approach, we conduct extensive experiments with four 3D backbone networks: C3D, 3D-ResNet, R(2+1)D, and S3D-G. The results show that our approach outperforms existing approaches across these backbone networks on four downstream video analysis tasks: action recognition, video retrieval, dynamic scene recognition, and action similarity labeling. The source code is publicly available at:
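As a rough illustration of the kind of pretext label the abstract describes, the sketch below derives two motion statistics from a precomputed optical-flow clip: a coarse grid-cell index for the location of the largest motion, and a quantized dominant direction. It is not the authors' released code. The function name `motion_statistics`, the single uniform grid (the paper uses several partitioning patterns), the 8 direction bins, and the magnitude-weighted histogram are all assumptions made for illustration.

```python
import numpy as np

def motion_statistics(flow, grid=4, n_dirs=8):
    """Hypothetical helper: derive two pretext labels from optical flow.

    flow: array of shape (T, H, W, 2) with per-frame (dx, dy) displacements.
    Returns (cell_index, direction_bin).
    """
    # Accumulate motion along the temporal axis.
    total = flow.sum(axis=0)                      # (H, W, 2)
    mag = np.linalg.norm(total, axis=-1)          # (H, W)
    h, w = mag.shape

    # Pixel with the largest accumulated motion.
    row, col = np.unravel_index(mag.argmax(), mag.shape)

    # Encode the location as a coarse grid cell (row-major index)
    # instead of exact pixel coordinates. Assumed: one uniform grid.
    cell_r = min(int(row) * grid // h, grid - 1)
    cell_c = min(int(col) * grid // w, grid - 1)
    cell_index = cell_r * grid + cell_c

    # Dominant direction: magnitude-weighted histogram of flow angles.
    angles = np.arctan2(total[..., 1], total[..., 0]) % (2 * np.pi)
    bins = np.floor(angles / (2 * np.pi / n_dirs)).astype(int) % n_dirs
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_dirs)
    direction_bin = int(hist.argmax())

    return cell_index, direction_bin
```

For a synthetic clip in which only one pixel moves, the function returns the grid cell containing that pixel and the direction bin of its displacement. In a training setup, such labels would serve as classification targets for the network, which sees only the raw frames.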

Original language: English
Pages (from-to): 3791-3806
Number of pages: 16
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Issue number: 7
Early online date: 10 Feb 2021
Publication status: Published - Jul 2022

Bibliographical note

Funding Information:
This work was supported by the Hong Kong RGC TRS under Grant T42-409/18-R, the Hong Kong ITC under Grant ITS/448/16FP, the VC Fund 4930745 of the CUHK T Stone Robotics Institute, the Hong Kong Centre for Logistics Robotics, the Hong Kong-Shenzhen Innovation and Technology Research Institute (Futian), the National Natural Science Foundation of China (No. 61972162), the EPSRC Programme Grant Seebibyte EP/M013774/1, and Visual AI EP/T028572/1.

Publisher Copyright:
© 1979-2012 IEEE.


Keywords

  • 3D CNN
  • Self-supervised learning
  • representation learning
  • video understanding

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition
  • Computational Theory and Mathematics
  • Artificial Intelligence
  • Applied Mathematics

