Self-supervised Video Representation Learning by Pace Prediction

Jiangliu Wang; Jianbo Jiao; Yun-Hui Liu

Computer Science

Research output: Working paper/Preprint › Preprint

Abstract

This paper addresses the problem of self-supervised video representation learning from a new perspective: video pace prediction. It stems from the observation that the human visual system is sensitive to video pace, e.g., slow motion, a widely used technique in filmmaking. Specifically, given a video played at its natural pace, we randomly sample training clips at different paces and ask a neural network to identify the pace of each clip. The assumption here is that the network can only succeed in such a pace reasoning task when it understands the underlying video content and learns representative spatio-temporal features. In addition, we introduce contrastive learning to push the model towards discriminating different paces by maximizing the agreement on similar video content. To validate the effectiveness of the proposed method, we conduct extensive experiments on action recognition and video retrieval tasks with several alternative network architectures. Experimental evaluations show that our approach achieves state-of-the-art performance for self-supervised video representation learning across different network architectures and different benchmarks. The code and pre-trained models are available at https://github.com/laura-wang/video-pace.
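The clip-sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes that a pace is modeled as an integer frame stride, so only natural and faster-than-natural paces are covered; slow motion would need frame repetition or interpolation and is left out. The function name `sample_pace_clip`, the clip length of 16 frames, and the candidate pace set `(1, 2, 3, 4)` are hypothetical choices made for illustration.

```python
import random
from typing import List, Optional, Sequence, Tuple


def sample_pace_clip(
    num_frames: int,
    clip_len: int = 16,                    # hypothetical clip length
    paces: Sequence[int] = (1, 2, 3, 4),   # hypothetical candidate paces (frame strides)
    rng: Optional[random.Random] = None,
) -> Tuple[List[int], int]:
    """Return the frame indices of one training clip and its pace label.

    The label is the position of the chosen pace in `paces`, so the pretext
    task becomes a plain classification problem over the candidate paces.
    """
    rng = rng or random.Random()

    # Keep only the paces whose clip fits inside the video.
    feasible = [
        i for i, pace in enumerate(paces)
        if (clip_len - 1) * pace + 1 <= num_frames
    ]
    if not feasible:
        raise ValueError("video is too short for every candidate pace")

    label = rng.choice(feasible)
    pace = paces[label]

    # Number of source frames covered by the clip at this pace.
    span = (clip_len - 1) * pace + 1
    start = rng.randint(0, num_frames - span)  # randint is inclusive on both ends

    indices = [start + k * pace for k in range(clip_len)]
    return indices, label
```

In a training loop, the returned indices would select frames from the decoded video, and the label would serve as the target for a classification loss on the network's pace prediction.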

Original language	English
Publication status	Published - 13 Aug 2020

Bibliographical note

Corrected some typos; updated some concurrent works accepted by ECCV 2020

Keywords

cs.CV

Access to Document

2008.05861v2

Cite this

@techreport{f07cb3653475420381ad5b5fdd228755,

title = "Self-supervised Video Representation Learning by Pace Prediction",

abstract = "This paper addresses the problem of self-supervised video representation learning from a new perspective: video pace prediction. It stems from the observation that the human visual system is sensitive to video pace, e.g., slow motion, a widely used technique in filmmaking. Specifically, given a video played at its natural pace, we randomly sample training clips at different paces and ask a neural network to identify the pace of each clip. The assumption here is that the network can only succeed in such a pace reasoning task when it understands the underlying video content and learns representative spatio-temporal features. In addition, we introduce contrastive learning to push the model towards discriminating different paces by maximizing the agreement on similar video content. To validate the effectiveness of the proposed method, we conduct extensive experiments on action recognition and video retrieval tasks with several alternative network architectures. Experimental evaluations show that our approach achieves state-of-the-art performance for self-supervised video representation learning across different network architectures and different benchmarks. The code and pre-trained models are available at https://github.com/laura-wang/video-pace.",

keywords = "cs.CV",

author = "Jiangliu Wang and Jianbo Jiao and Yun-Hui Liu",

note = "Corrected some typos; updated some concurrent works accepted by ECCV 2020",

year = "2020",

month = aug,

day = "13",

language = "English",

type = "WorkingPaper",

}

TY - UNPB

T1 - Self-supervised Video Representation Learning by Pace Prediction

AU - Wang, Jiangliu

AU - Jiao, Jianbo

AU - Liu, Yun-Hui

N1 - Corrected some typos; updated some concurrent works accepted by ECCV 2020

PY - 2020/8/13

Y1 - 2020/8/13

N2 - This paper addresses the problem of self-supervised video representation learning from a new perspective: video pace prediction. It stems from the observation that the human visual system is sensitive to video pace, e.g., slow motion, a widely used technique in filmmaking. Specifically, given a video played at its natural pace, we randomly sample training clips at different paces and ask a neural network to identify the pace of each clip. The assumption here is that the network can only succeed in such a pace reasoning task when it understands the underlying video content and learns representative spatio-temporal features. In addition, we introduce contrastive learning to push the model towards discriminating different paces by maximizing the agreement on similar video content. To validate the effectiveness of the proposed method, we conduct extensive experiments on action recognition and video retrieval tasks with several alternative network architectures. Experimental evaluations show that our approach achieves state-of-the-art performance for self-supervised video representation learning across different network architectures and different benchmarks. The code and pre-trained models are available at https://github.com/laura-wang/video-pace.

AB - This paper addresses the problem of self-supervised video representation learning from a new perspective: video pace prediction. It stems from the observation that the human visual system is sensitive to video pace, e.g., slow motion, a widely used technique in filmmaking. Specifically, given a video played at its natural pace, we randomly sample training clips at different paces and ask a neural network to identify the pace of each clip. The assumption here is that the network can only succeed in such a pace reasoning task when it understands the underlying video content and learns representative spatio-temporal features. In addition, we introduce contrastive learning to push the model towards discriminating different paces by maximizing the agreement on similar video content. To validate the effectiveness of the proposed method, we conduct extensive experiments on action recognition and video retrieval tasks with several alternative network architectures. Experimental evaluations show that our approach achieves state-of-the-art performance for self-supervised video representation learning across different network architectures and different benchmarks. The code and pre-trained models are available at https://github.com/laura-wang/video-pace.

KW - cs.CV

M3 - Preprint

BT - Self-supervised Video Representation Learning by Pace Prediction

ER -
