Unified Image and Video Saliency Modeling

Richard Droste*, Jianbo Jiao, J. Alison Noble

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

21 Citations (Scopus)

Abstract

Visual saliency modeling for images and videos is treated as two independent tasks in recent computer vision literature. While image saliency modeling is a well-studied problem and progress on benchmarks like SALICON and MIT300 is slowing, video saliency models have shown rapid gains on the recent DHF1K benchmark. Here, we take a step back and ask: Can image and video saliency modeling be approached via a unified model, with mutual benefit? We identify different sources of domain shift between image and video saliency data and between different video saliency datasets as a key challenge for effective joint modeling. To address this, we propose four novel domain adaptation techniques (Domain-Adaptive Priors, Domain-Adaptive Fusion, Domain-Adaptive Smoothing and Bypass-RNN) in addition to an improved formulation of learned Gaussian priors. We integrate these techniques into a simple and lightweight encoder-RNN-decoder-style network, UNISAL, and train it jointly with image and video saliency data. We evaluate our method on the video saliency datasets DHF1K, Hollywood-2 and UCF-Sports, and the image saliency datasets SALICON and MIT300. With one set of parameters, UNISAL achieves state-of-the-art performance on all video saliency datasets and is on par with the state-of-the-art for image saliency datasets, despite faster runtime and a 5- to 20-fold smaller model size compared to all competing deep methods. We provide retrospective analyses and ablation studies that confirm the importance of the domain shift modeling. The code is available at https://github.com/rdroste/unisal.
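The abstract only names the domain adaptation techniques, so the following is a minimal PyTorch sketch of two of the underlying ideas: per-domain learned Gaussian priors and a recurrent path that static images bypass. All module names, shapes, and the stand-in convolution used in place of a recurrent cell are assumptions for illustration, not the published UNISAL implementation (see the linked repository for the actual code).

```python
import torch
import torch.nn as nn


class DomainAdaptivePrior(nn.Module):
    """Rough sketch of per-domain learned Gaussian priors.

    Each dataset (domain) gets its own set of learned 2D Gaussians,
    parameterized by centre offsets and log-scales, which are added to
    the saliency logits. This is an illustrative formulation only.
    """

    def __init__(self, num_domains, num_priors=4):
        super().__init__()
        # Learnable (mu_x, mu_y, log_sigma_x, log_sigma_y) per prior and domain.
        self.params = nn.Parameter(0.1 * torch.randn(num_domains, num_priors, 4))

    def forward(self, logits, domain):
        # logits: (N, 1, H, W) saliency logits; domain: integer dataset index.
        _, _, h, w = logits.shape
        ys = torch.linspace(0, 1, h, device=logits.device).view(1, h, 1)
        xs = torch.linspace(0, 1, w, device=logits.device).view(1, 1, w)
        p = self.params[domain]                      # (num_priors, 4)
        mu_x = 0.5 + p[:, 0].view(-1, 1, 1)
        mu_y = 0.5 + p[:, 1].view(-1, 1, 1)
        sx = p[:, 2].exp().view(-1, 1, 1)
        sy = p[:, 3].exp().view(-1, 1, 1)
        gauss = torch.exp(-(xs - mu_x) ** 2 / (2 * sx ** 2)
                          - (ys - mu_y) ** 2 / (2 * sy ** 2))  # (num_priors, H, W)
        return logits + gauss.sum(0, keepdim=True).unsqueeze(0)


class BypassRNN(nn.Module):
    """Temporal module that video features pass through and images bypass."""

    def __init__(self, channels):
        super().__init__()
        # A plain 3x3 convolution stands in for the recurrent cell here.
        self.temporal = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x, is_video):
        if not is_video:
            return x          # static images: skip the temporal path entirely
        return x + self.temporal(x)
```

In joint training, the integer domain index would simply follow the dataset each batch is drawn from; the Domain-Adaptive Fusion and Smoothing modules are omitted here for brevity.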

Original language: English
Title of host publication: Computer Vision – ECCV 2020 – 16th European Conference, Proceedings
Editors: Andrea Vedaldi, Horst Bischof, Thomas Brox, Jan-Michael Frahm
Publisher: Springer
Pages: 419-435
Number of pages: 17
ISBN (Print): 9783030585570
Publication status: Published - 2020
Event: 16th European Conference on Computer Vision, ECCV 2020 - Glasgow, United Kingdom
Duration: 23 Aug 2020 – 28 Aug 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12350 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 16th European Conference on Computer Vision, ECCV 2020
Country/Territory: United Kingdom
City: Glasgow
Period: 23/08/20 – 28/08/20

Bibliographical note

Funding Information:
We acknowledge the EPSRC (Project Seebibyte, reference EP/M013774/1) and the NVIDIA Corporation for the GPU donation.

Publisher Copyright:
© 2020, Springer Nature Switzerland AG.

Keywords

  • Domain adaptation
  • Video saliency
  • Visual saliency

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
