Self-Supervised Contrastive Video-Speech Representation Learning for Ultrasound

Jianbo Jiao; Yifan Cai; Mohammad Alsharid; Lior Drukker; Aris T. Papageorghiou; J. Alison Noble

doi:10.1007/978-3-030-59716-0_51

Corresponding author: Jianbo Jiao

Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Citations (Scopus)

Abstract

In medical imaging, manual annotations can be expensive to acquire and sometimes infeasible to access, making conventional deep learning-based models difficult to scale. As a result, it would be beneficial if useful representations could be derived from raw data without the need for manual annotations. In this paper, we propose to address the problem of self-supervised representation learning with multi-modal ultrasound video-speech raw data. For this case, we assume that there is a high correlation between the ultrasound video and the corresponding narrative speech audio of the sonographer. In order to learn meaningful representations, the model needs to identify such correlation and at the same time understand the underlying anatomical features. We designed a framework to model the correspondence between video and audio without any kind of human annotations. Within this framework, we introduce cross-modal contrastive learning and an affinity-aware self-paced learning scheme to enhance correlation modelling. Experimental evaluations on multi-modal fetal ultrasound video and audio show that the proposed approach is able to learn strong representations and transfers well to downstream tasks of standard plane detection and eye-gaze prediction.
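The abstract names cross-modal contrastive learning but does not give the loss. As a rough illustration only, the sketch below uses a generic symmetric InfoNCE-style objective over paired video and audio embeddings; it is not the authors' formulation, and it leaves out the affinity-aware self-paced scheme. The function names, the temperature value, and the toy embeddings are all assumptions made for this example, and embeddings are assumed to be non-zero vectors.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_modal_info_nce(video_emb, audio_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss (hypothetical helper, not from the paper).

    video_emb[i] and audio_emb[i] are assumed to come from the same clip
    (a positive pair); every other pairing in the batch is a negative.
    """
    v = [l2_normalize(x) for x in video_emb]
    a = [l2_normalize(x) for x in audio_emb]
    n = len(v)

    # sim[i][j]: temperature-scaled cosine similarity of video i and audio j.
    sim = [
        [sum(p * q for p, q in zip(v[i], a[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def row_loss(logits, target):
        # Cross-entropy with a numerically stable log-sum-exp.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        return log_denom - logits[target]

    # Video-to-audio: each row should pick out its own audio clip.
    v2a = sum(row_loss(sim[i], i) for i in range(n)) / n
    # Audio-to-video: each column should pick out its own video clip.
    a2v = sum(row_loss([sim[i][j] for i in range(n)], j) for j in range(n)) / n
    return 0.5 * (v2a + a2v)


# Toy embeddings (made up for illustration).
video = [[1.0, 0.0], [0.0, 1.0]]
aligned_audio = [[0.9, 0.1], [0.1, 0.9]]
shuffled_audio = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = cross_modal_info_nce(video, aligned_audio)
loss_shuffled = cross_modal_info_nce(video, shuffled_audio)
```

With these toy inputs the correctly paired batch yields a lower loss than the shuffled one, which is the behaviour a contrastive objective of this kind is meant to reward.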

Original language	English
Title of host publication	Medical Image Computing and Computer Assisted Intervention – MICCAI 2020 - 23rd International Conference, Proceedings
Editors	Anne L. Martel, Purang Abolmaesumi, Danail Stoyanov, Diana Mateus, Maria A. Zuluaga, S. Kevin Zhou, Daniel Racoceanu, Leo Joskowicz
Publisher	Springer
Pages	534-543
Number of pages	10
ISBN (Print)	9783030597153
DOIs	https://doi.org/10.1007/978-3-030-59716-0_51
Publication status	Published - 2020
Event	23rd International Conference on Medical Image Computing and Computer-Assisted Intervention, MICCAI 2020, Lima, Peru. Duration: 4 Oct 2020 → 8 Oct 2020

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	12263 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	23rd International Conference on Medical Image Computing and Computer-Assisted Intervention, MICCAI 2020
Country/Territory	Peru
City	Lima
Period	4 Oct 2020 → 8 Oct 2020

Bibliographical note

Funding Information:
Acknowledgements. We acknowledge the EPSRC (EP/M013774/1, Project Seebibyte), the ERC (ERC-ADG-2015 694581, Project PULSE), and the support of NVIDIA Corporation with the donation of the GPU.

Publisher Copyright:
© 2020, Springer Nature Switzerland AG.

Keywords

Representation learning
Self-supervised
Video-audio

ASJC Scopus subject areas

Theoretical Computer Science
General Computer Science

Cite this

Jiao, J., Cai, Y., Alsharid, M., Drukker, L., Papageorghiou, A. T., & Noble, J. A. (2020). Self-Supervised Contrastive Video-Speech Representation Learning for Ultrasound. In A. L. Martel, P. Abolmaesumi, D. Stoyanov, D. Mateus, M. A. Zuluaga, S. K. Zhou, D. Racoceanu, & L. Joskowicz (Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2020 - 23rd International Conference, Proceedings (pp. 534-543). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12263 LNCS). Springer. https://doi.org/10.1007/978-3-030-59716-0_51

@inproceedings{bec2a0d3d20f4e349bf36c96cf42703c,
  title = "Self-Supervised Contrastive Video-Speech Representation Learning for Ultrasound",
  keywords = "Representation learning, Self-supervised, Video-audio",
  author = "Jianbo Jiao and Yifan Cai and Mohammad Alsharid and Lior Drukker and Papageorghiou, {Aris T.} and Noble, {J. Alison}",
  note = "Funding Information: Acknowledgements. We acknowledge the EPSRC (EP/M013774/1, Project Seebibyte), the ERC (ERC-ADG-2015 694581, Project PULSE), and the support of NVIDIA Corporation with the donation of the GPU. Publisher Copyright: {\textcopyright} 2020, Springer Nature Switzerland AG.; 23rd International Conference on Medical Image Computing and Computer-Assisted Intervention, MICCAI 2020 ; Conference date: 04-10-2020 Through 08-10-2020",
  year = "2020",
  doi = "10.1007/978-3-030-59716-0_51",
  language = "English",
  isbn = "9783030597153",
  series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
  publisher = "Springer",
  pages = "534--543",
  editor = "Martel, {Anne L.} and Purang Abolmaesumi and Danail Stoyanov and Diana Mateus and Zuluaga, {Maria A.} and Zhou, {S. Kevin} and Daniel Racoceanu and Leo Joskowicz",
  booktitle = "Medical Image Computing and Computer Assisted Intervention – MICCAI 2020 - 23rd International Conference, Proceedings",
}

TY  - GEN
T1  - Self-Supervised Contrastive Video-Speech Representation Learning for Ultrasound
AU  - Jiao, Jianbo
AU  - Cai, Yifan
AU  - Alsharid, Mohammad
AU  - Drukker, Lior
AU  - Papageorghiou, Aris T.
AU  - Noble, J. Alison
N1  - Funding Information: Acknowledgements. We acknowledge the EPSRC (EP/M013774/1, Project Seebibyte), the ERC (ERC-ADG-2015 694581, Project PULSE), and the support of NVIDIA Corporation with the donation of the GPU. Publisher Copyright: © 2020, Springer Nature Switzerland AG.
PY  - 2020
Y1  - 2020
KW  - Representation learning
KW  - Self-supervised
KW  - Video-audio
UR  - http://www.scopus.com/inward/record.url?scp=85092701936&partnerID=8YFLogxK
U2  - 10.1007/978-3-030-59716-0_51
DO  - 10.1007/978-3-030-59716-0_51
M3  - Conference contribution
AN  - SCOPUS:85092701936
SN  - 9783030597153
T3  - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP  - 534
EP  - 543
BT  - Medical Image Computing and Computer Assisted Intervention – MICCAI 2020 - 23rd International Conference, Proceedings
A2  - Martel, Anne L.
A2  - Abolmaesumi, Purang
A2  - Stoyanov, Danail
A2  - Mateus, Diana
A2  - Zuluaga, Maria A.
A2  - Zhou, S. Kevin
A2  - Racoceanu, Daniel
A2  - Joskowicz, Leo
PB  - Springer
T2  - 23rd International Conference on Medical Image Computing and Computer-Assisted Intervention, MICCAI 2020
Y2  - 4 October 2020 through 8 October 2020
ER  -
