TY - UNPB
T1 - Show from Tell
T2 - Audio-Visual Modelling in Clinical Settings
AU - Jiao, Jianbo
AU - Alsharid, Mohammad
AU - Drukker, Lior
AU - Papageorghiou, Aris T.
AU - Zisserman, Andrew
AU - Noble, J. Alison
PY - 2023/10/25
AB - Auditory and visual signals usually occur together and correlate with each other, not only in natural environments but also in clinical settings. However, audio-visual modelling in the latter case can be more challenging, due to the differing sources of the audio and video signals and the noise (both signal-level and semantic-level) in the auditory signal, which is usually speech. In this paper, we consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations that benefit various clinical tasks, without requiring human expert annotation. A simple yet effective multi-modal self-supervised learning framework is proposed for this purpose. The proposed approach is able to localise anatomical regions of interest during ultrasound imaging, with only speech audio as a reference. Experimental evaluation on a large-scale clinical multi-modal ultrasound video dataset shows that the proposed self-supervised method learns good transferable anatomical representations that boost the performance of automated downstream clinical tasks, even outperforming fully-supervised solutions.
KW - cs.CV
DO - 10.48550/arXiv.2310.16477
M3 - Preprint
SP - 1
EP - 12
BT - Show from Tell
PB - arXiv
ER -