Audiovisual Moments in Time: A large-scale annotated dataset of audiovisual actions

Michael Joannou; Pia Rotshtein; Uta Noppeney

doi:10.1371/journal.pone.0301098

Corresponding author: Michael Joannou

Psychology

Research output: Contribution to journal › Article › peer-review

Abstract

We present Audiovisual Moments in Time (AVMIT), a large-scale dataset of audiovisual action events. In an extensive annotation task, 11 participants labelled a subset of 3-second audiovisual videos from the Moments in Time dataset (MIT). For each trial, participants assessed whether the labelled audiovisual action event was present and whether it was the most prominent feature of the video. The dataset includes the annotation of 57,177 audiovisual videos, each independently evaluated by 3 of 11 trained participants. From this initial collection, we created a curated test set of 16 distinct action classes, with 60 videos each (960 videos). We also offer 2 sets of pre-computed audiovisual feature embeddings, using VGGish/YamNet for audio data and VGG16/EfficientNetB0 for visual data, thereby lowering the barrier to entry for audiovisual DNN research. We explored the advantages of AVMIT annotations and feature embeddings to improve performance on audiovisual event recognition. A series of 6 Recurrent Neural Networks (RNNs) were trained on either AVMIT-filtered audiovisual events or modality-agnostic events from MIT, and then tested on our audiovisual test set. In all RNNs, top-1 accuracy increased by 2.71-5.94% when training exclusively on audiovisual events, outweighing the benefit of a three-fold increase in training data. Additionally, we introduce the Supervised Audiovisual Correspondence (SAVC) task, whereby a classifier must discern whether audio and visual streams correspond to the same action label. We trained 6 RNNs on the SAVC task, with or without AVMIT-filtering, to explore whether AVMIT is helpful for cross-modal learning. In all RNNs, accuracy improved by 2.09-19.16% with AVMIT-filtered data. We anticipate that the newly annotated AVMIT dataset will serve as a valuable resource for research and comparative experiments involving computational models and human participants, specifically when addressing research questions where audiovisual correspondence is of critical importance.
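
The SAVC task described in the abstract can be illustrated with a minimal Python sketch of how matched and mismatched audio-visual pairs might be assembled. This is an illustration under stated assumptions, not the authors' pipeline: the `Clip` record, the `make_savc_pairs` helper, the 1:1 ratio of matched to mismatched pairs, the random choice of a donor clip, and the example labels are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    """Hypothetical record: one video with per-modality embeddings and a label."""
    clip_id: str
    label: str
    audio: tuple   # placeholder for an audio embedding
    visual: tuple  # placeholder for a visual embedding


def make_savc_pairs(clips, rng):
    """Return (audio, visual, target) triples.

    target is 1 when both streams come from the same action label, 0 otherwise.
    The sampling scheme here is an assumption made for illustration.
    """
    by_label = {}
    for clip in clips:
        by_label.setdefault(clip.label, []).append(clip)
    if len(by_label) < 2:
        raise ValueError("need at least two action labels to build mismatches")

    pairs = []
    for clip in clips:
        # Corresponding pair: audio and visual taken from the same clip.
        pairs.append((clip.audio, clip.visual, 1))
        # Non-corresponding pair: audio borrowed from a clip with another label.
        other_label = rng.choice([lab for lab in by_label if lab != clip.label])
        donor = rng.choice(by_label[other_label])
        pairs.append((donor.audio, clip.visual, 0))
    return pairs


# Toy data with made-up labels and one-dimensional "embeddings".
clips = [
    Clip("a", "barking", (0.1,), (0.2,)),
    Clip("b", "barking", (0.3,), (0.4,)),
    Clip("c", "clapping", (0.5,), (0.6,)),
]
pairs = make_savc_pairs(clips, random.Random(0))
```

Each clip contributes one matched and one mismatched pair, so a binary classifier trained on `pairs` sees balanced targets. Other negative-sampling schemes would fit the task definition equally well.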

Original language	English
Article number	e0301098
Number of pages	19
Journal	PLoS ONE
Volume	19
Issue number	4
DOIs	https://doi.org/10.1371/journal.pone.0301098
Publication status	Published - 1 Apr 2024

Bibliographical note

Funding:
This research was funded by an Engineering and Physical Sciences Research Council (EPSRC) National Productivity Investment Fund (NPIF) studentship (MJ) and a European Research Council (ERC) starting grant: multsens (UN). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Access to Document

10.1371/journal.pone.0301098 | Licence: Creative Commons: Attribution (CC BY)

JoannouM2024Audiovisual | Final published version, 2.14 MB | Licence: Creative Commons: Attribution (CC BY)

Cite this

@article{c881ab1fd4ca4b69b9ed69212ee8ab5d,

title = "Audiovisual Moments in Time: A large-scale annotated dataset of audiovisual actions",

abstract = "We present Audiovisual Moments in Time (AVMIT), a large-scale dataset of audiovisual action events. In an extensive annotation task, 11 participants labelled a subset of 3-second audiovisual videos from the Moments in Time dataset (MIT). For each trial, participants assessed whether the labelled audiovisual action event was present and whether it was the most prominent feature of the video. The dataset includes the annotation of 57,177 audiovisual videos, each independently evaluated by 3 of 11 trained participants. From this initial collection, we created a curated test set of 16 distinct action classes, with 60 videos each (960 videos). We also offer 2 sets of pre-computed audiovisual feature embeddings, using VGGish/YamNet for audio data and VGG16/EfficientNetB0 for visual data, thereby lowering the barrier to entry for audiovisual DNN research. We explored the advantages of AVMIT annotations and feature embeddings to improve performance on audiovisual event recognition. A series of 6 Recurrent Neural Networks (RNNs) were trained on either AVMIT-filtered audiovisual events or modality-agnostic events from MIT, and then tested on our audiovisual test set. In all RNNs, top-1 accuracy increased by 2.71-5.94\% when training exclusively on audiovisual events, outweighing the benefit of a three-fold increase in training data. Additionally, we introduce the Supervised Audiovisual Correspondence (SAVC) task, whereby a classifier must discern whether audio and visual streams correspond to the same action label. We trained 6 RNNs on the SAVC task, with or without AVMIT-filtering, to explore whether AVMIT is helpful for cross-modal learning. In all RNNs, accuracy improved by 2.09-19.16\% with AVMIT-filtered data. We anticipate that the newly annotated AVMIT dataset will serve as a valuable resource for research and comparative experiments involving computational models and human participants, specifically when addressing research questions where audiovisual correspondence is of critical importance.",

author = "Michael Joannou and Pia Rotshtein and Uta Noppeney",

note = "Funding: This research was funded by an Engineering and Physical Sciences Research Council (EPSRC) National Productivity Investment Fund (NPIF) studentship (MJ) and a European Research Council (ERC) starting grant: multsens (UN). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.",

year = "2024",

month = apr,

day = "1",

doi = "10.1371/journal.pone.0301098",

language = "English",

volume = "19",

journal = "PLoS ONE",

issn = "1932-6203",

publisher = "Public Library of Science (PLOS)",

number = "4",

}

TY - JOUR

T1 - Audiovisual Moments in Time

T2 - A large-scale annotated dataset of audiovisual actions

AU - Joannou, Michael

AU - Rotshtein, Pia

AU - Noppeney, Uta

N1 - Funding: This research was funded by an Engineering and Physical Sciences Research Council (EPSRC) National Productivity Investment Fund (NPIF) studentship (MJ) and a European Research Council (ERC) starting grant: multsens (UN). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

PY - 2024/4/1

Y1 - 2024/4/1

AB - We present Audiovisual Moments in Time (AVMIT), a large-scale dataset of audiovisual action events. In an extensive annotation task, 11 participants labelled a subset of 3-second audiovisual videos from the Moments in Time dataset (MIT). For each trial, participants assessed whether the labelled audiovisual action event was present and whether it was the most prominent feature of the video. The dataset includes the annotation of 57,177 audiovisual videos, each independently evaluated by 3 of 11 trained participants. From this initial collection, we created a curated test set of 16 distinct action classes, with 60 videos each (960 videos). We also offer 2 sets of pre-computed audiovisual feature embeddings, using VGGish/YamNet for audio data and VGG16/EfficientNetB0 for visual data, thereby lowering the barrier to entry for audiovisual DNN research. We explored the advantages of AVMIT annotations and feature embeddings to improve performance on audiovisual event recognition. A series of 6 Recurrent Neural Networks (RNNs) were trained on either AVMIT-filtered audiovisual events or modality-agnostic events from MIT, and then tested on our audiovisual test set. In all RNNs, top-1 accuracy increased by 2.71-5.94% when training exclusively on audiovisual events, outweighing the benefit of a three-fold increase in training data. Additionally, we introduce the Supervised Audiovisual Correspondence (SAVC) task, whereby a classifier must discern whether audio and visual streams correspond to the same action label. We trained 6 RNNs on the SAVC task, with or without AVMIT-filtering, to explore whether AVMIT is helpful for cross-modal learning. In all RNNs, accuracy improved by 2.09-19.16% with AVMIT-filtered data. We anticipate that the newly annotated AVMIT dataset will serve as a valuable resource for research and comparative experiments involving computational models and human participants, specifically when addressing research questions where audiovisual correspondence is of critical importance.

U2 - 10.1371/journal.pone.0301098

DO - 10.1371/journal.pone.0301098

M3 - Article

SN - 1932-6203

VL - 19

JO - PLoS ONE

JF - PLoS ONE

IS - 4

M1 - e0301098

ER -
