Audio2Gestures: Generating Diverse Gestures from Speech Audio with Conditional Variational Autoencoders

Jing Li, Di Kang, Wenjie Pei, Xuefei Zhe, Ying Zhang, Zhenyu He*, Linchao Bao

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Generating conversational gestures from speech audio is challenging due to the inherent one-to-many mapping between audio and body motions. Conventional CNNs/RNNs assume one-to-one mapping, and thus tend to predict the average of all possible target motions, resulting in plain/boring motions during inference. In order to overcome this problem, we propose a novel conditional variational autoencoder (VAE) that explicitly models one-to-many audio-to-motion mapping by splitting the cross-modal latent code into shared code and motion-specific code. The shared code mainly models the strong correlation between audio and motion (such as the synchronized audio and motion beats), while the motion-specific code captures diverse motion information independent of the audio. However, splitting the latent code into two parts poses training difficulties for the VAE model. A mapping network facilitating random sampling along with other techniques including relaxed motion loss, bicycle constraint, and diversity loss are designed to better train the VAE. Experiments on both 3D and 2D motion datasets verify that our method generates more realistic and diverse motions than state-of-the-art methods, quantitatively and qualitatively. Finally, we demonstrate that our method can be readily used to generate motion sequences with user-specified motion clips on the timeline. Code and more results are at https://jingli513.github.io/audio2gestures.
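
The latent-code split described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`encode_audio`, `sample_motion_specific`, `decode_motion`, `generate`), the code sizes, and the fixed linear "decoder" are all hypothetical stand-ins chosen only to show the idea that one audio input yields one shared code but many possible motions, depending on the sampled motion-specific code.

```python
import random

SHARED_DIM = 4     # hypothetical size of the shared code
SPECIFIC_DIM = 4   # hypothetical size of the motion-specific code


def encode_audio(audio_frame):
    """Toy stand-in for an audio encoder: maps audio features to a shared code.

    Deterministic, so the same audio always produces the same shared code.
    """
    total = sum(audio_frame)
    return [total * (i + 1) / SHARED_DIM for i in range(SHARED_DIM)]


def sample_motion_specific(rng):
    """Sample a motion-specific code independently of the audio."""
    return [rng.gauss(0.0, 1.0) for _ in range(SPECIFIC_DIM)]


def decode_motion(shared, specific):
    """Toy stand-in for a motion decoder over the concatenated latent code."""
    latent = shared + specific
    n = len(latent)
    # A fixed linear mix; a real decoder would be a trained network.
    return [0.5 * latent[i] + 0.25 * latent[(i + 1) % n] for i in range(n)]


def generate(audio_frame, seed):
    """Return (shared code, motion-specific code, decoded motion) for one sample."""
    rng = random.Random(seed)
    shared = encode_audio(audio_frame)
    specific = sample_motion_specific(rng)
    return shared, specific, decode_motion(shared, specific)


if __name__ == "__main__":
    audio = [0.1, 0.4, -0.2]  # made-up audio features
    shared_a, _, motion_a = generate(audio, seed=0)
    shared_b, _, motion_b = generate(audio, seed=1)
    print("same shared code:", shared_a == shared_b)
    print("different motions:", motion_a != motion_b)
```

Calling `generate` twice on the same audio with different seeds keeps the shared code fixed while the decoded motion changes, which is the one-to-many behaviour the abstract attributes to the split latent code. The training techniques named in the abstract (mapping network, relaxed motion loss, bicycle constraint, diversity loss) are not modelled here.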
Original language: English
Title of host publication: 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
Publisher: IEEE
Pages: 11273-11282
Number of pages: 10
ISBN (Electronic): 9781665428125
ISBN (Print): 9781665428132 (PoD)
Publication status: Published - 28 Feb 2022
Event: 2021 IEEE/CVF International Conference on Computer Vision (ICCV) - Montreal, QC, Canada
Duration: 10 Oct 2021 to 17 Oct 2021

Publication series

Name: Proceedings of the IEEE International Conference on Computer Vision
Publisher: IEEE
ISSN (Print): 1550-5499
ISSN (Electronic): 2380-7504

Conference

Conference: 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
Period: 10/10/21 to 17/10/21

Keywords

  • Training
  • Computer vision
  • Codes
  • Three-dimensional displays
  • Correlation
  • Speech coding
  • Bicycles
