Retaining Semantics in Image to Music Conversion

Zeyu Xiong*, Pei Chun Lin, Amin Farjudian

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution


We propose a method for generating music from a given image through three stages of translation, from image to caption, caption to lyrics, and lyrics to instrumental music, which forms the content to be combined with a given style. We train our proposed model, which we call BGT (BLIP-GPT2-TeleMelody), on two open-source datasets, one containing over 200,000 labeled images, and another containing more than 175,000 MIDI music files. In contrast with pixel-level translation, the BGT model retains the semantics of the input image. We verify our claim through a user study in which participants were asked to match input images with generated music without access to the intermediate caption and lyrics. The results show that, while the matching rate among participants with low music expertise is essentially random, the rate among those with composition experience is significantly high, which strongly indicates that some semantic content of the input image is retained in the generated music.
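The three-stage translation described in the abstract can be pictured as a chain of functions, each consuming the previous stage's output. The following is a minimal sketch under stated assumptions, not the authors' implementation: the class `ImageToMusicPipeline`, the stage signatures, and the toy stand-in functions are all hypothetical, and the real BLIP, GPT-2, and TeleMelody models are not loaded here. The style-combination step mentioned in the abstract is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stage signatures (assumptions for illustration only).
Captioner = Callable[[bytes], str]      # image bytes -> caption
Lyricist = Callable[[str], str]         # caption -> lyrics
Composer = Callable[[str], List[int]]   # lyrics -> MIDI pitch numbers


@dataclass
class ImageToMusicPipeline:
    """Chains image -> caption -> lyrics -> melody, keeping intermediates."""
    caption_image: Captioner
    write_lyrics: Lyricist
    compose_melody: Composer

    def run(self, image: bytes) -> Dict[str, object]:
        caption = self.caption_image(image)
        lyrics = self.write_lyrics(caption)
        notes = self.compose_melody(lyrics)
        return {"caption": caption, "lyrics": lyrics, "notes": notes}


# Toy stand-ins so the sketch runs without any trained model.
def toy_captioner(image: bytes) -> str:
    return "a placeholder caption"


def toy_lyricist(caption: str) -> str:
    # Repeat the caption as two "lines" of lyrics.
    return caption + "\n" + caption


def toy_composer(lyrics: str) -> List[int]:
    # One pitch per word, offset from middle C (60) by word length mod 12.
    return [60 + len(word) % 12 for word in lyrics.split()]


pipeline = ImageToMusicPipeline(toy_captioner, toy_lyricist, toy_composer)
result = pipeline.run(b"")
print(result["caption"])
print(result["notes"])
```

Keeping each stage behind a plain callable means a real captioning, lyric-generation, or melody model could be swapped in without changing the chaining logic, and the intermediate caption and lyrics stay available for inspection even though the user study withheld them from participants.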

Original language: English
Title of host publication: 2022 IEEE International Symposium on Multimedia (ISM)
Number of pages: 8
ISBN (Electronic): 9781665471725
ISBN (Print): 9781665471732
Publication status: Published - 23 Jan 2023
Event: 24th IEEE International Symposium on Multimedia, ISM 2022 - Virtual, Online, Italy
Duration: 5 Dec 2022 to 7 Dec 2022

Publication series

Name: IEEE International Symposium on Multimedia
ISSN (Electronic): 2766-0001


Conference: 24th IEEE International Symposium on Multimedia, ISM 2022
City: Virtual, Online

Bibliographical note

Funding Information:
This research was supported by the Ministry of Education, R.O.C., under the grant TEEP@AsiaPlus, the Ministry of Science and Technology, R.O.C., under the grant No. MOST 109-2221-E-035-063-MY2, and by Feng Chia University under the 2022 Project Research Grant.

Publisher Copyright:
© 2022 IEEE.


  • machine learning
  • media composition
  • media semantics

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Science Applications
  • Signal Processing
  • Media Technology

