Images that Sound: Multimodal AI Art

Ziyang Chen         Daniel Geng         Andrew Owens
University of Michigan
Correspondence to: czyang@umich.edu

tl;dr: We use diffusion models to generate multimodal art images that can be both seen and heard.


We explore and present a novel type of multimodal art called Images that Sound, leveraging diffusion models. Our objective is to create images that not only captivate visually but also offer an auditory experience. When converted to grayscale, these images double as audio spectrograms, offering an immersive multimedia experience that engages both the eyes and the ears. To craft such artwork, we use pretrained text-to-image and text-to-audio diffusion models, either to jointly optimize the image pixels through score distillation sampling or to guide the denoising process with both models. Finally, we use a pretrained vocoder to convert the spectrograms to audio. We show some examples in the gallery below.
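To make the "image as spectrogram" idea concrete, here is a minimal sketch of the last stage only: an RGB image is reduced to grayscale and its rows are read as frequency bins over time. It is not our pipeline. The channel averaging, the linear frequency axis, the parameter values, and the function names `rgb_to_grayscale` and `spectrogram_to_audio` are all assumptions made for illustration, and a crude additive sine synthesis stands in for the pretrained vocoder.

```python
import numpy as np


def rgb_to_grayscale(image):
    """Average the color channels: (H, W, 3) in [0, 1] -> (H, W).

    Assumption: plain channel averaging; other grayscale conversions exist.
    """
    return image.mean(axis=-1)


def spectrogram_to_audio(spec, sample_rate=16000, hop=256,
                         f_min=100.0, f_max=7000.0, seed=0):
    """Turn a (frequency bins, time frames) magnitude array into a waveform.

    Stand-in for a vocoder: each row drives one sinusoid whose amplitude
    follows that row over time. Row 0 (top of the image) is the highest
    frequency. All parameter values here are illustrative assumptions.
    """
    n_bins, n_frames = spec.shape
    n_samples = n_frames * hop
    t = np.arange(n_samples) / sample_rate
    freqs = np.linspace(f_max, f_min, n_bins)
    # Random phases avoid all sinusoids peaking at the same instant.
    rng = np.random.default_rng(seed)
    phases = rng.uniform(0.0, 2.0 * np.pi, n_bins)
    # Hold each frame's amplitude for `hop` samples.
    envelope = np.repeat(spec, hop, axis=1)
    carriers = np.sin(2.0 * np.pi * freqs[:, None] * t[None, :] + phases[:, None])
    audio = (envelope * carriers).sum(axis=0)
    peak = np.abs(audio).max()
    return audio / peak if peak > 0 else audio


# Toy input: a black 64x64 image with one white horizontal line at row 16.
image = np.zeros((64, 64, 3))
image[16, :, :] = 1.0
gray = rgb_to_grayscale(image)
audio = spectrogram_to_audio(gray)
```

In this toy case the single bright row becomes a steady tone, so the spectrogram of `audio` shows the same horizontal line. A real vocoder would instead take a (typically log-mel) spectrogram and produce far more natural audio.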

Play the video to hear the audio!

More videos will be coming soon!

Related Links

This work is inspired by prior work in related areas, including:

Aphex Twin, a musician who hid his face in the spectrogram of his songs.

SpectroGraphic, which turns an image into sound whose spectrogram looks like the image.

Visual Anagrams, by Daniel Geng et al., which uses pretrained diffusion models to make multi-view optical illusions.

Diffusion Illusions, by Ryan Burgert et al., which produces multi-view illusions, along with other visual effects, through score distillation sampling.


  title     = {Images that Sound},
  author    = {Chen, Ziyang and Geng, Daniel and Owens, Andrew},
  year      = {2024},
  url       = {},