Images that Sound:
Composing Images and Sounds on a Single Canvas

arXiv 2024

University of Michigan
Correspondence to: czyang@umich.edu

tl;dr: We use diffusion models to generate visual spectrograms that look like images but can also be played as sound.



Spectrograms are 2D representations of sound that look very different from the images found in our visual world. And natural images, when played as spectrograms, make unnatural sounds. In this paper, we show that it is possible to synthesize spectrograms that simultaneously look like natural images and sound like natural audio. We call these spectrograms images that sound. Our approach is simple and zero-shot, and it leverages pre-trained text-to-image and text-to-spectrogram diffusion models that operate in a shared latent space. During the reverse process, we denoise noisy latents with both the audio and image diffusion models in parallel, resulting in a sample that is likely under both models. Through quantitative evaluations and perceptual studies, we find that our method successfully generates spectrograms that align with a desired audio prompt while also taking the visual appearance of a desired image prompt.

We describe our method in more detail and show examples in our gallery below.



We pose the problem of generating images that sound as a multimodal composition problem: our goal is to obtain a sample that is likely under both the distribution of images and the distribution of spectrograms. To do this, we denoise with an image diffusion model and an audio diffusion model simultaneously. Given a noisy latent \(\mathbf{z}_t\), we compute two text-conditioned noise estimates, \(\boldsymbol{\epsilon}_{v}^{(t)}\) and \(\boldsymbol{\epsilon}_{a}^{(t)}\), one for each modality. We then obtain a multimodal noise estimate \(\tilde{\boldsymbol{\epsilon}}^{(t)}\) by taking a weighted average of the two, and use it to take a denoising step. Repeating this process iteratively results in a clean latent \(\mathbf{z}_0\). Finally, we decode this clean latent to a spectrogram and convert it into a waveform using a pretrained vocoder. As we only change the inference-time procedure, our method is zero-shot, requiring no training or fine-tuning.
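The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not our released code: the two "models" are toy closed-form stand-ins built by a hypothetical helper `make_toy_eps_model`, the weights `lam_v` and `lam_a` and the noise schedule `alpha_bar` are illustrative choices, and the update is a deterministic DDIM-style step chosen for simplicity. With real models, `eps_image` and `eps_audio` would be the text-conditioned noise predictions of the two pretrained diffusion models evaluated on the same latent.

```python
import numpy as np

def make_toy_eps_model(target, alpha_bar):
    """Hypothetical stand-in for a text-conditioned diffusion model.

    Returns the noise estimate that would be exact if the data distribution
    were the single point `target`. A real model is a neural network.
    """
    def eps_fn(z_t, t):
        ab = alpha_bar[t]
        return (z_t - np.sqrt(ab) * target) / np.sqrt(1.0 - ab)
    return eps_fn

def combine_noise(eps_v, eps_a, lam_v=0.5, lam_a=0.5):
    """Weighted average of the image (v) and audio (a) noise estimates."""
    return lam_v * eps_v + lam_a * eps_a

def sample_shared_latent(eps_image, eps_audio, alpha_bar, shape,
                         lam_v=0.5, lam_a=0.5, seed=0):
    """Reverse process driven by the combined noise estimate (DDIM-style, assumed)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(shape)  # start from pure noise z_T
    for t in range(len(alpha_bar) - 1, -1, -1):
        # Both models see the same noisy latent at the same timestep.
        eps = combine_noise(eps_image(z, t), eps_audio(z, t), lam_v, lam_a)
        ab_t = alpha_bar[t]
        ab_prev = alpha_bar[t - 1] if t > 0 else 1.0
        # Predict the clean latent, then step to the previous noise level.
        z0_hat = (z - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)
        z = np.sqrt(ab_prev) * z0_hat + np.sqrt(1.0 - ab_prev) * eps
    return z  # clean latent z_0

# Illustrative schedule: alpha_bar[t] shrinks as t grows (more noise).
alpha_bar = np.linspace(0.999, 0.01, 50)
shape = (4, 8, 8)
target_v = np.full(shape, 1.0)   # toy "image" mode
target_a = np.full(shape, -0.5)  # toy "audio" mode

eps_image = make_toy_eps_model(target_v, alpha_bar)
eps_audio = make_toy_eps_model(target_a, alpha_bar)
z0 = sample_shared_latent(eps_image, eps_audio, alpha_bar, shape)

# In the full pipeline, z0 would next be decoded to a spectrogram and
# passed to a pretrained vocoder; both steps are omitted here.
```

In this toy setup the final latent lands on the weighted average of the two targets, which makes the effect of the weights easy to see: raising `lam_v` pulls the sample toward the image model's mode, and raising `lam_a` pulls it toward the audio model's mode.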

Iteratively denoising using both a spectrogram diffusion model and an image diffusion model. See above for videos with sound.

Related Links and Works

Various musicians have inserted images into the spectrograms of their music, including Aphex Twin (at 5:27), Nine Inch Nails, and Venetian Snares, as has the soundtrack of Doom. Our work differs from these examples in that our spectrograms both look and sound natural.

Spectrogram Art, by Becky Buckle: an article about the history of artists concealing images in the spectrogram of their music.

SpectroGraphic, by Levi Borodenko: a tool to turn images into spectrograms and recover the corresponding audio.

Composable Diffusion, by Liu et al., which originally showed how to compose image diffusion models together.

Visual Anagrams, by Geng et al., which uses pretrained diffusion models and compositionality to make multi-view optical illusions.

Factorized Diffusion, by Geng et al., which generates various perceptual illusions via decomposition with diffusion models. We use their code to colorize the spectrograms.

Diffusion Illusions, by Burgert et al., which produces multi-view illusions, along with other visual effects, through score distillation sampling (SDS). We adapt their code to make an SDS-style baseline for generating images that sound.


  title     = {Images that Sound: Composing Images and Sounds on a Single Canvas},
  author    = {Chen, Ziyang and Geng, Daniel and Owens, Andrew},
  year      = {2024},
  journal   = {arXiv preprint arXiv:2405.12221},
  url       = {https://arxiv.org/abs/2405.12221},