Images that Sound

Overview

Spectrograms are 2D representations of sound that look very different from the images found in our visual world. Conversely, natural images, when played as spectrograms, make unnatural sounds. In this paper, we show that it is possible to synthesize spectrograms that simultaneously look like natural images and sound like natural audio. We call these spectrograms images that sound. Our approach is simple and zero-shot, and it leverages pretrained text-to-image and text-to-spectrogram diffusion models that operate in a shared latent space. During the reverse process, we denoise noisy latents with both the audio and image diffusion models in parallel, resulting in a sample that is likely under both models. Through quantitative evaluations and perceptual studies, we find that our method successfully generates spectrograms that align with a desired audio prompt while also taking on the visual appearance of a desired image prompt.

We describe our method in more detail and show examples in our gallery below.
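To make the idea of "playing" an image as a spectrogram concrete, here is a minimal sketch that treats a 2D array as a magnitude spectrogram and renders it with additive sinusoidal synthesis. This is an illustration only, not the vocoder-based pipeline described in the Method section: the helper name `image_to_waveform`, the linear frequency mapping, and the default sample rate and frequency range are all assumptions made for this example.

```python
import numpy as np


def image_to_waveform(image, sample_rate=16000, duration=1.0,
                      f_min=200.0, f_max=4000.0):
    """Render a 2D array as sound (hypothetical helper, illustration only).

    Rows are treated as frequency bins (top row = highest frequency) and
    columns as time frames, so pixel brightness becomes sinusoid amplitude.
    """
    image = np.asarray(image, dtype=np.float64)
    n_freqs, n_frames = image.shape
    n_samples = int(sample_rate * duration)
    t = np.arange(n_samples) / sample_rate

    # Assumed linear mapping from row index to frequency, top row highest.
    freqs = np.linspace(f_max, f_min, n_freqs)

    # For every audio sample, find the image column (time frame) it falls in.
    frame_idx = np.minimum((t / duration * n_frames).astype(int), n_frames - 1)
    amplitudes = image[:, frame_idx]                      # (n_freqs, n_samples)

    # One sinusoid per row, weighted by the pixel values over time.
    sinusoids = np.sin(2.0 * np.pi * freqs[:, None] * t[None, :])
    waveform = (amplitudes * sinusoids).sum(axis=0)

    # Normalize to [-1, 1] unless the image is entirely silent.
    peak = np.max(np.abs(waveform))
    return waveform / peak if peak > 0 else waveform
```

Passing an ordinary photograph (converted to a grayscale array) through a function like this typically produces noise-like audio, which is the mismatch between natural images and natural sounds that the overview describes.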

Gallery

Play the video to hear the audio!

Colorful Images that Sound

Grayscale Images that Sound

Method

We pose the problem of generating images that sound as a multimodal composition problem: our goal is to obtain a sample that is likely under both the distribution of images and the distribution of spectrograms. To do this, we simultaneously denoise using an image diffusion model and an audio diffusion model. Given a noisy latent \(\mathbf{z}_t\), we compute two text-conditioned noise estimates, \(\boldsymbol{\epsilon}_{v}^{(t)}\) and \(\boldsymbol{\epsilon}_{a}^{(t)}\), one for each modality. We then combine them into a multimodal noise estimate \(\tilde{\boldsymbol{\epsilon}}^{(t)}\) via weighted averaging and use it to take a denoising step. Repeating this process iteratively results in a clean latent \(\mathbf{z}_0\). Finally, we decode this clean latent to a spectrogram and convert it into a waveform using a pretrained vocoder. As we only change the inference-time procedure, our method is zero-shot, requiring no training or fine-tuning.

Iteratively denoising using both a spectrogram diffusion model and an image diffusion model. See above for videos with sound.
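The denoising loop described in this section can be sketched in a few lines. The weighted average can be written as \(\tilde{\boldsymbol{\epsilon}}^{(t)} = \lambda_v \boldsymbol{\epsilon}_{v}^{(t)} + \lambda_a \boldsymbol{\epsilon}_{a}^{(t)}\), where the weight symbols \(\lambda_v\) and \(\lambda_a\) are names assumed here for illustration. The code below is a minimal sketch under several assumptions, not the released implementation: it uses a deterministic DDIM-style update, the two noise predictors are toy stand-ins rather than pretrained diffusion models, and text conditioning, latent decoding, and the vocoder are omitted. The names `compose_denoise`, `make_toy_eps_model`, `w_image`, and `w_audio` are hypothetical.

```python
import numpy as np


def make_toy_eps_model(mu, alphas_cumprod):
    """Toy stand-in for a noise predictor (assumption, not a real model).

    It is the exact noise estimate for a data distribution that is a single
    point `mu`, which makes the behavior of the loop easy to check.
    """
    def eps_fn(z, t):
        a_t = alphas_cumprod[t]
        return (z - np.sqrt(a_t) * mu) / np.sqrt(1.0 - a_t)
    return eps_fn


def compose_denoise(z_T, eps_image, eps_audio, alphas_cumprod,
                    w_image=0.5, w_audio=0.5):
    """Denoise one latent with two models by averaging their noise estimates.

    Uses a deterministic DDIM-style step (an assumed choice of sampler).
    `alphas_cumprod[t]` is the cumulative signal level at step t, decreasing
    in t, with every entry strictly between 0 and 1.
    """
    z = z_T
    for t in range(len(alphas_cumprod) - 1, -1, -1):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else 1.0

        eps_v = eps_image(z, t)   # image-model noise estimate
        eps_a = eps_audio(z, t)   # audio-model noise estimate
        eps = w_image * eps_v + w_audio * eps_a   # multimodal estimate

        # Predict the clean latent, then step to the previous noise level.
        z0_pred = (z - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        z = np.sqrt(a_prev) * z0_pred + np.sqrt(1.0 - a_prev) * eps
    return z


# Usage with the toy predictors: two "models" that prefer different latents.
alphas = np.linspace(0.999, 0.01, 50)
z_T = np.random.default_rng(0).standard_normal((4, 4))
z_0 = compose_denoise(
    z_T,
    make_toy_eps_model(np.full((4, 4), 2.0), alphas),
    make_toy_eps_model(np.full((4, 4), -1.0), alphas),
    alphas,
)
```

With these toy predictors and weights that sum to one, the result lands on the weighted mean of the two target points, a compromise between what each model prefers. With real diffusion models, changing the two weights would shift the balance between how image-like and how audio-like the sample is.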

BibTeX

@article{chen2024images,
  title     = {Images that Sound: Composing Images and Sounds on a Single Canvas},
  author    = {Chen, Ziyang and Geng, Daniel and Owens, Andrew},
  journal   = {Neural Information Processing Systems (NeurIPS)},
  year      = {2024},
  url       = {https://ificl.github.io/images-that-sound/},
}

Images that Sound:
Composing Images and Sounds on a Single Canvas

NeurIPS 2024

tl;dr: We use diffusion models to generate visual spectrograms that look like images but can also be played as sound.

