MultiFoley

Abstract

Generating sound effects for videos often requires creating artistic sound effects that diverge significantly from real-life sources, as well as flexible control over the sound design. To address this problem, we introduce MultiFoley, a model designed for video-guided sound generation that supports multimodal conditioning through text, audio, and video. Given a silent video and a text prompt, MultiFoley allows users to create clean sounds (e.g., skateboard wheels spinning without wind noise) or more whimsical sounds (e.g., making a lion's roar sound like a cat's meow). MultiFoley also allows users to choose reference audio from sound effects (SFX) libraries or partial videos for conditioning. A key novelty of our model lies in its joint training on both internet video datasets with low-quality audio and professional SFX recordings, enabling high-quality, full-bandwidth (48 kHz) audio generation. Through automated evaluations and human studies, we demonstrate that MultiFoley successfully generates synchronized high-quality sounds across varied conditional inputs and outperforms existing methods.

Foley Generation with Text Control

We generate a synchronized soundtrack for silent videos given a text prompt. In the examples below, you can select a text prompt using the buttons to the right of each video to hear the corresponding generated audio. (Click a button to play the video, and click it again to pause.)

Foley Generation with Audio Control

Our model also allows users to generate Foley sound with reference audio from sound effects libraries. We show some examples below.

Foley Audio Extension

Given a video with a partial soundtrack, our model can extend it into a complete Foley soundtrack. We show some examples below.

BibTeX

@inproceedings{chen2024multifoley,
      author    = { Chen, Ziyang and 
                    Seetharaman, Prem and 
                    Russell, Bryan and 
                    Nieto, Oriol and
                    Bourgin, David and
                    Owens, Andrew and 
                    Salamon, Justin},
      title     = {Video-Guided Foley Sound Generation with Multimodal Controls},
      booktitle = {The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR)},
      year      = {2025},
    }

Video-Guided Foley Sound Generation with Multimodal Controls

CVPR 2025

tl;dr: We introduce MultiFoley, a model designed for video-guided sound generation that supports multimodal conditioning through text, audio, and video.
