Sound Localization by Self-Supervised Time Delay Estimation

Ziyang Chen
David F. Fouhey
Andrew Owens

University of Michigan

ECCV 2022


Given a stereo audio recording, we estimate a sound's interaural time delay. Our model learns through self-supervision to find correspondences between the signals in each channel, from which the time delay can be estimated. We show time delay predictions for two scenes, along with their corresponding video frames (not used by the model). In both cases, the sound source changes its position in a scene, resulting in a corresponding change in time delay.


Sounds in the world arrive at one microphone in a stereo pair sooner than the other, resulting in an interaural time delay that conveys their direction. Estimating a sound's time delay requires finding correspondences between the signals recorded by each microphone. We propose to learn these correspondences through self-supervision, drawing on recent techniques from visual tracking. We adapt the contrastive random walk of Jabri et al. to learn a cycle-consistent representation for binaural matching, resulting in a model that performs on par with supervised methods on "in the wild" internet recordings. We also propose a multimodal contrastive learning model that solves a visually-guided localization task: estimating the time delay for a particular person in a multi-speaker mixture, given a visual representation of their face.
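To make the idea of "finding correspondences between the signals recorded by each microphone" concrete, here is a minimal sketch of the classical, non-learned way to estimate a time delay: slide one channel against the other and keep the lag with the highest cross-correlation. This is only an illustrative baseline, not the self-supervised model described above. The function names, the `max_lag` parameter, and the sign convention (positive lag means the right channel is delayed) are assumptions made for this example.

```python
import math


def estimate_delay(left, right, max_lag):
    """Return the lag (in samples) that best aligns `right` to `left`.

    Assumed convention: a positive result means the right channel is
    delayed, i.e. the sound reached the left microphone first.
    """
    n = min(len(left), len(right))
    best_lag, best_score = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        # Correlate left[i] with right[i + lag] over the overlapping samples.
        start = max(0, -lag)
        stop = min(n, n - lag)
        score = sum(left[i] * right[i + lag] for i in range(start, stop))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def samples_to_seconds(lag, sample_rate):
    """Convert a lag in samples to a delay in seconds."""
    return lag / sample_rate
```

For example, calling `estimate_delay(left, right, max_lag=10)` on two equal-length lists of samples searches lags from -10 to 10 samples. Plain cross-correlation of this kind tends to degrade with noise and reverberation, which is part of the motivation for learning the correspondences instead.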


In-the-wild Evaluation Dataset

We collected 30 binaural videos from the internet, extracted 1K samples from them, and used human judgments to label the sound directions. The videos contain a variety of sounds, including engine noise and human speech. The dataset can be downloaded from our GitHub repo. Our method can predict more than left vs. right; we restrict the labels to these two directions only to simplify evaluation.
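As a rough illustration of how a signed time delay can be reduced to a left vs. right label and compared with human judgments, consider the sketch below. The sign convention (positive delay means the source is on the left), the function names, and the use of a simple agreement rate as the score are all assumptions for this example; they are not taken from the released evaluation code.

```python
def direction_from_delay(delay_samples):
    """Map a signed delay to a coarse direction label.

    Assumed convention: a positive delay means the right channel lags,
    so the source is on the left.
    """
    if delay_samples > 0:
        return "left"
    if delay_samples < 0:
        return "right"
    return "center"


def left_right_agreement(delays, labels):
    """Fraction of predicted directions that match the human labels."""
    if len(delays) != len(labels):
        raise ValueError("delays and labels must have the same length")
    if not labels:
        raise ValueError("need at least one labeled sample")
    hits = sum(direction_from_delay(d) == lab for d, lab in zip(delays, labels))
    return hits / len(labels)
```

Here `delays` would hold one predicted delay per sample and `labels` the matching `"left"` or `"right"` annotations. A prediction of exactly zero maps to `"center"` and therefore counts as a miss under this hypothetical scoring.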


Qualitative Results

In-the-wild video results

Visually-guided time delay estimation

Binaural car demo

iPhone video demo (Video Credits)

Paper and Supplementary Material

Ziyang Chen, David F. Fouhey, Andrew Owens.
Sound Localization by Self-Supervised Time Delay Estimation.
arXiv 2022.



This work was funded in part by DARPA Semafor and Cisco Systems. The views, opinions and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. The webpage template was originally made by Phillip Isola and Richard Zhang for a Colorization project.