Our method consists of two parts: (a) Cross-view audio prediction for representation learning; and (b) Sound localization from motion: jointly estimating sound direction and camera rotation.
Learning representation via spatialization
We first learn a feature representation by predicting how changes in images lead to changes in sound, using a cross-view binauralization pretext task: we convert mono sound to binaural sound at a target viewpoint, after conditioning the model on observations from a source viewpoint.
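To make the pretext task concrete, the PyTorch sketch below conditions an encoded mono spectrogram on visual features from both the source and target viewpoints and predicts the two binaural channels at the target view. The module names, layer sizes, and mask-free regression objective (e.g., `CrossViewBinauralizer`) are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of cross-view binauralization (illustrative architecture only).
import torch
import torch.nn as nn

class CrossViewBinauralizer(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        # Visual encoder shared between the source and target views (hypothetical CNN).
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Audio encoder/decoder over mono spectrograms (stand-in for a U-Net).
        self.audio_encoder = nn.Conv2d(1, feat_dim, 3, padding=1)
        self.decoder = nn.Conv2d(feat_dim * 3, 2, 3, padding=1)  # left/right channels

    def forward(self, mono_spec, src_img, tgt_img):
        # Condition the mono audio on observations from both viewpoints.
        v_src = self.visual_encoder(src_img)    # (B, feat_dim)
        v_tgt = self.visual_encoder(tgt_img)    # (B, feat_dim)
        a = self.audio_encoder(mono_spec)       # (B, feat_dim, F, T)
        B, _, F, T = a.shape
        v = torch.cat([v_src, v_tgt], dim=1)[..., None, None].expand(B, -1, F, T)
        # Predict the binaural spectrogram at the target viewpoint.
        return self.decoder(torch.cat([a, v], dim=1))

# Training objective: regress the ground-truth binaural spectrogram at the target view.
model = CrossViewBinauralizer()
mono = torch.randn(4, 1, 64, 64)                          # mono spectrogram
src, tgt = torch.randn(4, 3, 96, 96), torch.randn(4, 3, 96, 96)
pred = model(mono, src, tgt)                              # (4, 2, 64, 64)
loss = nn.functional.l1_loss(pred, torch.randn_like(pred))
```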
Estimating pose and localizing sound
We use the representation to jointly solve two pose estimation tasks: visual rotation estimation and binaural sound localization. We train the models with a cross-modal geometric consistency constraint: the visual rotation angle \(\phi_{s,t}\) should be consistent with the difference between the predicted sound angles \(\theta_s\) and \(\theta_t\):
\[ \text{geometric consistency: } \phi_{s,t} = \theta_t - \theta_s \]
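One way this consistency could be enforced as a training loss is sketched below, assuming angles in radians and an L1 penalty; the function names (`geometric_consistency_loss`, `angle_difference`) and the exact loss form are assumptions for illustration, and differences are wrapped to \((-\pi, \pi]\) to handle the circularity of angles.

```python
# Sketch of the cross-modal geometric consistency term (assumed loss form).
import torch

def angle_difference(a, b):
    """Signed difference a - b, wrapped to (-pi, pi]."""
    d = a - b
    return torch.atan2(torch.sin(d), torch.cos(d))

def geometric_consistency_loss(phi_st, theta_s, theta_t):
    """Penalize disagreement between the visual rotation phi_{s,t}
    and the change in predicted sound direction theta_t - theta_s."""
    return angle_difference(phi_st, angle_difference(theta_t, theta_s)).abs().mean()

# Example: a 0.52 rad (~30 degree) camera rotation should match a 0.52 rad
# shift in the predicted sound angle, giving a loss near zero.
phi = torch.tensor([0.52])
theta_s, theta_t = torch.tensor([0.10]), torch.tensor([0.62])
print(geometric_consistency_loss(phi, theta_s, theta_t))
```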