arxiv:2507.22008

VeS: Teaching Pixels to Listen Without Supervision

Published on Jul 29, 2025
Authors:

Abstract

Dense token routing in multilingual audio-visual models improves retrieval and localization performance, especially in low-resource settings.

AI-generated summary

Recent dense audio-visual (AV) models achieve impressive retrieval and emergent localization, but almost all evidence comes from English-centric, caption-rich web video. It is unclear whether these objectives survive in low-resource, code-switched, and noisy multilingual settings that typify developing regions. We show they do, and that the choice of aggregation function becomes even more critical. Using a multilingual subset of Project Vaani spanning dozens of Indian languages and dialectal variants, we compare three contrastive objectives: (i) a global mean-pooled loss (CLIP-style), (ii) a dense max-mean token matcher (DenseAV-style), and (iii) a simple hybrid (motivated by frozen-vision alignment strategies). The dense objective delivers a +59% relative R@1 (Audio Visual) improvement over global pooling and substantially lower mean/median ranks, while consistently producing sharp zero-shot localization heatmaps of spoken objects, despite keeping the vision backbone entirely frozen (no LoRA / partial fine-tuning). Our results demonstrate that dense token routing is not a luxury of high-resource English corpora; it is more decisive when annotations and acoustic cleanliness are scarce. We release the codebase and trained models.
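The comparison of aggregation functions in the summary can be made concrete with a short sketch. The PyTorch snippet below is not the paper's released code; the tensor shapes, function names, and the exact max/mean ordering are illustrative assumptions. It contrasts a CLIP-style mean-pooled similarity with a DenseAV-style max-mean token matcher over a batch of audio and visual token embeddings.

```python
# Illustrative sketch (not the authors' implementation) of the two aggregation
# strategies compared in the paper. Assumed shapes: audio tokens (B, Ta, D),
# visual tokens (B, Tv, D).
import torch
import torch.nn.functional as F


def global_pooled_similarity(audio_tokens: torch.Tensor,
                             visual_tokens: torch.Tensor) -> torch.Tensor:
    """CLIP-style: mean-pool each modality, then one dot product per pair."""
    a = F.normalize(audio_tokens.mean(dim=1), dim=-1)   # (B, D)
    v = F.normalize(visual_tokens.mean(dim=1), dim=-1)  # (B, D)
    return a @ v.t()                                     # (B, B) pairwise scores


def max_mean_similarity(audio_tokens: torch.Tensor,
                        visual_tokens: torch.Tensor) -> torch.Tensor:
    """DenseAV-style max-mean: each audio token takes its best-matching visual
    token (max over visual tokens), then scores are averaged over audio tokens."""
    a = F.normalize(audio_tokens, dim=-1)   # (B, Ta, D)
    v = F.normalize(visual_tokens, dim=-1)  # (B, Tv, D)
    # Token-level similarities for every audio/visual pair: (B, B, Ta, Tv)
    sims = torch.einsum('iad,jvd->ijav', a, v)
    return sims.max(dim=-1).values.mean(dim=-1)  # (B, B) pairwise scores


if __name__ == "__main__":
    audio = torch.randn(4, 50, 256)   # batch of 4, 50 audio tokens, dim 256
    video = torch.randn(4, 196, 256)  # 14x14 = 196 visual patch tokens
    print(global_pooled_similarity(audio, video).shape)  # torch.Size([4, 4])
    print(max_mean_similarity(audio, video).shape)       # torch.Size([4, 4])
```

Either similarity matrix can be fed to a standard symmetric InfoNCE loss; the token-level similarity map inside `max_mean_similarity` is also what would be visualized to obtain the zero-shot localization heatmaps the abstract describes.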
