Abstract: Audio-visual zero-shot learning (ZSL) leverages both video and audio information for model training, aiming to classify new video categories that were not seen during training. However, ...
Abstract: Audio-visual target speaker extraction (AV-TSE) aims to extract the specific person's speech from the audio mixture given auxiliary visual cues. Previous methods usually search for the ...
When CBS executives viewed a completed cut of “A Charlie Brown Christmas” in 1965 — 10 days before it was scheduled to air for the first time — they were horrified. “They hated it,” producer Lee ...
We base our datasets on the AVCA repository. The dataset structure is identical to AVCA and the dataset folder is called avgzsl_benchmark_non_averaged_datasets. The only difference is that we use ...
In this paper, we propose a new multi-modal task, termed audio-visual instance segmentation (AVIS), which aims to simultaneously identify, segment and track individual sounding object instances in ...
Meta describes SAM Audio as a unified AI audio model that uses text-based commands, visual cues, and time-based instructions to identify and separate sounds from a complex mixture. Traditionally, ...