Learning Visual Voice Activity Detection with an Automatically Annotated Dataset

Guy, Sylvain; Lathuilière, Stéphane; Mesejo Santiago, Pablo; Horaud, Radu

Learning_Visual_Voice_Activity PREPRINT.pdf (6.407Mb)

Identifiers

URI: http://hdl.handle.net/10481/70588

Publisher

IEEE

Date

2020-10-16

Bibliographic reference

Published version: S. Guy... [et al.]. "Learning Visual Voice Activity Detection with an Automatically Annotated Dataset," 2020 25th International Conference on Pattern Recognition (ICPR), 2021, pp. 4851-4856, doi: 10.1109/ICPR48806.2021.9412884

Sponsorship

European Commission 871245 SPRING; Multidisciplinary Institute in Artificial Intelligence (MIAI) ANR-19-P3IA-0003

Abstract

Visual voice activity detection (V-VAD) uses visual features to predict whether a person is speaking or not. V-VAD is useful whenever audio VAD (A-VAD) is inefficient, either because the acoustic signal is difficult to analyze or because it is simply missing. We propose two deep architectures for V-VAD, one based on facial landmarks and one based on optical flow. Moreover, available datasets, used for learning and for testing V-VAD, lack content variability. We introduce a novel methodology to automatically create and annotate very large in-the-wild datasets (WildVVAD), based on combining A-VAD with face detection and tracking. A thorough empirical evaluation shows the advantage of training the proposed deep V-VAD models with this dataset.

Collections

OpenAIRE (Open Access Infrastructure for Research in Europe)

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Spain