Online multichannel speech enhancement combining statistical signal processing and deep neural networks
Metadata
Author: Martín Doñas, Juan M.
Publisher: Universidad de Granada
Department: Universidad de Granada. Programa de Doctorado en Tecnologías de la Información y la Comunicación
Subjects: Online multichannel; statistical signal; deep neural networks
Date: 2021
Defense date: 2021-01-25
Bibliographic reference: Martín Doñas, Juan Manuel. Online multichannel speech enhancement combining statistical signal processing and deep neural networks. Granada: Universidad de Granada, 2021. [http://hdl.handle.net/10481/66402]
Sponsorship: Tesis Univ. Granada.
Abstract
Speech-related applications on mobile devices require high-performance speech enhancement
algorithms to tackle challenging real-world noisy environments. These speech processing
techniques have to ensure good noise reduction capabilities with low speech distortion, thus
improving the perceptual speech quality and intelligibility of the enhanced speech signal. In
addition, current mobile devices often embed several microphones, allowing them to exploit
the spatial information during the enhancement procedure. Moreover, low latency
and computational efficiency are requirements for the widespread use of these technologies. Among the different
speech processing paradigms, statistical signal processing offers limited performance under
non-stationary noisy environments, while deep neural networks can lack generalization under
real conditions.
The main goal of this Thesis is the development of online multichannel speech enhancement
algorithms for speech services in mobile devices. The proposed techniques use
multichannel signal processing to increase the noise reduction performance without degrading
the quality of the speech signal. Moreover, deep neural networks are applied in specific
parts of the algorithm where modeling by classical methods would otherwise be difficult
or very limited. This allows for the use of more capable deep learning methods in real-time
online processing algorithms. Our contributions focus on different noisy environments where
these mobile speech technologies can be applied.
First, we develop a speech enhancement algorithm suitable for dual-microphone smartphones
used in noisy and reverberant environments. The noisy speech signal is processed
using a beamforming-plus-postfiltering strategy that exploits the dual-channel properties of
the clean speech and noise signals to obtain more accurate acoustic parameters. Thus, the
temporal variability of the relative transfer functions between acoustic channels is tracked
by using an extended Kalman filter framework. Noise statistics are obtained by means
of a recursive procedure using the speech presence probability. This speech presence is
estimated through either statistical spatial models or deep neural network mask estimators,
both exploiting dual-channel features from the noisy speech signal.
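The recursive noise-statistics update driven by the speech presence probability (SPP) can be sketched as follows. This is a minimal illustration, not the thesis' exact formulation: the function name, the smoothing constant, and the toy data are assumptions; the idea is only that frames with high SPP leave the noise estimate nearly untouched, while likely noise-only frames pull it toward the observed power.

```python
import numpy as np

def update_noise_psd(noise_psd, noisy_power, spp, alpha=0.9):
    """SPP-driven recursive noise PSD update (illustrative sketch).

    When speech is likely absent (spp ~ 0), the estimate moves toward the
    observed power; when speech is likely present (spp ~ 1), the previous
    estimate is kept almost unchanged.
    """
    # Effective smoothing factor: high SPP means barely updating the noise.
    effective_alpha = alpha + (1.0 - alpha) * spp
    return effective_alpha * noise_psd + (1.0 - effective_alpha) * noisy_power

# Toy run: track the noise power of one time-frequency bin over
# 50 noise-only frames (exponentially distributed periodogram values).
rng = np.random.default_rng(0)
noise_psd = 1.0
for _ in range(50):
    noisy_power = rng.exponential(1.0)
    noise_psd = update_noise_psd(noise_psd, noisy_power, spp=0.1)
```

With spp=1 the update reduces to the identity, so speech-dominated frames cannot leak into the noise estimate; with spp=0 it is a plain first-order recursive average.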
Then, we propose a recursive expectation-maximization framework for online multichannel
speech enhancement. The goal is the joint estimation of the clean speech statistics and the acoustic model parameters in order to increase robustness under non-stationary
conditions. The noisy speech signal is first processed using a beamformer followed by a
Kalman postfilter, which exploits the temporal correlations of the speech magnitude. The
speech presence probability is then obtained using a deep neural network mask estimator,
and its estimates are further refined through statistical spatial models defined for the noisy
speech and noise signals. The resulting clean speech and speech presence estimates are
then employed for maximum-likelihood estimation of beamformer and postfilter parameters.
This also allows for an iterative procedure with positive feedback between the estimation of
speech statistics and acoustic parameters.
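The abstract does not name the specific beamformer, but a standard distortionless choice consistent with a beamformer-plus-postfilter pipeline is the MVDR beamformer, whose weights follow from the noise spatial covariance and a steering (relative transfer function) vector. A minimal sketch with toy two-microphone values (the steering vector and covariance here are illustrative assumptions):

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """MVDR beamformer weights: w = R_n^{-1} d / (d^H R_n^{-1} d)."""
    rn_inv_d = np.linalg.solve(noise_cov, steering)
    return rn_inv_d / (steering.conj() @ rn_inv_d)

# Two-microphone example with spatially white noise.
d = np.array([1.0, np.exp(-1j * 0.5)])   # illustrative RTF-based steering vector
Rn = np.eye(2, dtype=complex)            # noise spatial covariance estimate
w = mvdr_weights(Rn, d)

# Distortionless constraint toward the target direction: w^H d = 1.
assert np.isclose(w.conj() @ d, 1.0)
```

In a recursive EM setting, Rn and d would be re-estimated each frame from the clean-speech and speech-presence estimates, closing the feedback loop described above.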
Scenarios with multiple overlapping speakers are also analyzed in this Thesis, where
beamforming with the model parameters obtained from deep neural network mask estimators
is also explored. To deal with interfering speakers, we study the use of adapted mask estimators
that exploit spectral and spatial information, obtained through auxiliary information, to
focus on a target speaker. Therefore, additional speech processing blocks are integrated into
the mask estimators so that the network can discriminate among different speakers. As an
application, we consider the problem of automatic speech recognition in meeting scenarios,
where our proposal can be used as a front-end processing stage.
Finally, we study the training of deep learning methods for speech processing using
perceptual considerations. Thus, we propose a loss function based on a perceptual quality
objective metric. We evaluate the proposed loss for training deep neural network-based singlechannel
speech enhancement algorithms in order to improve the speech quality perceived
by human listeners. The two most common approaches for single-channel processing using
these networks are considered: spectral mapping and spectral masking. We also explore the
combination of different objective metric-related loss functions in a multi-objective learning
training approach.
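The spectral-masking approach and the weighted combination of loss terms can be sketched as below. The actual thesis losses are derived from a perceptual quality metric; that derivation is metric-specific, so this sketch substitutes simple MSE and MAE surrogates purely to illustrate the masking pipeline and the multi-objective weighting (all names and weights are illustrative assumptions):

```python
import numpy as np

def spectral_masking(noisy_mag, mask):
    """Spectral masking: a mask in [0, 1], predicted by the network,
    is applied point-wise to the noisy magnitude spectrogram."""
    return np.clip(mask, 0.0, 1.0) * noisy_mag

def multi_objective_loss(est_mag, clean_mag, weights=(0.8, 0.2)):
    """Weighted combination of two magnitude-domain objectives.
    MSE and MAE stand in for the perceptual-metric-related terms."""
    mse = np.mean((est_mag - clean_mag) ** 2)
    mae = np.mean(np.abs(est_mag - clean_mag))
    return weights[0] * mse + weights[1] * mae

# Toy check: with the ideal ratio mask, the enhanced magnitude matches
# the clean one and the combined loss vanishes.
clean = np.abs(np.random.default_rng(1).normal(size=(4, 8)))
noisy = clean + 0.5                       # additive-noise toy spectrogram
ideal_mask = np.clip(clean / noisy, 0.0, 1.0)
est = spectral_masking(noisy, ideal_mask)
loss = multi_objective_loss(est, clean)
```

Spectral mapping differs only in that the network regresses the enhanced magnitude (or spectrum) directly instead of predicting a mask; the same combined loss applies to either output.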
To conclude, we would like to highlight that our contributions successfully integrate
signal processing and deep learning methods to jointly exploit spectral, spatial, and temporal
speech features. As a result, the set of proposed techniques provides us with a manifold
framework for robust speech processing under very challenging acoustic environments, thus
allowing us to improve perceptual quality, intelligibility, and distortion measures.