Articulatory phonological approach to address acoustic variability in Automatic Speech Recognition
Wednesday, April 27, 2016, 11:00 am-12:00 pm
Abstract

The past decade has seen tremendous improvement in Automatic Speech Recognition (ASR) systems due to the Deep Neural Network (DNN) revolution. In spite of this phenomenal improvement in performance, the state of the art still lags behind human speech recognition. This gap is predominantly due to a lack of robustness to speech variability. Speech acoustic patterns vary significantly as a result of coarticulation and lenition processes, which are shaped by segmental context and by performance factors such as speaking rate and degree of casualness. The resulting acoustic variability continues to pose serious challenges for the development of ASR systems.

Articulatory phonology analyzes speech as a constellation of coordinated articulatory gestures performed by the articulators of the vocal tract (lips, tongue tip, tongue body, jaw, glottis, and velum). According to this theory, acoustic variability can be explained by the temporal overlap of gestures and their reduction in space: coarticulation and lenition arise from the overlap of neighboring gestures. In order to analyze speech in terms of articulatory gestures, we must first estimate them from the speech signal.
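The overlap idea can be illustrated with a toy sketch (my own illustration, not material from the talk): two gestures whose activation intervals overlap in time compete for the same tract variable, so during the overlap the effective constriction target is a blend of the two individual targets. The interval boundaries and target values below are hypothetical.

```python
# Toy illustration of gestural overlap in articulatory phonology.
# Two gestures share one tract variable (TV); during their temporal
# overlap the TV target is a blend of the individual targets, which
# is one way context-dependent (coarticulated) acoustics arise.
# All numbers here are hypothetical, for illustration only.
import numpy as np

t = np.linspace(0.0, 1.0, 101)                    # normalized time
act1 = ((t >= 0.0) & (t <= 0.6)).astype(float)    # gesture 1 activation
act2 = ((t >= 0.4) & (t <= 1.0)).astype(float)    # gesture 2 overlaps gesture 1
target1, target2 = -2.0, 3.0                      # hypothetical TV targets (mm)

total = act1 + act2
# Blend targets in proportion to activation; 0.0 (neutral) when neither is active.
tv_target = np.where(total > 0,
                     (act1 * target1 + act2 * target2) / np.maximum(total, 1e-9),
                     0.0)
```

Outside the overlap the tract variable follows a single gesture's target; inside it (0.4 ≤ t ≤ 0.6) the target is the average of the two, so neither gesture's canonical acoustic pattern is fully realized.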

We propose to characterize articulatory features as vocal tract constriction variables, also known as Tract Variables (TVs). Artificial Neural Network based systems, called speech inversion systems, are trained to estimate the TVs from the speech signal. We then develop speaker normalization schemes that address speaker variability in both the acoustic and articulatory domains, in order to perform reliable speaker-independent speech inversion. We evaluate the speech inversion system on an articulatory dataset containing speech rate variations to see whether the model can reliably predict the TVs in challenging coarticulatory scenarios. This talk will present a few examples of coarticulated speech analyzed using the TVs estimated by our speech inversion system. Finally, we propose a couple of possible approaches for recovering articulatory gestures from speech in order to develop a gesture-based ASR system.
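The core of a speech inversion system is a learned regression from acoustic feature frames to tract variables. The following is a minimal sketch of that idea, not the speaker's actual system: a one-hidden-layer network trained by gradient descent on a synthetic acoustic-to-TV mapping. The feature dimensions, network size, and data are all placeholders standing in for features and TV targets one would extract from a real articulatory corpus.

```python
# Minimal sketch of ANN-based speech inversion (acoustic frames -> TVs).
# Dimensions and data are synthetic placeholders, not the real system.
import numpy as np

rng = np.random.default_rng(0)
N_FRAMES, N_ACOUSTIC, N_TVS, HIDDEN = 512, 13, 6, 32  # toy sizes

# Synthetic stand-in for (acoustic frame, TV target) training pairs.
X = rng.normal(size=(N_FRAMES, N_ACOUSTIC))
true_W = rng.normal(size=(N_ACOUSTIC, N_TVS))
Y = np.tanh(X @ true_W)  # smooth nonlinear "ground-truth" mapping

# One-hidden-layer network trained with a squared-error objective.
W1 = rng.normal(scale=0.1, size=(N_ACOUSTIC, HIDDEN)); b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.1, size=(HIDDEN, N_TVS));      b2 = np.zeros(N_TVS)

lr, losses = 0.05, []
for step in range(200):
    H = np.tanh(X @ W1 + b1)          # hidden activations
    pred = H @ W2 + b2                # estimated TVs
    err = pred - Y
    losses.append(float(np.mean(err ** 2)))
    # Backpropagation (gradients of the per-frame squared error).
    dpred = 2.0 * err / N_FRAMES
    dW2 = H.T @ dpred;  db2 = dpred.sum(axis=0)
    dH = (dpred @ W2.T) * (1.0 - H ** 2)
    dW1 = X.T @ dH;     db1 = dH.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"MSE: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

A real system would train on frames from an articulatory corpus (acoustic features paired with measured constriction variables) and would add the speaker normalization the abstract describes; this sketch only shows the regression step.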

Bio

Ganesh Sivaraman is a PhD candidate in the ECE department at the University of Maryland, College Park. His PhD advisor is Prof. Carol Espy-Wilson. His research interests are in speech production, articulatory phonology, Automatic Speech Recognition, and applied machine learning.

This talk is organized by Naomi Feldman.