Speech recognition has advanced greatly through supervised models trained on huge amounts of labelled data. However, for many languages, annotating speech is expensive and sometimes impossible, e.g., when dealing with endangered or unwritten languages. There is therefore growing interest in methods that can learn directly from raw speech, without access to transcriptions, pronunciation dictionaries, or language modelling text. There is also interest in learning from unlabelled speech paired with another modality (such as images). Both of these settings are important for developing speech systems in low-resource scenarios, and for potentially shedding light on language acquisition in humans.
In the first part of the talk, I will introduce our unsupervised segmental Bayesian model, which segments and clusters unlabelled speech into word-like units. This system is trained directly on raw speech, without access to any transcriptions. In the second part of the talk, I will present our recent work, which uses images paired with spoken captions to ground unlabelled speech. Without seeing any parallel speech and text, the resulting neural network model can be used as a keyword spotter, predicting which utterances in a speech collection contain a given textual keyword.
Herman is currently a postdoc at TTI-Chicago, working with Karen Livescu and Greg Shakhnarovich. He completed his PhD at the University of Edinburgh, where he was supervised by Sharon Goldwater, Aren Jansen and Simon King. Before that, he was a research associate at Stellenbosch University, South Africa. His main interests are in unsupervised and low-resource machine learning methods, in particular applied to problems in speech, vision, and language processing.