Speech and Audio Developments and Challenges in Industry
Sefik Emre Eskimez
IRB 4105 or Zoom https://umd.zoom.us/j/5812269672?pwd=EitPbAg5hk05MV1yh7Zby0Ej5UF5mP.1&omn=97688642680
Thursday, April 16, 2026, 12:30-1:30 pm
Abstract

This talk examines how conversational voice agents are built in industry, tracing the evolution from cascaded pipelines (ASR → LLM → TTS) to thinker-talker and end-to-end architectures. We discuss key design decisions including audio representations, turn-taking, and full-duplex interaction, and address the practical challenges that separate research from production: training data at scale, evaluation beyond offline metrics, robustness under real-world conditions, echo cancellation, noise handling, and latency budgeting.
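As a rough illustration of the cascaded design the talk starts from, the sketch below wires three stages in sequence. All function bodies are placeholder stubs invented for this example (a real agent would call streaming ASR, LLM, and TTS services); the point is that each stage blocks on the previous one, so per-stage latencies add up, which motivates the tighter thinker-talker and end-to-end designs the talk covers.

```python
# Minimal sketch of a cascaded voice-agent turn (ASR -> LLM -> TTS).
# All stages are hypothetical stubs for illustration only.

def asr(audio: bytes) -> str:
    """Stub speech recognizer: audio bytes -> transcript."""
    return "what time is it"

def llm(transcript: str) -> str:
    """Stub language model: transcript -> response text."""
    return f"You asked: {transcript}."

def tts(text: str) -> bytes:
    """Stub synthesizer: response text -> audio bytes."""
    return text.encode("utf-8")

def cascaded_turn(audio_in: bytes) -> bytes:
    """One conversational turn. Each stage waits for the previous
    one to finish, so end-to-end latency is the sum of the three
    stage latencies -- the core cost of the cascaded design."""
    transcript = asr(audio_in)
    response = llm(transcript)
    return tts(response)

print(cascaded_turn(b"<mic audio>"))
```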

Bio

Sefik Emre Eskimez is a Research Engineer at Sesame and a former Principal Researcher at Microsoft, with over 10 years of experience in deep learning and machine learning. His research focuses on speech and audio processing, including text-to-speech synthesis, speech enhancement, generative models, and conversational voice AI. At Microsoft, he led the development and deployment of Voice Isolation (Personalized Speech Enhancement) for Microsoft Teams, a real-time system that extracts a target speaker's voice from noisy audio and is now used by millions of users worldwide. He also contributed to multiple state-of-the-art TTS systems, including E2 TTS and SpeechX. At Sesame, he works on next-generation conversational voice AI, including the Conversational Speech Model (CSM) and turn-taking research aimed at making voice interactions more natural and human-like. He holds a Ph.D. in Electrical and Computer Engineering from the University of Rochester, where his dissertation focused on generating emotionally expressive talking faces from speech. He has published in top venues including ICASSP, Interspeech, and IEEE/ACM Transactions on Audio, Speech, and Language Processing, and holds multiple U.S. patents.

This talk is organized by Samuel Malede Zewdu