Talks

When the Rubber Meets the Road: Data Science on Track

Lei Cao

https://umd.zoom.us/j/94543765116?pwd=clY3MVV5Z1g4T2xpdnJMdjFiMFhYdz09

Wednesday, March 3, 2021, 1:00-2:00 pm

Abstract

Many data scientists prefer high-level, end-to-end interfaces, such as SQL databases, to make sense of data, since these interfaces abstract away low-level, time-consuming engineering details. However, apart from SQL databases, few tools for data scientists today offer such high-level interfaces. The goal of my research is to bridge this gap by developing systems and algorithms that automatically address low-level performance and scaling bottlenecks at every step in the data science pipeline, while still making it easy to incorporate domain-specific requirements.

My talk will cover two systems we have built, an anomaly discovery system and a labeling system, which solve fundamental problems in unsupervised and supervised machine learning, respectively. First, AutoAD, the self-tuning component of our anomaly discovery system, aims to free data scientists from manually determining which of the many unsupervised anomaly detection techniques is best suited to a given task, and from tuning the parameters of each candidate method. This is particularly challenging in the unsupervised setting, where no labels are available for cross-validation. AutoAD solves this problem with a fundamentally new strategy that unifies the merits of unsupervised anomaly detection and supervised classification. Second, our LANCET approach solves the labeling problem, a key bottleneck that limits the success of cutting-edge machine learning techniques in enterprise deployments. These techniques often require millions of labeled data objects to train a robust model. Because relying on humans to supply such a huge number of labels is rarely practical, automated methods for label generation are needed. Unfortunately, critical challenges in auto-labeling remain unsolved, including: (1) which objects to ask humans to label, (2) how to automatically propagate labels to other objects, and (3) when to stop labeling. LANCET addresses all three challenges in an integrated framework built on a solid theoretical foundation that characterizes the properties a labeled dataset must satisfy to train an effective prediction model.

Bio

Lei Cao has been a Postdoctoral Associate at the Computer Science and Artificial Intelligence Laboratory of MIT since November 2016, working with Prof. Samuel Madden and Prof. Michael Stonebraker. Before that, he worked at the IBM T.J. Watson Research Center as a Research Staff Member. He received his Ph.D. degree in Computer Science from Worcester Polytechnic Institute, supervised by Prof. Elke Rundensteiner. He has conducted research in the broad areas of data science and systems, ranging from low-level core database performance optimization to the design of high-level, application-specific machine learning techniques. His recent research falls in the emerging area of "Systems for AI and AI for Systems", focusing on designing scalable algorithms and systems that let data scientists effectively and efficiently explore and discover knowledge, especially anomalies, from heterogeneous data sources.

This talk is organized by Richa Mathur