Talks

Query Processing for Big Data

4105 Brendan Iribe Center for Computer Science and Engineering (IRB)

Monday, February 26, 2024, 1:00-2:00 pm

Abstract

Query processing has been one of the core problems in databases since the 1970s, yet we are still working to understand the theoretical limits of many of its basic problems. Closing that gap may, in turn, bring fundamentally new ideas for processing queries in practice. In the last twenty years, the need to process and analyze big data has invigorated this long-standing research area with fresh challenges. Massively parallel data systems, such as MapReduce and Spark, have become effective tools for handling large volumes of data, so query evaluation algorithms in these systems have to be designed to scale to thousands of machines in parallel. In addition, data is generated at very high speeds, which requires the query engine to deliver timely answers over dynamic databases. Moreover, data is collected from multiple sources and contains sensitive information, so query evaluation algorithms should be oblivious to the input database and provide a privacy guarantee in the answers. Beyond the traditional goal of efficiency, my research has also aimed at equipping query evaluation algorithms in modern data analytical systems with new features, such as scalability, low latency, and privacy.

In this talk, I will focus on query evaluation for massively parallel systems and go deep into natural join queries, the most fundamental and practically important class of queries. I will describe the intrinsic relationship between the join structure and its parallel computational cost under different optimality guarantees. Finally, I will discuss some exciting connections between query evaluation and neighboring communities, such as high-performance computing, algebra, machine learning, and privacy.

Bio

Xiao Hu is a Visiting Assistant Researcher in EECS at UC Berkeley. Before that, she worked as a Postdoctoral Associate at Duke University, a Visiting Faculty Researcher at Google Research, and a Research Fellow at the Simons Institute for the Theory of Computing. She received her PhD in Computer Science and Engineering from HKUST and her BE degree from Tsinghua University. Her research focuses on fundamental problems in database theory and their implications for practical systems, including massively parallel query evaluation, dynamic query evaluation, data privacy, and learning theory in query evaluation. Her research has been published in top journals and conferences, such as JACM, SIGMOD, VLDB, PODS, ICDT, and NeurIPS. Her work has also been invited to ACM Transactions on Database Systems, the Database Principles Column in SIGMOD Record, and Logical Methods in Computer Science. Her work has received the ICDT 2024 Best Paper Award.

This talk is organized by Amol