Talks

Storage-Compression-Querying Tradeoffs in Dataset Versioning

Amol Deshpande

IRB 0318

Friday, October 8, 2021, 11:00 am-12:00 pm

Abstract

Also on Zoom - https://umd.zoom.us/j/96718034173?pwd=clNJRks5SzNUcGVxYmxkcVJGNDB4dz09

Our ability to collect data continues to grow at an exponential rate; combine this with the abundance of local compute and storage capacity, increasingly decentralized teams of data analysts, and the almost-innate fear of ever deleting anything, and the result is a proliferation of thousands or even millions of versions of highly similar datasets in most enterprises. This not only increases storage and network costs, but also quickly becomes unmanageable because it is difficult to maintain sufficient context, such as dataset provenance. Data compression by itself is typically not sufficient to address these challenges, in part because we often need to retrieve or query specific datasets or portions thereof, and in part because the data is usually stored in distributed cloud-based (semi-)structured data management systems. In this talk, I will discuss our work over the last decade on systematically understanding the storage/retrieval/query tradeoffs in this context, and describe how different use cases, computing environments, and data types lead to different solutions. I will also discuss how we can enable new types of introspective analyses of data evolution and data processing pipelines, and outline future research directions.

Bio

Amol Deshpande is a Professor in the Department of Computer Science at the University of Maryland with a joint appointment in the University of Maryland Institute for Advanced Computer Studies (UMIACS). He received his Ph.D. from the University of California at Berkeley in 2004. His research interests include collaborative data science platforms, provenance, uncertain data management, adaptive query processing, data streams, graph analytics, and sensor networks. He is a recipient of an NSF CAREER award and has received best paper awards at the VLDB 2004, EWSN 2008, and VLDB 2009 conferences. He is also a Co-Founder and Chief Scientist at WireWheel, Inc., which is building a comprehensive platform to help companies comply with data privacy regulations such as GDPR and CCPA.

This talk is organized by Richa Mathur.