Also on Zoom - https://umd.zoom.us/j/96718034173?pwd=clNJRks5SzNUcGVxYmxkcVJGNDB4dz09
Our ability to collect data continues to grow at an exponential rate. Combine this with abundant local compute and storage capacity, increasingly decentralized teams of data analysts, and an almost innate fear of ever deleting anything, and the result in most enterprises is a proliferation of thousands or even millions of versions of nearly identical datasets. This proliferation not only increases storage and network costs, but also quickly becomes unmanageable because it is difficult to maintain sufficient context, such as dataset provenance. Data compression is typically not sufficient by itself to address these challenges, in part because we often need to retrieve or query specific datasets or portions thereof, and in part because the data is usually stored in distributed cloud-based (semi-)structured data management systems. In this talk, I will discuss our work over the last decade on systematically understanding the storage/retrieval/query tradeoffs in this context, and describe how different use cases, computing environments, and data types lead to different solutions. I will also discuss how we can enable new types of introspective analyses of data evolution and data processing pipelines, and outline future research directions.
Amol Deshpande is a Professor in the Department of Computer Science at the University of Maryland with a joint appointment in the University of Maryland Institute for Advanced Computer Studies (UMIACS). He received his Ph.D. from the University of California at Berkeley in 2004. His research interests include collaborative data science platforms, provenance, uncertain data management, adaptive query processing, data streams, graph analytics, and sensor networks. He is a recipient of an NSF CAREER Award and has received best paper awards at the VLDB 2004, EWSN 2008, and VLDB 2009 conferences. He is also a Co-Founder and Chief Scientist at WireWheel, Inc., which is building a comprehensive platform to help companies comply with data privacy regulations such as GDPR and CCPA.