PhD Proposal: Towards Unified Metadata Management of Data Lake Systems
Keonwoo Oh
IRB-5137 https://umd.zoom.us/j/6060059703?pwd=XijRCLNpxDIo1kUPoFbn3dYza0DQRS.1&omn=97508823749&jst=2
Wednesday, November 19, 2025, 11:00 am-12:30 pm
Abstract

Rapid growth in the volume and heterogeneity of data, the advent of new hardware architectures, and demand for new use cases pose various challenges to the metadata management of large-scale data systems. Despite its importance, metadata has received relatively little attention in the database community, on the grounds that metadata is, after all, also data and that the same principles that govern data management also apply to metadata management.

However, as both the scale and diversity of data systems increase over time, we observe a divergence between metadata management and data management in various functional and system-level requirements. One unique challenge of metadata management is integration across disparate systems, the lack of which can result in issues including data duplication, missing consistency and integrity guarantees, and loss of important contextual information. As such, we aim to construct unified frameworks and systems that address these challenges holistically while allowing the flexibility needed to incorporate new systems in the future.

We first present TreeCat, a standalone catalog engine whose unified architecture enables low-latency metadata operations and advanced transaction processing, neither of which is currently supported by state-of-the-art lakehouse systems. Next, we identify the garbage collection problem in the context of data lakes and present the preliminary design of a system that is currently under development. Lastly, we investigate performance challenges of storing and managing multimodal data sets in tabular format. For the latter two projects, we propose comprehensive future plans and identify key milestones.

Bio

Keonwoo Oh is a PhD student in Computer Science at the University of Maryland, advised by Prof. Amol Deshpande. His research focuses on metadata management of database systems.

This talk is organized by Migo Gui