Analyzing interconnection structures among underlying entities or objects in a dataset through the use of graph analytics has been shown to provide tremendous value in many application domains. This value has been gained through the use of a variety of graph analysis tasks and graph algorithms (e.g., community detection, influence propagation, network evolution, anomaly detection, centrality analysis etc.), which are relevant in a variety of different market product applications as well as scientific domains. In recent years, there have been a variety of specialized graph data management systems, and analysis frameworks proposed, which have made significant advances in efficiently storing and analyzing graph-structured data. However, the vast majority of the data users are interested in analyzing is not stored in a native graph store but likely in traditional relational databases, or similar structured storage systems like key-value stores, governed by some sort of schema. The cumbersome ETL process required to move data to specialized systems creates barriers that slow down and discourage graph analytics; our vision is to break these barriers.
We propose the development of a system called GraphGen, implemented as a “bolt-on layer” on top of loosely-structured databases, which provides a unified solution to graph analytics that would open up both graph extraction and analysis over any dataset, regardless of how it is stored. Through leveraging an expressive declarative language, GraphGen enables efficient graph analytics over “hidden” graphs that exist within datasets. The graphs are “hidden” in the sense that the edge relationships are not explicitly stored, but can be derived from combinations of the separate portions of the dataset, usually through joins. To achieve this we develop a series of in-memory representations that deal with large-output joins and edge duplication issues inherent to the underlying database and showcase the benefits and trade-offs entailed.
Declarative language abstractions, apart from being intuitive for users, also introduce great potential for transparent optimizations across many aspects of the analysis. We propose both graph definition (GraphGenDL) and graph querying (GraphGenQL) language abstractions and study optimizations across both tasks. In order to make our case, we focus on a specific type of pattern counting queries and showcase orders of magnitude performance improvements made possible by the combination of the two language abstractions.
Finally, moving towards the above vision, we propose to build an optimization framework that (a) dissects these graph analytics workloads into multiple tasks, and makes automatic decisions regarding whether each task should be handed off to the underlying database, or be executed within the GraphGen layer, (b) pre-materializes portions of often-queried graphs and deals with efficiently keeping them up-to-date after updates to the underlying database, and (c) optimizes subgraph or ego-graph analysis by interleaving extraction and analysis across sets of subgraphs.
Dean's rep: Dr. Pete Keleher
Members: Dr. Daniel Abadi