The complexity of many analytics workflows has outgrown the capabilities of current platforms. While modern solutions can still showcase their advantages on a subset of analytics applications, they require specific data formats and query inputs, and they can utilize (and optimize) only their own custom execution engine. As a result, they fail to cope with the data and task heterogeneity of modern workflows. The need for a multi-engine approach that splits and coordinates workflow execution among multiple collaborating engines and datastores has recently been recognized and is gaining increasing attention. In this talk I will present our two recent systems for optimizing multi-engine workflows, namely IReS (Intelligent Resource Scheduler) and MuSQLE (Distributed SQL Query Execution Over Multiple Engine Environments).
IReS optimizes a workflow with respect to a user-defined policy, relying on cost and performance models of the required tasks over the available platforms. This optimization consists of allocating distinct workflow parts to the most advantageous execution and/or storage engine among the available ones and deciding on the exact amount of resources to provision. MuSQLE efficiently integrates engines for SQL-based workflows, allowing for both intra-engine and inter-engine optimizations. It adopts a novel API-based strategy: instead of manual integration, MuSQLE specifies a generic API, used for cost estimation and query execution, that needs to be implemented for each SQL engine endpoint. Our system is integrated with a state-of-the-art query optimizer, adding support for location-based, multi-engine query optimization and letting individual runtimes perform sub-query physical optimization. The derived multi-engine plans are executed using the Spark distributed execution framework.
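As a rough illustration of the API-based strategy described above, the following Python sketch shows what a per-engine endpoint offering cost estimation and query execution might look like. It is not MuSQLE's actual API: the names `EngineEndpoint`, `ToyEngine`, `estimate_cost`, `execute`, and `cheapest_engine`, as well as the length-based cost model, are assumptions made purely for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Sequence


class EngineEndpoint(ABC):
    """Hypothetical generic API that each SQL engine endpoint would implement."""

    name: str

    @abstractmethod
    def estimate_cost(self, sql: str) -> float:
        """Return an estimated cost for running `sql` on this engine."""

    @abstractmethod
    def execute(self, sql: str) -> str:
        """Run `sql` on this engine and return a result handle."""


@dataclass
class ToyEngine(EngineEndpoint):
    """Stand-in engine with a made-up cost model (assumption, not a real engine)."""

    name: str
    cost_per_char: float

    def estimate_cost(self, sql: str) -> float:
        # Toy model: cost grows with the length of the query text.
        return self.cost_per_char * len(sql)

    def execute(self, sql: str) -> str:
        # No real execution happens here; we only report where it would run.
        return f"{self.name} ran: {sql}"


def cheapest_engine(engines: Sequence[EngineEndpoint], sql: str) -> EngineEndpoint:
    """Pick the endpoint with the lowest estimated cost for a sub-query."""
    return min(engines, key=lambda engine: engine.estimate_cost(sql))
```

With endpoints `ToyEngine("spark", 2.0)` and `ToyEngine("postgres", 0.5)`, `cheapest_engine` would select the `postgres` endpoint for any non-empty sub-query, since its estimated cost is lower. A real multi-engine optimizer would also have to account for data location and transfer costs, which this sketch ignores.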
Dimitrios Tsoumakos is an Assistant Professor in the Department of Informatics of the Ionian University. He is also a collaborating researcher with the Computing Systems Laboratory of the National Technical University of Athens (NTUA). He received his Diploma in Electrical and Computer Engineering from NTUA in 1999, then joined the graduate program in Computer Sciences at the University of Maryland in 2000, where he received his M.Sc. (2002) and Ph.D. (2006). His research interests relate to both the systems and the data management aspects of distributed, large-scale systems. Currently, he is focusing on algorithms for big-data management, in particular the optimization of multi-engine analytics, RDF graphs, and data compression.