PhD Proposal: Sublinear Algorithms for Processing Massive Datasets

Hamed Saleh

Remote

Abstract

Designing algorithms which are sublinear in space is an inevitable scenario when the input data does not fit in the memory of available systems. In recent years, with the abundance of data and the increasing demand for large scale data processing, the size of datasets easily surpasses that of the typical machine’s memory. Examples of the problem domain vary from social network graphs to DNA sequences. However, to address any problem subject to this restriction efficiently, we need alternative models of computation as the traditional RAM model is no longer an option. In this manuscript, we study sublinear space algorithms in two computation models, Massively Parallel Computation and Streaming, that enable us to overcome this challenge.

During the last decade, the Massively Parallel computation (MPC) model attracted a considerable amount of attention. The MPC model was originally proposed to provide a theoretical foundation to algorithms implemented on modern large scale data processing frameworks such as MapReduce, Hadoop, and Spark. The key idea behind this model, and also aforementioned frameworks, is to use many machines to compensate for the shortage of space in individual machines. In this model, the data is distributed among a set of machines each with a sublinear memory, and the process is consisted of several rounds. In each round, machines perform an arbitrary amount of computation on their local data independently. At the end of each round, machines can communicate with each other. We study the Edge Coloring problem in this model, and we show a constant-round MPC algorithm can produce a proper coloring with almost optimal number of colors. In addition, we come up with truly-sublinear MPC algorithms for multiple variants of the String Matching problem in the presence of different wildcards {‘?’, ‘+’, ‘*’}.

The Streaming model is yet another approach that allows us to process massive data in a sublinear space. The streaming model was introduced earlier in comparison to MPC, and it is a well-known model with a plethora of results already known about it. In contrast to MPC, there is only one machine in the streaming model. The process is consisted of several passes, and the input entries arrive sequentially and one by one in each pass. The machine needs to manage its sublinear space while processing the entries upon arrival. The order by which the entries arrive could be either random or adversarial. We study the Edge Coloring problem in both random and adversarial order, and we show there are 1-pass algorithms in the special “W-streaming” model, with different guarantees for the number of colors used in each case. In the W-streaming model, the output is also returned in a streaming fashion since the output of the edge coloring problem has the same size as its input.

Examining Committee:

During the last decade, the Massively Parallel computation (MPC) model attracted a considerable amount of attention. The MPC model was originally proposed to provide a theoretical foundation to algorithms implemented on modern large scale data processing frameworks such as MapReduce, Hadoop, and Spark. The key idea behind this model, and also aforementioned frameworks, is to use many machines to compensate for the shortage of space in individual machines. In this model, the data is distributed among a set of machines each with a sublinear memory, and the process is consisted of several rounds. In each round, machines perform an arbitrary amount of computation on their local data independently. At the end of each round, machines can communicate with each other. We study the Edge Coloring problem in this model, and we show a constant-round MPC algorithm can produce a proper coloring with almost optimal number of colors. In addition, we come up with truly-sublinear MPC algorithms for multiple variants of the String Matching problem in the presence of different wildcards {‘?’, ‘+’, ‘*’}.

The Streaming model is yet another approach that allows us to process massive data in a sublinear space. The streaming model was introduced earlier in comparison to MPC, and it is a well-known model with a plethora of results already known about it. In contrast to MPC, there is only one machine in the streaming model. The process is consisted of several passes, and the input entries arrive sequentially and one by one in each pass. The machine needs to manage its sublinear space while processing the entries upon arrival. The order by which the entries arrive could be either random or adversarial. We study the Edge Coloring problem in both random and adversarial order, and we show there are 1-pass algorithms in the special “W-streaming” model, with different guarantees for the number of colors used in each case. In the W-streaming model, the output is also returned in a streaming fashion since the output of the edge coloring problem has the same size as its input.

Examining Committee:

Chair: Dr. MohammadTaghi Hajiaghayi

Dept rep: Dr. John Dickerson

Members: Dr. David Mount

Dept rep: Dr. John Dickerson

Members: Dr. David Mount

Bio

Hamed Saleh is a PhD student in the CS Department at the University of Maryland, working under the supervision of Prof. Hajiaghayi. His research interests includes algorithmic problems in distributed/parallel models of computation, especially models with sublinear memory such as MPC and Streaming.

This talk is organized by Tom Hurst