In this work we propose a three-pronged investigation into algorithmic foundations and applications. The first area is string algorithms for computational biology and metagenomics specifically where the data deluge from next generation sequencing drives the necessity for new algorithms that are both fast and accurate. For metagenomic analysis, we discuss ongoing work on a tool for clustering a microbial marker gene, the 16S ribosomal RNA gene. Looking more broadly, we also consider more theoretical work on string comparison, describing our prior work on edit distance computation and the maximum duo-preservation string mapping (MPSM) problem. We then propose additional problems in string algorithms that we plan to explore. A key challenge at the heart of these problems is the stochasticity of both DNA sequencing errors and biological mutations.
The second area is constrained clustering where we explore classical clustering problems with additional constraints on which data points can or cannot be clustered together. Utilizing these constraints is important for many clustering problems because they can be used to ensure fairness, exploit expert advice, or capture natural properties of the data. In simplest case, this can mean some pairs of points have ``must-link'' constraints requiring that that they must be clustered together. Moving into stochastic settings, we can describe more general pairwise constraints such as bounding the probability that two points are separated into different clusters. This lets us introduce a new notion of fairness for clustering and address stochastic problems such as semi-supervised learning with advice from imperfect experts.
The third area is online matching algorithms for e-commerce applications such as online sales and advertising. The importance of e-commerce in modern business cannot be overstated and even minor algorithmic improvements can have huge impacts. In online matching problems we generally have a known offline set of goods or advertisements while users arrive online and allocations must be made immediately and irrevocably when a user arrives. However, in the real world, there is uncertainty about a user's true interests and this can be modeled by considering matching problems in a graph with stochastic edges that only have a probability of existing. These edges represent the probability of a user purchasing a product or clicking on an ad. Surprisingly, there are many natural and well-motivated stochastic generalizations of online matching that are still under-explored. We describe past and future work on addressing this.
Dept. rep: Dr. Jordan Boyd-Graber
Members: Dr. Samir Khuller