PhD Proposal: Coresets for the Nearest-Neighbor Rule

Alejandro Flores-Velazco

Abstract

Given a set P of labeled points in a metric space, the problem of nearest-neighbor condensation deals with finding a subset R of P such that for every point p in P the nearest-neighbor of p in R has the same label as p. This is motivated by applications in classification, where the nearest-neighbor rule is a well-known classifier that assigns to an unlabeled query point the label of its nearest-neighbor in the point set. In this context, condensation aims to reduce the size of the set needed to classify new points. This relates to the idea of coresets, which can be broadly defined as small representative subsets of the data, such that the result of querying a coreset approximates the result otherwise obtained when querying the original data. However, finding such condensed subsets of minimum cardinality is known to be NP-hard, and most research has focused on proposing practical heuristics without any performance guarantees. Additionally, in the literature, the use of exact nearest-neighbor queries is always assumed, ignoring the effect that condensation might have in the classification accuracy of the nearest-neighbor rule when queries are computed approximately.

My preliminary research addresses these shortcomings by proposing approximation-sensitive criteria for the nearest-neighbor condensation problem, along with the first practical condensation algorithms with provable performance guarantees. First, I characterize sufficient conditions to guarantee the correct classification of unlabeled points when using approximate nearest-neighbor queries on these subsets, which introduces the notion of coresets for classification with the nearest-neighbor rule. Moreover, I prove that it is NP-hard to compute subsets with these characteristics, whose cardinality approximates that of the minimum cardinality subset. Additionally, I propose new algorithms for computing such subsets, with tight approximation factors in general metric spaces, and improved factors under doubling metrics and Lp metrics with p≥2. Finally, I show an alternative implementation scheme that reduces the worst-case time complexity of one of these algorithms, becoming the first truly subquadratic approximation algorithm for the nearest-neighbor condensation problem. Moreover, experimental results confirm that the performance of these algorithms is comparable to existing state-of-the-art heuristics for condensation.

Building upon these results, there are some extensions and additional research avenues that I intend to pursue. First, improved analyses on the guarantees of existing algorithms, and extensions of the condensation criteria to consider the use of k nearest-neighbors, instead of one. Additionally, the advent of big data presents new challenges when working with large data sets, which presses to consider the problem of nearest-neighbor condensation in the streaming and online models.

Examining Committee:

My preliminary research addresses these shortcomings by proposing approximation-sensitive criteria for the nearest-neighbor condensation problem, along with the first practical condensation algorithms with provable performance guarantees. First, I characterize sufficient conditions to guarantee the correct classification of unlabeled points when using approximate nearest-neighbor queries on these subsets, which introduces the notion of coresets for classification with the nearest-neighbor rule. Moreover, I prove that it is NP-hard to compute subsets with these characteristics, whose cardinality approximates that of the minimum cardinality subset. Additionally, I propose new algorithms for computing such subsets, with tight approximation factors in general metric spaces, and improved factors under doubling metrics and Lp metrics with p≥2. Finally, I show an alternative implementation scheme that reduces the worst-case time complexity of one of these algorithms, becoming the first truly subquadratic approximation algorithm for the nearest-neighbor condensation problem. Moreover, experimental results confirm that the performance of these algorithms is comparable to existing state-of-the-art heuristics for condensation.

Building upon these results, there are some extensions and additional research avenues that I intend to pursue. First, improved analyses on the guarantees of existing algorithms, and extensions of the condensation criteria to consider the use of k nearest-neighbors, instead of one. Additionally, the advent of big data presents new challenges when working with large data sets, which presses to consider the problem of nearest-neighbor condensation in the streaming and online models.

Examining Committee:

Chair: Dr. David Mount

Dept rep: Dr. John Dickerson

Members: Dr. MohammadTaghi HajiAghayi

Dept rep: Dr. John Dickerson

Members: Dr. MohammadTaghi HajiAghayi

Bio

I'm a PhD student in Computer Science at UMD, working in Computational Geometry and related areas. I completed my undergrad studies at Universidad Simon Bolivar, Venezuela.

This talk is organized by Tom Hurst