THE DISSERTATION DEFENSE FOR THE DEGREE OF Ph.D. IN COMPUTER SCIENCE FOR
Justin McCann
Diagnosing performance degradation in distributed systems is a complex and difficult task. Software that performs well in one environment may be unusably slow in another, and determining the root cause is time-consuming and error-prone, even in environments in which all the data may be available. End users have an even more difficult time trying to diagnose system performance, since both software and network problems have the same symptom: a stalled application.
The central thesis of this dissertation is that the source of performance stalls in a distributed system can be automatically detected and diagnosed with very limited information: the dependency graph of data flows through the system, and a few counters common to almost all data processing systems. Our automated fault detection system requires as few as two bits of information per module: one to indicate whether the module is actively processing data, and one to indicate whether the module is waiting on its dependents. We prove this thesis by implementing the idea and demonstrating its effectiveness in two distinct environments: an individual host's networking stack, and a distributed streams processing system. Using real applications, we show that our approach correctly diagnoses 99% of networking-related stalls due to application, connection-specific, or network-wide performance problems, with a false positive rate under 3%. Our prototype system for diagnosing messaging stalls in a commercial streams processing system correctly finds 93% of message-processing stalls, with a false positive rate of 2%.
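As an illustration only, the two-bit idea can be sketched in a few lines of Python. This is not the dissertation's implementation: the names `ModuleState`, `suspects`, and `waits_on`, the example module names, and the rule of following "waiting" edges until the chain ends are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleState:
    active: bool   # bit 1: the module is actively processing data
    waiting: bool  # bit 2: the module is waiting on another module


def suspects(start, waits_on, state):
    """Follow 'waiting' edges from `start` through the dependency graph.

    Returns the modules where the chain of waiting ends, each labeled
    "busy" (still processing) or "idle" (neither processing nor waiting).
    The traversal rule is a hypothetical simplification.
    """
    found, seen, stack = [], set(), [start]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        downstream = waits_on.get(name, [])
        if state[name].waiting and downstream:
            # This module blames whatever it is waiting on.
            stack.extend(downstream)
        else:
            label = "busy" if state[name].active else "idle"
            found.append((name, label))
    return sorted(found)


# Hypothetical data flow: an application waiting on a host networking stack.
waits_on = {"app": ["socket"], "socket": ["tcp"], "tcp": ["link"], "link": []}
state = {
    "app": ModuleState(active=False, waiting=True),
    "socket": ModuleState(active=False, waiting=True),
    "tcp": ModuleState(active=False, waiting=True),
    "link": ModuleState(active=False, waiting=False),
}
print(suspects("app", waits_on, state))
```

In this made-up example every module from the application down to TCP is waiting, so the sketch points at the link, the one module that is neither processing nor waiting on anything else.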
Examining Committee:
Committee Chair: Dr. Michael Hicks
Dean’s Representative: Dr. Mark Shayman
Committee Members: Dr. Peter Keleher, Dr. James Reggia, Dr. Neil Spring
EVERYONE IS INVITED TO ATTEND THE PRESENTATION PORTION OF THIS DEFENSE