PhD Proposal: Reducing False Positives of Static Code Analysis
Ugur Koc
Friday, May 11, 2018, 11:00 am-1:00 pm
Abstract

The large scale of modern software systems and the high complexity of modern programming languages make perfectly precise static code analysis (SCA) infeasible. Consequently, SCA tools often make assumptions and approximations for scalability and practicality. Moreover, they sometimes over-approximate so as not to miss any real problems, i.e., to remain sound. These decisions, however, come at the expense of false alarms, which in practice reduce the usability of these tools.

For my thesis, I propose to study and develop machine learning-based approaches for reducing the false positives of SCA. In particular, the proposed research plan has the following streams of work: construction and analysis of a ground-truth dataset of SCA findings; design and development of code transformation techniques; study and development of neural network models that can learn from code to build false positive classifiers; and evaluation of the proposed approaches and developed tools.

The construction and analysis of a ground-truth dataset aims at creating a database of labeled SCA findings and at identifying and documenting the code patterns that cause state-of-the-art SCA tools to emit false positives, so that we can design more effective data preparation and learning techniques targeted at isolating and learning such patterns. Such a ground-truth dataset will also enable the evaluation of ML-based program analysis research and tools, including our own work. Existing SCA benchmark suites are suited to neither of these purposes because they contain small numbers of tiny programs and are not representative of actual practice. The resulting dataset will contain a large number of SCA findings for real-world programs, along with labels representing the best understanding of the correctness of those findings (i.e., the ground truth).
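To make the dataset concrete, one record per labeled finding might look like the following minimal sketch. The field names (and the example values) are purely illustrative assumptions, not the actual dataset schema:

```python
from dataclasses import dataclass


@dataclass
class GroundTruthFinding:
    """One labeled SCA finding; fields are illustrative, not the real schema."""
    tool: str               # SCA tool that produced the finding
    program: str            # real-world program that was analyzed
    checker: str            # rule/checker that fired
    location: str           # file:line of the reported issue
    is_true_positive: bool  # manually assigned ground-truth label


# Hypothetical example entry: a finding judged to be a false positive.
finding = GroundTruthFinding(
    tool="ExampleSCATool",
    program="example-app",
    checker="NULL_DEREFERENCE",
    location="Dao.java:42",
    is_true_positive=False,
)
print(finding.checker, finding.is_true_positive)
```

A collection of such records, labeled by manual inspection, is what would serve both as training data for the classifiers and as an evaluation benchmark.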

Next, the design and development of code transformations aims at producing well-defined transformation routines that prepare code for the learning step. The first set of transformations will extract the subset of a program's code base that is relevant to a given SCA finding. The second set will project the reduced code onto a generic space, free of program-specific words, so that models do not memorize those words during training and overfit. The last set will tokenize the code. After these steps, the processed code should contain the root causes of the SCA finding with a minimum of irrelevant parts, which would otherwise act as noise during learning.
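The second and third transformation steps can be sketched as follows. This is a toy illustration under simplifying assumptions (a regex tokenizer and a tiny keyword list), not the actual transformation routines:

```python
import re

# Tiny illustrative keyword set; a real pass would cover the full language.
_KEYWORDS = {"if", "else", "return", "null", "while", "for"}


def tokenize(snippet):
    # Split code into identifier/keyword tokens and punctuation tokens.
    return re.findall(r"[A-Za-z_]\w*|==|\S", snippet)


def abstract_identifiers(tokens):
    # Project program-specific names onto generic placeholders (VAR1, VAR2, ...)
    # so the model cannot memorize them during training.
    mapping, out = {}, []
    for tok in tokens:
        if tok.isidentifier() and tok not in _KEYWORDS:
            mapping.setdefault(tok, f"VAR{len(mapping) + 1}")
            out.append(mapping[tok])
        else:
            out.append(tok)
    return out


print(abstract_identifiers(tokenize("if (fileName == null) return;")))
# → ['if', '(', 'VAR1', '==', 'null', ')', 'return', ';']
```

The resulting generic token sequence is what would be fed to the learning step; two programs using different variable names for the same buggy pattern map to the same sequence.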

Next, the study and development of neural networks aims at understanding and building neural network-based language models that can learn from code to classify SCA findings. In an initial case study of a widely used Java SCA tool, we showed that a simple LSTM model can effectively classify the false positives produced by that tool. Furthermore, we believe neural network models specifically designed for learning from code can achieve even better performance.
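The forward pass of such a classifier can be illustrated with a pure-Python sketch: a single-unit scalar LSTM consumes the token-ID sequence of a finding, and a sigmoid read-out thresholds the final hidden state. This is a toy with made-up, untrained weights, not the actual model, which would use learned embeddings and many hidden units:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, w):
    # One step of a single-unit LSTM cell (scalar weights for readability).
    i = sigmoid(w["wi"] * x + w["ui"] * h + w["bi"])    # input gate
    f = sigmoid(w["wf"] * x + w["uf"] * h + w["bf"])    # forget gate
    o = sigmoid(w["wo"] * x + w["uo"] * h + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h + w["bg"])  # candidate state
    c = f * c + i * g
    h = o * math.tanh(c)
    return h, c


def classify_finding(token_ids, w, threshold=0.5):
    # Run the token sequence through the LSTM, then a sigmoid read-out;
    # True means "predicted false positive" in this sketch.
    h = c = 0.0
    for x in token_ids:
        h, c = lstm_step(x, h, c, w)
    return sigmoid(w["wout"] * h + w["bout"]) >= threshold


# Toy, untrained weights purely for illustration.
weights = {k: 0.1 for k in
           ["wi", "ui", "bi", "wf", "uf", "bf",
            "wo", "uo", "bo", "wg", "ug", "bg", "wout", "bout"]}
print(classify_finding([3, 1, 4, 1, 5], weights))
```

In practice the weights would be trained end to end on the labeled ground-truth dataset, with the transformed token sequences as input.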

Finally, we will evaluate the performance of the proposed approaches and developed tools by conducting experiments on the ground-truth dataset.

Examining Committee: 
 
                          Chair:              Dr. Adam Porter
                          Co-chair:         Dr. Jeffrey Foster
                          Dept. rep:        Dr. Marine Carpuat
                          Members:        Dr. Mayur Naik
This talk is organized by Tom Hurst