Scaling Binary Corpus Analysis: Assemblage and Reversi
virtual: https://umd.zoom.us/j/93825468763?pwd=UXlYZmVkVndXb1owMkpYb2tOQjZFQT09
Monday, March 28, 2022, 12:30-1:30 pm

Binary analysis, decompilation, and reverse engineering are important topics in security. Until we achieve fully abstract compilation, there will always be a need to employ some level of sophistication to understand binaries as more than their source counterparts. Unfortunately, binary analysis and decompilation feel stuck in the stone age: production tools (IDA Pro, Ghidra) are often huge black boxes that allow little introspection into their workings and give the user no general way to control the tool's decisions.

In this talk we will discuss two pieces of work: Assemblage and Reversi. Assemblage is a binary corpus construction, reproduction, and archival system. Assemblages are files containing metadata that a running Assemblage instance can use to download, configure, and build large binary corpora from their source code. We will discuss our current implementation of Assemblage, which is running on 12 nodes (4-8 cores each) in the Syracuse University Research Computing Cluster. Assemblage currently builds ~30k Windows binaries per day and indexes them to relate input (source) and output (binary) code and artifacts. We will also discuss our upcoming work on using Assemblage to generate binary corpora for training machine learning systems for binary analysis (e.g., malware classifiers, neural variable identification).

We will also discuss our in-progress work, Reversi, a Datalog-based binary instrumentation system. Reversi uses Datalog to achieve extremely high-fidelity disassembly of arbitrary binaries, and allows writing declarative rules that perform binary instrumentation based on the results of the program analyses used to derive the disassembly. We present initial results from using Reversi to replace the LLVM-based instrumentation of the AFL fuzzer. We show performance equivalent to AFL's LLVM-based rewriting, and we strongly outperform AFL's QEMU-based binary-only instrumentation.

This talk is organized by David Van Horn.