Talk Title

Scalable, Accurate, Robust Binary Analysis with Transfer Learning

Talk Abstract

Binary program analysis is a fundamental building block for a broad spectrum of security tasks. Essentially, binary analysis encapsulates a diverse set of tasks that aim to understand and analyze the behaviors/semantics of binary programs. Existing approaches often tackle each analysis task independently and heavily employ ad-hoc task-specific brittle heuristics. While recent ML-based approaches have shown some early promise, they too tend to learn spurious features and overfit to specific tasks without understanding the underlying program semantics.

In this talk, I will describe some of our recent projects that use transfer learning on both binary code and execution traces to learn binary program semantics and transfer the learned knowledge for different binary analysis tasks. Our key observation is that by designing pretraining tasks that can learn binary code semantics, we can drastically boost the performance of binary analysis tasks. Our pretraining tasks are fully self-supervised -- they do not need expensive labeling effort and therefore can easily generalize across different architectures, operating systems, compilers, optimizations, and obfuscations. Extensive experiments show that our approach drastically improves the performance of popular tasks like binary disassembly, matching semantically similar binary functions, and recovering types from binary.


Suman Jana is an associate professor in the department of computer science and the data science institute at Columbia University. His primary research interest is at the intersections of computer security and machine learning. His research has received six best paper awards, a CACM research highlight, a Google faculty fellowship, a JPMorgan Chase Faculty Research Award, an NSF CAREER award, and an ARO young investigator award.