Technical Report PHD-2019-03

Title: Statistical Approaches to Reverse Engineering
Authors: Omer Katz
Supervisors: Eran Yahav
PDFCurrently accessibly only within the Technion network
Abstract: Software affects all aspects of our lives. More and more industries and products incorporate software as a core component, and we routinely place our trust in software. If that software fails, either due to an attack or an error, it can have major implications for users, industries, and society. As the impact of software grows rapidly, it is imperative that we raise the level of confidence in its correctness.

One way to make sure that software is free from vulnerabilities and back-doors is to understand how it works, for example by inspecting its source code. However, the vast majority of software is delivered in binary form, without source code. When source code is not available, reverse engineering is required to lift software in low-level representation (typically machine code) to higher-level presentation that can be understood by humans. Reverse engineering is notoriously hard and requires extensive manual effort by experts.

This thesis focuses on techniques that assist reverse engineering. We tackle several reverse engineering challenges and provide statistical solutions. The problems we address in this thesis were left unsolved for many years. Some advances have been made but a solution that can be applied to realistic software remained unachievable, often due to computational complexity. The main idea behind our statistical solutions is to trade off precision for computational feasibility. These solutions are effective in focusing and directing reverse engineering efforts, and can be directly used for improving the level of software security.

We start by discussing the problem of identifying targets of virtual function calls in optimized stripped binaries. This is a crucial step towards understanding the control flow of the binary. It is crucial both for manual reverse engineering and for automatic analysis. Our solution relies on object tracelets --- sequences of operations applied to an object --- which represent how an instance of a type is used in practice, and the subtle differences between usages of different types.

We then tackle the problem of reconstructing type hierarchies in optimized stripped binaries. We measure a similarity between types based on similarity of usage patterns. We then elevate that similarity to a full hierarchy. Such a hierarchy is needed for understanding a given binary. Furthermore, it is essential for securing existing binaries with mechanisms such as control flow integrity.

Finally, we examine the challenge of automatic decompilation. This challenge is the pinnacle of reverse engineering. Being able to automatically decompile low-level code back to human-readable equivalent code makes it accessible to a wider range of programmers. It also allows modification and recompilation of the code as needed. We present a new approach and a new point-of-view for this problem by framing it as a language translation problem.

Throughout this thesis, we borrow insights and methods from machine learning and natural language processing. We combine those with techniques from the field of programming languages, resulting in solutions much more powerful than achievable by each field separately.

All of the approaches specified in this thesis were implemented as prototypes and evaluated using real-world binaries and code. Our evaluation shows the benefits of each approach compared to traditional solutions, and its effectiveness in understanding and securing software.

CopyrightThe above paper is copyright by the Technion, Author(s), or others. Please contact the author(s) for more information

Remark: Any link to this technical report should be to this page (, rather than to the URL of the PDF files directly. The latter URLs may change without notice.

To the list of the PHD technical reports of 2019
To the main CS technical reports page

Computer science department, Technion