Electrical Eng. Building 1061
In the era of “Big Code”, researchers are working to automate the understanding of computer programs. Most current work borrows techniques from natural language processing and deep learning, which have been successful recently, and attempts to process the code directly or through syntactic representations (e.g., ASTs and AST paths). However, to comprehend program semantics robustly, structural features of code have to be taken into account as well, including function calls, branching, and the interchangeable order of statements. In this talk, I will present a novel processing technique for learning code semantics and show how it applies to a variety of program analysis tasks. In particular, we posit that a robust distributional hypothesis of code applies to both human- and machine-generated programs. Following this hypothesis, we define an embedding space, inst2vec, based on an Intermediate Representation (IR) of the code that is independent of the source programming language. We provide a novel definition of contextual flow for this IR, leveraging both the underlying data flow and control flow of the program. We then analyze the embeddings qualitatively using analogies and clustering, and evaluate the learned representation on three high-level tasks. We show that, even without fine-tuning, a single Recurrent Neural Network (RNN) architecture with fixed inst2vec embeddings outperforms specialized approaches on two performance prediction tasks (compute device mapping and optimal thread coarsening) and on algorithm classification from raw code (104 classes), where we set a new state of the art.
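
To make the idea of a statement's context under data and control flow more concrete, here is a minimal Python sketch. It is not the speakers' implementation: the toy IR statements, the edge list, and the helper name `context_pairs` are all hypothetical, and treating edges as undirected is an assumption made for illustration. Each statement is paired with the statements reachable within a fixed number of flow edges; such pairs could then serve as training data for a skip-gram style embedding model.

```python
from collections import defaultdict

# Hypothetical toy IR (LLVM-like, identifiers abstracted to %ID).
statements = {
    0: "%ID = load i32, i32* %ID",
    1: "%ID = add i32 %ID, %ID",
    2: "%ID = icmp slt i32 %ID, %ID",
    3: "br i1 %ID, label %ID, label %ID",
    4: "store i32 %ID, i32* %ID",
}

# Assumed flow edges: each pair is linked by a data dependence or by control flow.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 4)]


def context_pairs(edges, context_size=1):
    """Return sorted (statement, context statement) pairs for all statements
    within `context_size` edges of each other (edges treated as undirected)."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    pairs = set()
    for start in adjacency:
        seen = {start}
        frontier = {start}
        # Breadth-first expansion, one hop per iteration.
        for _ in range(context_size):
            frontier = {n for f in frontier for n in adjacency[f]} - seen
            seen |= frontier
            pairs.update((start, n) for n in frontier)
    return sorted(pairs)


if __name__ == "__main__":
    for target, context in context_pairs(edges, context_size=1):
        print(f"{statements[target]!r}  <->  {statements[context]!r}")
```

Note that the add instruction and the store end up in each other's context even though they are not adjacent in the listing, which is the kind of relationship a purely sequential window over the code text would miss.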