Automatic Feature Generation for Predicting Program Properties

Uri Alon, M.Sc. Thesis Seminar
Thursday, 17.8.2017, 10:00
Taub 601
Prof. E. Yahav

We present a novel approach to automatic feature generation for predicting program properties. Our approach automatically produces features that capture long-distance syntactic relationships between program elements. The features are purely syntactic, so the method applies to any programming language. Inspired by Parse Tree Paths in Natural Language Processing (NLP), we generate features that capture relationships between elements of an Abstract Syntax Tree (AST). We show that these features are general and can: (i) serve several different prediction tasks, (ii) drive two different learning algorithms (for both generative and discriminative models), and (iii) work across different programming languages. We evaluate our approach on the tasks of predicting variable names, method names, and types of expressions. We use the generated features to drive both CRF-based and word2vec-based learning, for programs in four languages: JavaScript, Java, Python and C#. Our evaluation shows that automatically generated features capture semantic similarities and produce better results than existing methods. Because program elements are represented by path features, we believe that our approach can be used in a variety of other machine learning tasks for programming languages, including different applications and different learning models.
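
To make the idea of a path feature concrete, the following is a minimal sketch, not the implementation from the thesis. It uses Python's standard `ast` module to list, for every pair of identifiers in a snippet, the chain of AST node types that leads up from the first identifier to their lowest common ancestor and down to the second. The function name `leaf_paths`, the `^`/`_` notation for "up"/"down" steps, and the `max_length` cutoff are all assumptions made for illustration.

```python
import ast
from itertools import combinations


def leaf_paths(source, max_length=8):
    """Return (name_a, path, name_b) triples for identifier pairs in `source`.

    Hypothetical helper for illustration: a path is the sequence of AST node
    types going up ("^") from one identifier to the lowest common ancestor
    and then down ("_") to the other identifier.
    """
    tree = ast.parse(source)
    leaves = []  # (identifier text, chain of nodes from the root to the leaf)

    def walk(node, ancestors):
        chain = ancestors + [node]
        if isinstance(node, ast.Name):
            leaves.append((node.id, chain))
            return  # a Name node has no identifier descendants
        for child in ast.iter_child_nodes(node):
            walk(child, chain)

    walk(tree, [])

    features = []
    for (a, chain_a), (b, chain_b) in combinations(leaves, 2):
        # Length of the shared prefix; the last shared node is the
        # lowest common ancestor of the two leaves.
        i = 0
        while i < min(len(chain_a), len(chain_b)) and chain_a[i] is chain_b[i]:
            i += 1
        up = [type(n).__name__ for n in reversed(chain_a[i:])]
        top = type(chain_a[i - 1]).__name__
        down = [type(n).__name__ for n in chain_b[i:]]
        if len(up) + len(down) > max_length:
            continue  # assumed cutoff to skip very long paths
        path = "^".join(up + [top]) + "_" + "_".join(down)
        features.append((a, path, b))
    return features


if __name__ == "__main__":
    for triple in leaf_paths("x = y + 1"):
        print(triple)  # ('x', 'Name^Assign_BinOp_Name', 'y')
```

Triples of this shape could then be fed to a learner as features relating two program elements. How the paths are actually abstracted, limited, and consumed by the CRF-based and word2vec-based models is described in the talk itself and is not reproduced here.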
