Technical Report MSC-2015-16

TR#:MSC-2015-16
Class:MSC
Title: Code Similarity via Natural Language Descriptions
Authors: Meital Zilberstein (Ben Sinai)
Supervisors: Eran Yahav
PDF: Currently accessible only within the Technion network
Abstract: Code similarity is a central challenge in many programming-related applications, such as code search, automatic translation, and programming education. In this work, we present a novel approach for establishing the similarity of code fragments by computing the textual similarity between their corresponding textual descriptions. To find textual descriptions of code fragments, we leverage collective knowledge captured in question-answering sites, blog posts, and other sources. Because our notion of code similarity is based on the similarity of the corresponding textual descriptions, our approach can determine semantic relatedness and similarity of code across different libraries and even across different programming languages, a task considered extremely difficult using traditional approaches. To support the text-based similarity function, we also apply static analysis to the code fragments themselves and use it as an additional measure of similarity.
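
The core idea, comparing code fragments through their descriptions rather than through the code itself, can be illustrated with a minimal sketch. The abstract does not state which text-similarity measure the report uses, so the bag-of-words cosine similarity below is an assumption chosen for simplicity, and the function names and the three example descriptions are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(desc_a, desc_b):
    """Cosine similarity between the bag-of-words vectors of two descriptions.

    This is an illustrative stand-in, not the measure used in the report.
    """
    a, b = Counter(tokenize(desc_a)), Counter(tokenize(desc_b))
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty description is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical descriptions attached to fragments in different languages.
java_desc = "How to read a text file line by line"
python_desc = "Read a file line by line into a list"
unrelated_desc = "Sort a dictionary by value"

related = cosine_similarity(java_desc, python_desc)
unrelated = cosine_similarity(java_desc, unrelated_desc)
print(related > unrelated)
```

Because only the descriptions are compared, the two fragments being described could be written in different languages or against different libraries without changing the computation.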

To experiment with our approach, we implemented it using data obtained from the popular question-answering site Stack Overflow, and used it to determine the similarity of 100,000 pairs of code fragments written in multiple programming languages. We developed a crowdsourcing system, Like2drops, that allows users to label the similarity of code fragments. We utilized these classifications to build a massive corpus of 6,500 labeled program pairs. Our results show that our technique is effective in determining similarity and relatedness, achieving more than 80% precision, recall, and accuracy.
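
For readers unfamiliar with the three reported metrics, the sketch below shows how precision, recall, and accuracy are computed from predicted and crowd-labeled similarity judgments. The function name and the five toy labels are hypothetical and are not taken from the report's corpus.

```python
def evaluate(predicted, labeled):
    """Return (precision, recall, accuracy) for binary similarity labels."""
    pairs = list(zip(predicted, labeled))
    tp = sum(1 for p, l in pairs if p and l)          # similar, predicted similar
    fp = sum(1 for p, l in pairs if p and not l)      # not similar, predicted similar
    fn = sum(1 for p, l in pairs if not p and l)      # similar, predicted not similar
    tn = sum(1 for p, l in pairs if not p and not l)  # not similar, predicted not similar
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy


# Toy data for illustration only: True means "the pair is similar".
predicted = [True, True, False, False, True]
labeled = [True, False, False, True, True]

precision, recall, accuracy = evaluate(predicted, labeled)
print(precision, recall, accuracy)
```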

Copyright: The above paper is copyrighted by the Technion, the author(s), or others. Please contact the author(s) for more information.

Remark: Any link to this technical report should be to this page (http://www.cs.technion.ac.il/users/wwwb/cgi-bin/tr-info.cgi/2015/MSC/MSC-2015-16), rather than to the URL of the PDF files directly. The latter URLs may change without notice.


Computer science department, Technion