Programming with Millions of Examples

What we do


The vast amount of code available on the web grows daily. Open-source hosting sites such as GitHub contain billions of lines of code. Community question-answering sites provide millions of code snippets with corresponding text and metadata. The amount of code available in executable binaries is even greater. In this project, we develop techniques for leveraging such "big code" for program analysis, program synthesis, and reverse engineering. Along the way, we explore a range of semantic program representations based on symbolic automata, tracelets, and numerical abstractions, as well as different notions of code similarity based on these representations. To put the techniques to the test, we explore their applications to semantic code search in both source code and stripped binaries, code completion, and reverse engineering.

Publications


Statistical Similarity of Binaries
Yaniv David, Nimrod Partush, and Eran Yahav.
PLDI'16: ACM Conference on Programming Language Design and Implementation
[pdf] [TL;DR] [Esh]
D3: Data-Driven Disjunctive Abstraction
Hila Peleg, Sharon Shoham, Eran Yahav
VMCAI'16: International Conference on Verification, Model Checking, and Abstract Interpretation
[pdf] [TL;DR]
Estimating Types in Binaries using Predictive Modeling
Omer Katz, Ran El-Yaniv, Eran Yahav
POPL'16: ACM SIGPLAN Conference on Principles of Programming Languages
[pdf] [TL;DR]
Abstract Semantic Differencing via Speculative Correlation
Nimrod Partush, Eran Yahav
OOPSLA'14: ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications
[pdf] [TL;DR]
Tracelet-Based Code Search in Executables
Yaniv David, Eran Yahav
PLDI'14: ACM Conference on Programming Language Design and Implementation
[pdf] [slides] [code] [TL;DR]
Code Completion with Statistical Language Models (led by ETH team)
Veselin Raychev, Martin Vechev, Eran Yahav
PLDI'14: ACM Conference on Programming Language Design and Implementation
[pdf] [slides] [TL;DR]
Symbolic Automata for Specification Mining
Peleg H., Shoham S., Yahav E., Yang H.
SAS'13: The 20th International Static Analysis Symposium
[pdf] [slides] [TL;DR]
Typestate-Based Semantic Code Search over Partial Programs
Mishne A., Shoham S., Yahav E.
OOPSLA'12: ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications
[pdf] [slides] [code] [TL;DR]

Brewing


Leveraging Big Code for Program Similarity
Meital Ben-Sinai and Eran Yahav.
Try Like2Drops
Extracting Code from Programming Tutorial Videos
Shir Yadid and Eran Yahav.
Statistical Similarity of Binaries at Scale
Yaniv David, Nimrod Partush, and Eran Yahav.

Talks


Programming with Millions of Examples

A relatively old talk at a Zurich workshop, but it covers some of the ideas at a high level.

Programming with Millions of Examples

Talk at ETH Distinguished Colloquium, December 2014

Analysis and Synthesis with "Big Code"

Talk at Marktoberdorf Summer School 2015

Opportunities and Challenges in Program Similarity

Talk at ML4PL workshop

Analysis and Synthesis with "Big Code"

Talk at ECOOP Summer School 2015

Abstract Semantic Differencing for Numerical Programs

Talk at VSSE'13

Software


PRIME

Basic Java Implementation of PRIME

DIZY

Program Differencing

TRACY

Code Search in Binaries

Esh

Statistical Similarity of Binaries

Like2Drops

Cross-Language Similarity

Contact

Computer Science Department
Technion, Israel

+972 48294318
yahave@cs.technion.ac.il
