Programming with Millions of Examples

What we do


The amount of code available on the web grows daily. Open-source hosting sites such as GitHub contain billions of lines of code. Community question-answering sites provide millions of code snippets with corresponding text and metadata. The amount of code available in executable binaries is even greater. In this project, we develop techniques for learning from such "big code" and leveraging the learned models for program analysis, program synthesis, and reverse engineering. Along the way, we explore a range of semantic program representations (e.g., symbolic automata, tracelets, and numerical abstractions), different statistical models that capture regularities in a code base, and different models of similarity. To put the techniques to the test, we apply them to semantic code search, code completion, and reverse engineering.

[Supported by an ERC grant]

Publications


Synthesis with Abstract Examples
Dana Drachsler Cohen, Sharon Shoham, and Eran Yahav.
CAV'17: Computer Aided Verification
Learning Disjunctions of Predicates
Nader Bshouty, Dana Drachsler Cohen, Martin Vechev, and Eran Yahav.
COLT'17: Conference On Learning Theory
Synthesis of Forgiving Data Extractors
Adi Omari, Sharon Shoham, and Eran Yahav.
WSDM'17: ACM Conference on Web Search and Data Mining
Similarity of Binaries through Re-optimization
Yaniv David, Nimrod Partush, and Eran Yahav.
PLDI'17: Programming Languages Design and Implementation
Leveraging a Corpus of Natural Language Descriptions for Program Similarity
Meital Zilberstein and Eran Yahav.
ONWARD'16: Symposium on New Ideas in Programming and Reflections on Software
[PDF] [like2drops]
Extracting Code from Programming Tutorial Videos
Shir Yadid and Eran Yahav.
ONWARD'16: Symposium on New Ideas in Programming and Reflections on Software
[PDF] [video]
Lossless Separation of Web Pages into Layout Code and Data
Adi Omari, Benny Kimelfeld, Sharon Shoham, and Eran Yahav.
KDD'16: ACM SIGKDD Conference on Knowledge Discovery and Data Mining
[PDF]
Cross-Supervised Synthesis of Web-Crawlers
Adi Omari, Sharon Shoham, and Eran Yahav.
ICSE'16: the 38th International Conference on Software Engineering
[PDF]
Statistical Similarity of Binaries
Yaniv David, Nimrod Partush, and Eran Yahav.
PLDI'16: Programming Languages Design and Implementation
[pdf] [TL;DR] [Esh]
D3: Data-Driven Disjunctive Abstraction
Hila Peleg, Sharon Shoham, Eran Yahav
VMCAI'16: International Conference on Verification, Model Checking, and Abstract Interpretation
[pdf] [TL;DR]
Estimating Types in Binaries using Predictive Modeling
Omer Katz, Ran El-Yaniv, Eran Yahav
POPL'16: ACM SIGPLAN Conference on Principles of Programming Languages
[pdf] [TL;DR]
Abstract Semantic Differencing via Speculative Correlation
Nimrod Partush, Eran Yahav
OOPSLA'14: ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications
[pdf] [TL;DR]
Tracelet-Based Code Search in Executables
Yaniv David, Eran Yahav
PLDI'14: ACM Conference on Programming Language Design and Implementation
[pdf] [slides] [code] [TL;DR]
Code Completion with Statistical Language Models
Veselin Raychev, Martin Vechev, Eran Yahav
PLDI'14: ACM Conference on Programming Language Design and Implementation
[pdf] [slides] [TL;DR]
Symbolic Automata for Specification Mining
Hila Peleg, Sharon Shoham, Eran Yahav, Hongseok Yang
SAS'13: The 20th International Static Analysis Symposium
[pdf] [slides] [TL;DR]
Typestate-Based Semantic Code Search over Partial Programs
Alon Mishne, Sharon Shoham, Eran Yahav
OOPSLA'12: ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications
[pdf] [slides] [code] [TL;DR]

Brewing


On the Expressive Power of LSTMs
Automatic Feature Generation for Predicting Program Properties
Synthesis with a Granular Interaction Model

Talks


Programming with Millions of Examples

A relatively old talk at a Zurich workshop, but it covers some of the ideas at a high level.

Programming with Millions of Examples

Talk at ETH Distinguished Colloquium, December 2014

Analysis and Synthesis with "Big Code"

Talk at Marktoberdorf Summer School 2015

Opportunities and Challenges in Program Similarity

Talk at ML4PL workshop

Analysis and Synthesis with "Big Code"

Talk at ECOOP Summer School 2015

Abstract Semantic Differencing for Numerical Programs

Talk at VSSE'13

Software


PRIME

Basic Java Implementation of PRIME

DIZY

Program Differencing

TRACY

Code Search in Binaries

Esh

Statistical Similarity of Binaries

Like2Drops

Cross-Language Similarity

Contact

Computer Science Department
Technion, Israel

+972 48294318
yahave@cs.technion.ac.il
