Shahar Harel, M.Sc. Thesis Seminar
Designing a new drug is an expensive and lengthy process. The first stage is drug discovery, in which potential drugs are identified before a candidate drug is selected to progress to clinical trials. As the space of potential molecules is very large (10^23-10^60), a common technique during drug discovery is to start from a molecule which already has some of the desired properties. An interdisciplinary team of scientists then generates hypotheses about the changes required to the prototype. We call this process prototype-driven hypothesis generation.
In this talk, we present an algorithmic unsupervised approach for prototype-driven hypothesis generation. Our method is inspired by the known analogy between a chemist's understanding of a compound and a language speaker's understanding of a word (“Atoms are letters, molecules are the words, supramolecular entities are the sentences and the chapters” [Jean-Marie Lehn 1995]), which motivates the potential of Natural Language Processing for Computational Chemistry.
More formally, we design a conditional deep generative model for molecule generation with diversity attention. The model operates on a given molecule prototype and generates various molecules as candidates. The generated molecules should be novel and share desired properties with the prototype.
Our model extends Variational Autoencoders to allow conditional diverse sampling, that is, sampling an example from the data distribution (drug-like molecules) which is closer to a given input. This allows sampling molecules closer to a prototype drug, and thus increases the probability of generating a valid drug with similar characteristics. Additionally, we add a diversity component that introduces parametrized diversity into the generation process, to allow the sampling to generate novelty with respect to the prototype.
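The idea of sampling near a prototype with tunable diversity can be illustrated with a minimal sketch. This is not the model presented in the talk. It assumes a hypothetical encoder has already mapped the prototype to a Gaussian posterior with mean `mu` and standard deviation `sigma`, and that diversity acts as a simple scale on the sampling noise, which is only one plausible reading of "parametrized diversity". The function names and the toy numbers are illustrative, and the decoder that would turn latent vectors back into molecules is omitted.

```python
import math
import random


def sample_around_prototype(mu, sigma, diversity, n_samples, seed=0):
    """Draw latent vectors near a prototype's (hypothetical) encoding.

    mu, sigma: assumed encoder outputs for the prototype, one value per
               latent dimension (mean and standard deviation).
    diversity: non-negative scale on the noise (an assumption of this
               sketch); 0 returns the mean itself, larger values move
               samples further from the prototype.
    """
    rng = random.Random(seed)
    return [
        [m + diversity * s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
        for _ in range(n_samples)
    ]


def mean_distance(samples, mu):
    """Average Euclidean distance of the samples from the prototype mean."""
    return sum(math.dist(z, mu) for z in samples) / len(samples)


# Toy 3-dimensional latent encoding of a prototype (made-up numbers).
mu = [0.5, -1.0, 2.0]
sigma = [0.1, 0.2, 0.3]

close = sample_around_prototype(mu, sigma, diversity=0.5, n_samples=500)
far = sample_around_prototype(mu, sigma, diversity=3.0, n_samples=500)

print(mean_distance(close, mu), mean_distance(far, mu))
```

In this sketch a small diversity value keeps samples near the prototype's encoding, while a larger value spreads them out, which mirrors the trade-off between similarity to the prototype and novelty described above.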
We show that the molecules generated by the system are valid molecules which simultaneously have a strong connection to the prototype and are novel. In addition, we suggest several ranking functions for the generated molecule population.
Out of the compounds generated by the system, we identified 35 FDA-approved drugs. As an example, our system generated Isoniazid, one of the main drugs for tuberculosis.