M.Sc. Thesis Seminar
Named-entity recognition (NER) is a core task in NLP and underpins many downstream applications.
However, in many scenarios, not all of the entities are explicitly mentioned in the text; sometimes they can be inferred from the context or from other indicative words. Consider the following sentence:
"CMA can easily hydrolyze into free acetic acid."
Although water is not mentioned explicitly, one can infer that H2O is an entity involved in the process.
In this work, we present the problem of Latent Entities Extraction (LEE).
We present several methods for determining whether entities are discussed in a text even when they are not mentioned explicitly. Specifically, we design a neural model that extracts multiple entities jointly. We show that this model, combined with a multi-task learning approach and a novel task-grouping algorithm, reaches high performance in identifying latent entities.
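The joint extraction idea can be illustrated with a minimal sketch: a shared text encoder feeds per-group output heads, and each head predicts presence probabilities for its group of entities as a multi-label classifier. All dimensions, weights, the entity names, and the grouping below are illustrative stand-ins, not the thesis's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class JointLatentEntityModel:
    """Sketch: shared encoder + one multi-label head per task group."""

    def __init__(self, input_dim, hidden_dim, entity_groups):
        self.entity_groups = entity_groups  # e.g. [["H2O", "ATP"], ["NADH"]]
        # random weights stand in for a trained model
        self.W_enc = rng.normal(0, 0.1, (input_dim, hidden_dim))
        self.heads = [rng.normal(0, 0.1, (hidden_dim, len(g)))
                      for g in entity_groups]

    def predict(self, x):
        h = np.tanh(x @ self.W_enc)          # shared text representation
        probs = {}
        for group, W in zip(self.entity_groups, self.heads):
            p = sigmoid(h @ W)               # joint scores for the group
            for name, p_i in zip(group, p):
                probs[name] = float(p_i)
        return probs

model = JointLatentEntityModel(8, 4, [["H2O", "ATP"], ["NADH"]])
text_vec = rng.normal(size=8)                # stand-in for an encoded text
probs = model.predict(text_vec)
```

Grouping related entities under a shared head is one way a task-grouping step could let correlated entities share parameters while unrelated ones stay separate.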
Moreover, we propose an additional novel neural architecture for LEE that leverages a context-conditioned autoencoder for classification. Once the model is trained, we exploit its generative capability to produce classifications via a multiple-sampling technique. We show that this model scales well as the number of entities grows.
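The classification-by-generation step can be sketched as follows: a decoder conditioned on a context vector maps latent samples to entity probabilities, and at inference time several latent draws are averaged. The weights are random stand-ins for a trained model and all dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ContextConditionedDecoder:
    """Sketch: decode entity probabilities from latent z plus context."""

    def __init__(self, context_dim, latent_dim, n_entities):
        self.latent_dim = latent_dim
        self.W_ctx = rng.normal(0, 0.1, (context_dim, n_entities))
        self.W_z = rng.normal(0, 0.1, (latent_dim, n_entities))

    def decode(self, z, context):
        # context conditioning: the same context shifts every latent draw
        return sigmoid(z @ self.W_z + context @ self.W_ctx)

    def classify(self, context, n_samples=32):
        # multiple-sampling: average predictions over latent draws
        zs = rng.normal(size=(n_samples, self.latent_dim))
        return self.decode(zs, context).mean(axis=0)

dec = ContextConditionedDecoder(context_dim=6, latent_dim=3, n_entities=5)
context_vec = rng.normal(size=6)             # stand-in for an encoded context
entity_probs = dec.classify(context_vec)
```

Averaging over latent samples is what makes the generative model usable as a classifier: each draw yields one plausible entity set, and the mean estimates per-entity marginal probabilities.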
Our experiments are conducted on two datasets: (1) a large dataset from the biochemical domain, containing text descriptions of biological processes in which all entities involved in each process are labeled, including implicitly mentioned ones; and (2) a new dataset that we construct on top of Twitter data, designed to conform to the setting of the latent entity extraction task.
We believe LEE will significantly benefit NER and its downstream applications, and advance text understanding and inference.