Technical Report MSC-2021-18

Title: Extracting Bible Quotes from Historical Commentary
Authors: Asaf Yeshurun
Supervisors: Benny Kimelfeld
Abstract: The Hebrew Bible (Tanach) has been quoted extensively in religious texts and commentaries throughout history. Many of these texts are now publicly available online, yet the Bible quotations within them are often only partially identified, if at all. Knowing the exact quotations can be highly beneficial to scholars who study the Bible. We have developed and empirically analyzed several solutions to this task, using both rule-based heuristics and machine learning. End to end, our main model consists of three stages: (a) rule-based candidate generation, (b) context extraction using available historical commentary, and (c) an artificial neural network for candidate scoring. To evaluate our models, we constructed labeled data based on the Hebrew Bible commentary known as Midrash Raba, which contains more than half a million words and over 30,000 quotations. Our solution achieves an F-score of over 80% and considerably outperforms several state-of-the-art approaches for tasks of a similar nature. It also scores well on unfamiliar corpora, provided they share the writing style and vocabulary prevalent in Midrash Raba. As a contribution of independent interest, our solution includes a novel word-embedding method that seeks to utilize the nature of our text and its context.
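
Illustration: the abstract does not specify the rules used for candidate generation or the exact F-score variant, so the following is only a minimal sketch under stated assumptions. It assumes that stage (a) can be approximated by exact word n-gram matching against an index of verses, and that the reported F-score is the balanced F1 over sets of predicted and gold quotations. The function names (`build_ngram_index`, `generate_candidates`, `f_score`), the verse reference format, and the English placeholder text are all hypothetical. Stages (b) and (c) are not sketched here.

```python
from collections import defaultdict


def build_ngram_index(verses, n):
    """Map every word n-gram in the verses to the verse references containing it."""
    index = defaultdict(set)
    for ref, text in verses.items():
        words = text.split()
        for i in range(len(words) - n + 1):
            index[tuple(words[i:i + n])].add(ref)
    return index


def generate_candidates(commentary, index, n):
    """Return (start, end, verse_ref) for each commentary n-gram found in the index.

    Assumption: exact n-gram matching stands in for the report's rule-based
    heuristics, which the abstract does not describe.
    """
    words = commentary.split()
    candidates = []
    for i in range(len(words) - n + 1):
        for ref in sorted(index.get(tuple(words[i:i + n]), ())):
            candidates.append((i, i + n, ref))
    return candidates


def f_score(predicted, gold):
    """Balanced F-score (F1) between a set of predictions and a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data; the real corpus is in Hebrew.
verses = {"Gen 1:1": "in the beginning god created the heaven and the earth"}
commentary = "rabbi said in the beginning god created means"

index = build_ngram_index(verses, n=3)
candidates = generate_candidates(commentary, index, n=3)
```

In a full pipeline along the lines the abstract describes, overlapping candidates such as these would be passed on, together with their context, to a scoring model that decides which spans are true quotations.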
