Abstract:
We propose a new unsupervised learning technique for extracting
information about authors and topics from large text collections.
We model documents as if they were generated by a two-stage
stochastic process. An author is represented by a probability
distribution over topics, and each topic is represented as a
probability distribution over words. The words in a
multi-author paper are assumed to be the result of a mixture of
the individual authors' topic mixtures. The topic-word and author-topic
distributions are learned from data in an unsupervised manner
using a Markov chain Monte Carlo algorithm. We apply the
methodology to two large text corpora: 160,000 abstracts from
the CiteSeer digital library and 2,000 papers from the Neural
Information Processing Systems Conference (NIPS). We discuss in
detail the interpretation of the results discovered by the system,
including specific topic and author models, ranking of authors by
topic and topics by author, parsing of abstracts by topics and
authors, and detection of unusual papers by specific authors.
Extensions to the model that allow generalizations of the notion
of an author are also discussed and illustrated.
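
As an illustration only, the generative process described above can be
sketched in a few lines of Python. The sketch covers sampling from the
model, not the Markov chain Monte Carlo learning step. The function name
sample_document, the toy authors, and the toy vocabulary are hypothetical,
and choosing an author uniformly at random for each word is an assumption
about how the authors' topic mixtures are combined.

```python
import random


def sample_document(authors, author_topic, topic_word, n_words, rng):
    """Sample (author, topic, word) triples for one document.

    authors:      list of author ids credited on the document
    author_topic: dict mapping author id -> list of topic probabilities
    topic_word:   list (one entry per topic) of dicts word -> probability
    """
    topic_ids = list(range(len(topic_word)))
    triples = []
    for _ in range(n_words):
        # Assumption: each word is attributed to one author, chosen uniformly.
        a = rng.choice(authors)
        # Stage 1: draw a topic from that author's distribution over topics.
        z = rng.choices(topic_ids, weights=author_topic[a], k=1)[0]
        # Stage 2: draw a word from that topic's distribution over words.
        vocab = list(topic_word[z])
        weights = [topic_word[z][v] for v in vocab]
        w = rng.choices(vocab, weights=weights, k=1)[0]
        triples.append((a, z, w))
    return triples


# Toy parameters (made up for illustration, not taken from any corpus).
author_topic = {"alice": [0.9, 0.1], "bob": [0.2, 0.8]}
topic_word = [
    {"neuron": 0.5, "synapse": 0.5},
    {"kernel": 0.6, "margin": 0.4},
]

doc = sample_document(["alice", "bob"], author_topic, topic_word,
                      n_words=50, rng=random.Random(0))
```

Each element of doc records which author and topic produced a word. These
per-word assignments are the latent variables that a learning procedure
would have to infer from the observed words and author lists alone.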