Our full paper “Hierarchical Re-estimation of Topic Models for Measuring Topical Diversity” got accepted at ECIR2017

Our full paper got accepted at ECIR2017:

Hosein Azarbonyad, Mostafa Dehghani, Tom Kenter, Maarten Marx, Jaap Kamps, and Maarten de Rijke, “Hierarchical Re-estimation of Topic Models for Measuring Topical Diversity”.

Here is a short summary of this paper:

Quantitative notions of topical diversity in text documents are useful in several contexts, e.g., to assess the interdisciplinarity of a research proposal or to determine the interestingness of a document. An influential formalization of diversity has been introduced in biology (Rao’s diversity measure). It decomposes diversity in terms of elements that belong to categories within a population and formalizes the diversity of a population as the expected distance between two randomly selected elements of the population. This biological notion of diversity has been used to quantify the topical diversity of a text document. To this end, words are considered elements, topics are categories, and a document is a population. When topic modeling is used to measure the topical diversity of text documents, elements, categories, and populations are modeled based on the probability of words in documents, the probability of words in topics, and the probability of topics in documents, respectively.
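
To make the "expected distance between two randomly selected elements" concrete, here is a minimal sketch in Python. It is only an illustration, not the code used in the paper: the function name `rao_diversity`, the three-topic example, and the hand-made distance matrix are all assumptions. In practice, the topic probabilities would come from a topic model and the distances from some measure of dissimilarity between topics.

```python
def rao_diversity(topic_probs, distance):
    """Expected distance between two topics drawn independently from topic_probs.

    topic_probs: list of P(topic | document), assumed to sum to 1.
    distance:    square matrix (list of lists) of pairwise topic distances.
    """
    n = len(topic_probs)
    return sum(
        topic_probs[i] * topic_probs[j] * distance[i][j]
        for i in range(n)
        for j in range(n)
    )


# Hypothetical distances: every pair of distinct topics is equally far apart.
distance = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]

# A document about a single topic versus one spread evenly over three topics.
focused = rao_diversity([1.0, 0.0, 0.0], distance)
mixed = rao_diversity([1 / 3, 1 / 3, 1 / 3], distance)
```

With these toy inputs, the single-topic document gets a diversity of zero, while the evenly mixed document scores higher, which is the behavior one would expect from a diversity measure.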

In probabilistic topic modeling, at estimation time, these distributions are usually assumed to be sparse. First, the content of a document is assumed to be generated by a small subset of words from the vocabulary. Second, each topic is assumed to contain only a few topic-specific words. Finally, each document is assumed to deal with only a few topics. In practice, however, when these distributions are approximated using currently available methods, they are often dense rather than sparse. Dense distributions cause two important problems for the quality of topic models when they are used for measuring topical diversity: generality and impurity. General topics mostly contain general words and are typically assigned to most of the documents in a corpus. Impure topics contain words that are not related to the topic.

In this research, we propose a hierarchical re-estimation process for making the distributions sparser. We re-estimate the parameters of these distributions so that general, collection-wide items are removed and only salient items are kept. For the re-estimation, we use the concept of parsimony to extract only the essential parameters of each distribution. The results show that the proposed topic model re-estimation approach not only outperforms the state-of-the-art approach on the topical diversity task, but also helps in classifying and clustering documents based on topic models more accurately.
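
The idea of parsimony can be sketched with a small EM-style loop that pushes probability mass away from items that a general background distribution already explains. This is a generic illustration of parsimonious re-estimation for a single distribution, not the exact procedure from the paper: the function name `parsimonize`, the mixing weight `lam`, the pruning `threshold`, the iteration count, and the toy counts are all assumptions.

```python
def parsimonize(counts, background, lam=0.1, iterations=50, threshold=1e-4):
    """Re-estimate a distribution so that only salient items keep probability mass.

    counts:     dict item -> raw count in the specific distribution.
    background: dict item -> probability in the general (collection-wide)
                distribution; assumed to be > 0 for every item in counts.
    lam:        assumed weight of the specific model in the mixture.
    """
    total = sum(counts.values())
    probs = {item: c / total for item, c in counts.items()}

    for _ in range(iterations):
        # E-step: share of each count explained by the specific model
        # rather than by the background model.
        expected = {}
        for item, p in probs.items():
            specific = lam * p
            general = (1.0 - lam) * background[item]
            expected[item] = counts[item] * specific / (specific + general)

        # M-step: renormalize the expected counts into probabilities.
        norm = sum(expected.values())
        probs = {item: e / norm for item, e in expected.items()}

        # Drop items whose probability fell below the threshold, then renormalize.
        probs = {item: p for item, p in probs.items() if p >= threshold}
        norm = sum(probs.values())
        probs = {item: p / norm for item, p in probs.items()}

    return probs


# Toy example: "the" and "of" are frequent here but also frequent everywhere.
counts = {"the": 10, "of": 8, "topic": 5, "diversity": 4}
background = {"the": 0.5, "of": 0.4, "topic": 0.05, "diversity": 0.05}
sparse = parsimonize(counts, background)
```

In this toy setting, the general words are pruned away and the remaining mass is concentrated on the words that are specific to the document. The same kind of re-estimation could in principle be applied to each of the distributions mentioned above, with a suitable background distribution chosen for each.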