Maarten de Rijke

Information retrieval


The Birth of Collective Memories

David Graus, Daan Odijk and I just uploaded a manuscript to arXiv on “The Birth of Collective Memories: Analyzing Emerging Entities in Text Streams.”

We study how collective memories are formed online. We do so by tracking entities that emerge in public discourse, that is, in online text streams such as social media and news streams, before they are incorporated into Wikipedia, which, we argue, can be viewed as an online place for collective memory. By tracking how entities emerge in public discourse, i.e., the temporal patterns between their first mention in online text streams and subsequent incorporation into collective memory, we gain insights into how the collective remembrance process happens online. Specifically, we analyze nearly 80,000 entities as they emerge in online text streams before they are incorporated into Wikipedia. The online text streams we use for our analysis comprise social media and news streams, and span more than 579 million documents over a period of 18 months.

We discover two main emergence patterns. Some entities emerge in a “bursty” fashion: they appear in public discourse without precedent, burst into activity, and transition into collective memory. Other entities display a “delayed” pattern: they appear in public discourse, experience a period of inactivity, and then resurface before transitioning into our cultural collective memory.
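To make the two patterns concrete, here is a toy sketch of how one might label an entity's mention time series; the 30-day threshold and the day-offset representation are illustrative assumptions, not the method from the paper:

```python
def emergence_pattern(mention_days, gap_threshold=30):
    """Classify an entity's emergence from the days (relative to its first
    mention) on which it was mentioned: 'bursty' if activity runs through
    without a long pause, 'delayed' if a quiet period separates the first
    mention from a later resurgence. The 30-day threshold is illustrative."""
    days = sorted(mention_days)
    gaps = [b - a for a, b in zip(days, days[1:])]
    return "delayed" if gaps and max(gaps) >= gap_threshold else "bursty"

print(emergence_pattern([0, 1, 2, 3, 5]))     # steady run of mentions → bursty
print(emergence_pattern([0, 2, 90, 91, 93]))  # long quiet gap → delayed
```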

Please see https://arxiv.org/abs/1701.04039 and let us know what you think.

Survey on query auto completion online

A survey on query auto completion in information retrieval written by Fei Cai and myself is online now:

  • Fei Cai and Maarten de Rijke. A survey of query auto completion in information retrieval. Foundations and Trends in Information Retrieval, 10(4):273-363, September 2016. Bibtex, PDF
    @article{cai-survey-2016,
    Author = {Cai, Fei and de Rijke, Maarten},
    Date-Added = {2016-01-27 11:52:34 +0000},
    Date-Modified = {2016-09-20 14:20:08 +0000},
    Journal = {Foundations and Trends in Information Retrieval},
    Month = {September},
    Number = {4},
    Pages = {273--363},
    Title = {A survey of query auto completion in information retrieval},
    Volume = {10},
    Year = {2016}}

In information retrieval, query auto completion, also known as type-ahead and auto-complete suggestion, refers to the following functionality: given a prefix consisting of a number of characters entered into a search box, the user interface proposes alternative ways of extending the prefix to a full query. Ranking query completions is a challenging task due to the limited length of prefixes entered by users, the large volume of possible query completions matching a prefix, and the broad range of possible search intents. In recent years, a large number of query auto completion approaches have been proposed that produce ranked lists of alternative query completions by mining query logs.
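The canonical log-based baseline, usually called Most Popular Completion, simply ranks the completions matching a prefix by their past frequency in the query log. A minimal sketch (the toy query log is made up):

```python
from collections import Counter

def most_popular_completion(qlog, prefix, k=3):
    """Rank completions of `prefix` by their frequency in the query log,
    the classic Most Popular Completion baseline."""
    counts = Counter(qlog)
    candidates = [(q, c) for q, c in counts.items() if q.startswith(prefix)]
    # Sort by descending frequency, breaking ties alphabetically.
    candidates.sort(key=lambda qc: (-qc[1], qc[0]))
    return [q for q, _ in candidates[:k]]

qlog = ["chess openings", "chess rules", "chess openings",
        "checkers", "chess rules", "chess openings"]
print(most_popular_completion(qlog, "ches"))
# → ['chess openings', 'chess rules']
```

Time-sensitive and user-specific approaches, two of the trends the survey highlights, replace these raw frequencies with recency-weighted or personalized counts.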

In this survey, we review work on query auto completion that has been published before 2016. We focus mainly on web search and provide a formal definition of the query auto completion problem. We describe two dominant families of approaches to the query auto completion problem, one based on heuristic models and the other based on learning to rank. We also identify dominant trends in published work on query auto completion, viz. the use of time-sensitive signals and the use of user-specific signals. We describe the datasets and metrics that are used to evaluate algorithms for query auto completion. We also devote a chapter to efficiency and a chapter to presentation and interaction aspects of query auto completion. We end by discussing related tasks as well as potential research directions to further the area.

Special issue on Neural Information Retrieval

The Information Retrieval Journal has put out a call for contributions to a special issue on neural information retrieval. Topics for this issue include the application of neural network models in IR tasks, including but not limited to:

  • Full text document retrieval, passage retrieval, question answering
  • Web search, paid search, searching social media, entity search
  • Learning to rank combined with neural network based representation learning
  • User and task modelling, personalized search and recommendations, diversity
  • Query formulation assistance, query recommendation, conversational search
  • Multimedia and cross-media retrieval

The deadline is October 15, 2016. Here’s a PDF with the call.

WWW 2016 papers online

We have three full papers at WWW this year. They are all online now:

  • Alexey Borisov, Pavel Serdyukov, and Maarten de Rijke. Using metafeatures to increase the effectiveness of latent semantic models in web search. In WWW 2016: 25th International World Wide Web Conference, pages 1081-1091. ACM, April 2016. Bibtex, PDF
    @inproceedings{borisov-using-2016,
    Author = {Borisov, Alexey and Serdyukov, Pavel and de Rijke, Maarten},
    Booktitle = {WWW 2016: 25th International World Wide Web Conference},
    Date-Added = {2015-12-15 21:51:14 +0000},
    Date-Modified = {2016-05-22 18:00:45 +0000},
    Month = {April},
    Pages = {1081--1091},
    Publisher = {ACM},
    Title = {Using metafeatures to increase the effectiveness of latent semantic models in web search},
    Year = {2016}}

In web search, latent semantic models have been proposed to bridge the lexical gap between queries and documents that is due to the fact that searchers and content creators often use different vocabularies and language styles to express the same concept. Modern search engines simply use the outputs of latent semantic models as features for a so-called global ranker. We argue that this is not optimal, because a single value output by a latent semantic model may be insufficient to describe all aspects of the model’s prediction, and thus some information captured by the model is not used effectively by the search engine.

To increase the effectiveness of latent semantic models in web search, we propose to create metafeatures—feature vectors that describe the structure of the model’s prediction for a given query-document pair—and pass them to the global ranker along with the models’ scores. We provide simple guidelines to represent the latent semantic model’s prediction with more than a single number, and illustrate these guidelines using several latent semantic models.

We test the impact of the proposed metafeatures on a web document ranking task using four latent semantic models. Our experiments show that (1) through the use of metafeatures, the performance of each individual latent semantic model can be improved by 10.2% and 4.2% in NDCG scores at truncation levels 1 and 10, respectively; and (2) through the use of metafeatures, the performance of a combination of latent semantic models can be improved by 7.6% and 3.8% in NDCG scores at truncation levels 1 and 10, respectively.
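As a hypothetical illustration of the idea (not the paper's actual guidelines), a dot-product latent model's prediction can be exposed to the global ranker as a feature vector instead of a single score:

```python
def metafeatures(q_vec, d_vec):
    """Describe the structure of a latent semantic model's prediction for a
    query-document pair, rather than only its scalar score: per-dimension
    contributions plus simple statistics over them. The feature choices
    here are illustrative."""
    prods = [q * d for q, d in zip(q_vec, d_vec)]
    return {
        "score": sum(prods),               # the usual single number
        "max_component": max(prods),       # strongest-matching latent dimension
        "min_component": min(prods),       # strongest mismatch
        "positive_mass": sum(p for p in prods if p > 0.0),
        "components": prods,               # full per-dimension breakdown
    }

f = metafeatures([1.0, 0.0, 2.0], [0.5, 1.0, -1.0])
print(f["score"], f["max_component"], f["positive_mass"])  # → -1.5 0.5 0.5
```

A global ranker that sees only `score` cannot tell whether a middling value comes from uniform weak agreement or from strong agreement cancelled by strong disagreement; the extra components make that distinction learnable.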

  • Alexey Borisov, Ilya Markov, Maarten de Rijke, and Pavel Serdyukov. A neural click model for web search. In WWW 2016: 25th International World Wide Web Conference, pages 531-541. ACM, April 2016. Bibtex, PDF
    @inproceedings{borisov-neural-2016,
    Author = {Borisov, Alexey and Markov, Ilya and de Rijke, Maarten and Serdyukov, Pavel},
    Booktitle = {WWW 2016: 25th International World Wide Web Conference},
    Date-Added = {2015-12-15 21:48:25 +0000},
    Date-Modified = {2016-05-22 18:00:26 +0000},
    Month = {April},
    Pages = {531--541},
    Publisher = {ACM},
    Title = {A neural click model for web search},
    Year = {2016}}

Understanding user browsing behavior in web search is key to improving web search effectiveness. Many click models have been proposed to explain or predict user clicks on search engine results. They are based on the probabilistic graphical model (PGM) framework, in which user behavior is represented as a sequence of observable and hidden events. The PGM framework provides a mathematically solid way to reason about a set of events given some information about other events. But the structure of the dependencies between the events has to be set manually. Different click models use different hand-crafted sets of dependencies.

We propose an alternative based on the idea of distributed representations: to represent the user’s information need and the information available to the user with a vector state. The components of the vector state are learned to represent concepts that are useful for modeling user behavior. And user behavior is modeled as a sequence of vector states associated with a query session: the vector state is initialized with a query, and then iteratively updated based on information about interactions with the search engine results. This approach allows us to directly understand user browsing behavior from click-through data, i.e., without the need for a predefined set of rules as is customary for PGM-based click models.

We illustrate our approach using a set of neural click models. Our experimental results show that the neural click model, trained on the same data as traditional PGM-based click models, has better performance on the click prediction task (i.e., predicting user clicks on search engine results) and the relevance prediction task (i.e., ranking documents by their relevance to a query). An analysis of the best performing neural click model shows that it learns similar concepts to those used in traditional click models, and that it also learns other concepts that cannot be designed manually.
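A toy version of the vector-state recurrence makes the idea tangible; the dimensions, the tanh/sigmoid choices, and the fixed weights below are illustrative stand-ins for a trained network:

```python
import math

def update_state(state, result_features, W_s, W_x, b):
    """One step of a (toy) vector-state click model: fold the features of
    the next search result into the session state with a tanh recurrence."""
    n = len(state)
    return [math.tanh(sum(W_s[i][j] * state[j] for j in range(n))
                      + sum(W_x[i][j] * x for j, x in enumerate(result_features))
                      + b[i])
            for i in range(n)]

def click_probability(state, readout):
    """Sigmoid readout of the state: the predicted click probability."""
    logit = sum(r * s for r, s in zip(readout, state))
    return 1.0 / (1.0 + math.exp(-logit))

# Tiny fixed weights for a 2-d state and 2-d result features (illustrative).
W_s = [[0.5, -0.2], [0.1, 0.3]]
W_x = [[0.8, 0.0], [0.0, 0.8]]
b = [0.0, 0.0]
readout = [1.0, -1.0]

state = [0.0, 0.0]                          # initialized from the query in the paper
for features in [[1.0, 0.2], [0.3, 0.9]]:   # iterate over the result list
    state = update_state(state, features, W_s, W_x, b)
print([round(s, 3) for s in state])
```

The key contrast with PGM-based models is visible even in this sketch: no dependency structure between events is declared by hand; everything flows through the learned state update.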

  • Christophe Van Gysel, Maarten de Rijke, and Marcel Worring. Unsupervised, efficient and semantic expertise retrieval. In WWW 2016: 25th International World Wide Web Conference, pages 1069-1079. ACM, April 2016. Bibtex, PDF
    @inproceedings{vangysel-unsupervised-2016,
    Author = {Van Gysel, Christophe and de Rijke, Maarten and Worring, Marcel},
    Booktitle = {WWW 2016: 25th International World Wide Web Conference},
    Date-Added = {2015-12-15 21:52:52 +0000},
    Date-Modified = {2016-05-22 18:01:03 +0000},
    Month = {April},
    Pages = {1069--1079},
    Publisher = {ACM},
    Title = {Unsupervised, efficient and semantic expertise retrieval},
    Year = {2016}}

We introduce an unsupervised discriminative model for the task of retrieving experts in online document collections. We exclusively employ textual evidence and avoid explicit feature engineering by learning distributed word representations in an unsupervised way. We compare our model to state-of-the-art unsupervised statistical vector space and probabilistic generative approaches. Our proposed log-linear model achieves the retrieval performance levels of state-of-the-art document-centric methods with the low inference cost of so-called profile-centric approaches. It yields a statistically significant improved ranking over vector space and generative models in most cases, matching the performance of supervised methods on various benchmarks. That is, by using solely text we can do as well as methods that work with external evidence and/or relevance feedback. A contrastive analysis of rankings produced by discriminative and generative approaches shows that they have complementary strengths due to the ability of the unsupervised discriminative model to perform semantic matching.
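A minimal sketch of log-linear scoring with embeddings may help; the toy vectors, vocabulary, and softmax normalization below are assumptions for illustration, not the paper's trained model:

```python
import math

def expert_log_likelihood(query_terms, word_vecs, expert_vec):
    """Score an expert for a query under a log-linear model: P(w | expert)
    is a softmax over the vocabulary of word-expert dot products, and the
    query score is the sum of log-probabilities of its terms."""
    logits = {w: sum(a * b for a, b in zip(v, expert_vec))
              for w, v in word_vecs.items()}
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(l - m) for l in logits.values()))
    return sum(logits[t] - log_z for t in query_terms if t in logits)

# Toy 2-d embeddings: expert A "writes" about retrieval, expert B about vision.
words = {"retrieval": [1.0, 0.0], "ranking": [0.9, 0.1],
         "image": [0.0, 1.0], "pixels": [0.1, 0.9]}
expert_a = [2.0, 0.0]
expert_b = [0.0, 2.0]
q = ["retrieval", "ranking"]
score_a = expert_log_likelihood(q, words, expert_a)
score_b = expert_log_likelihood(q, words, expert_b)
print(score_a > score_b)  # → True
```

Because each expert is a single vector, inference is one pass over the query terms per candidate, which is where the profile-centric efficiency comes from.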

WSDM 2016 paper on dynamic collective entity representations for entity ranking online

The following WSDM 2016 paper is online now:

  • David Graus, Manos Tsagkias, Wouter Weerkamp, Edgar Meij, and Maarten de Rijke. Learning dynamic collective entity representations for entity ranking. In WSDM 2016: The 9th International Conference on Web Search and Data Mining, pages 595-604. ACM, February 2016. Bibtex, PDF
    @inproceedings{graus-dynamic-2016,
    Author = {Graus, David and Tsagkias, Manos and Weerkamp, Wouter and Meij, Edgar and de Rijke, Maarten},
    Booktitle = {WSDM 2016: The 9th International Conference on Web Search and Data Mining},
    Date-Added = {2015-10-12 18:42:35 +0000},
    Date-Modified = {2016-05-22 17:59:44 +0000},
    Month = {February},
    Pages = {595--604},
    Publisher = {ACM},
    Title = {Learning dynamic collective entity representations for entity ranking},
    Year = {2016}}

Entity ranking, i.e., successfully positioning a relevant entity at the top of the ranking for a given query, is inherently difficult due to the potential mismatch between the entity’s description in a knowledge base, and the way people refer to the entity when searching for it. To counter this issue we propose a method for constructing dynamic collective entity representations. We collect entity descriptions from a variety of sources and combine them into a single entity representation by learning to weight the content from different sources that are associated with an entity for optimal retrieval effectiveness. Our method is able to add new descriptions in real time and learn the best representation as time evolves so as to capture the dynamics of how people search for entities. Incorporating dynamic description sources into dynamic collective entity representations improves retrieval effectiveness by 7% over a state-of-the-art learning to rank baseline. Periodic retraining of the ranker enables higher ranking effectiveness for dynamic collective entity representations.
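The core scoring idea can be sketched as a weighted combination of per-source matching scores; in the paper the per-source weights are learned (and periodically retrained), whereas here they are fixed toy values and the term-frequency scoring is a deliberate simplification:

```python
def entity_score(query_terms, fields, weights):
    """Score an entity as a weighted sum of per-source term-frequency
    scores over its collected descriptions. Source names, texts, and
    weights below are illustrative."""
    score = 0.0
    for source, text in fields.items():
        tokens = text.lower().split()
        if not tokens:
            continue
        tf = sum(tokens.count(t) for t in query_terms) / len(tokens)
        score += weights.get(source, 0.0) * tf
    return score

entity = {
    "kb_description": "dutch football club from amsterdam",
    "tweets": "ajax wins again ajax fans celebrate",
    "anchors": "afc ajax",
}
weights = {"kb_description": 1.0, "tweets": 0.5, "anchors": 2.0}
print(round(entity_score(["ajax"], entity, weights), 3))  # → 1.167
```

New tweets or anchors can simply be appended to the relevant field, which is what makes the representation dynamic.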

Three full papers accepted at WWW 2016

Good news. Three full papers were accepted at the 25th World Wide Web Conference:

  • Alexey Borisov, Pavel Serdyukov and Maarten de Rijke: Using Metafeatures to Increase the Effectiveness of Latent Semantic Models in Web Search
  • Alexey Borisov, Ilya Markov, Maarten de Rijke and Pavel Serdyukov: A Distributed Representation Approach to Modeling User Browsing Behavior in Web Search
  • Christophe Van Gysel, Maarten de Rijke and Marcel Worring: Unsupervised, Efficient and Semantic Expertise Retrieval

NIPS workshop paper on sources of variability in large-scale machine learning experiments online

Our contribution to the NIPS LearningSys 2015 Workshop on Machine Learning Systems, by Damien Lefortier, Anthony Truchet, and Maarten de Rijke, is available online now:

  • Damien Lefortier, Anthony Truchet, and Maarten de Rijke. Sources of variability in large-scale machine learning systems. In NIPS LearningSys 2015 Workshop on Machine Learning Systems, December 2015. Bibtex, PDF
    @inproceedings{lefortier-sources-2015,
    Author = {Lefortier, Damien and Truchet, Anthony and de Rijke, Maarten},
    Booktitle = {NIPS LearningSys 2015 Workshop on Machine Learning Systems},
    Date-Added = {2015-11-03 15:49:55 +0000},
    Date-Modified = {2016-04-03 17:49:40 +0000},
    Month = {December},
    Title = {Sources of variability in large-scale machine learning systems},
    Year = {2015}}

We investigate sources of variability of a state-of-the-art distributed machine learning system for learning click and conversion prediction models for display advertising. We focus on three main sources of variability: asynchronous updates in the learning algorithm, downsampling of the data, and the non-deterministic order of examples received by each learning instance. We observe that some sources of variability can lead to significant differences between the models obtained and cause issues for, e.g., regression testing, debugging, and offline evaluation. We present effective solutions to stabilize the system and remove these sources of variability, thus fully solving the issues related to regression testing and to debugging. Moreover, we discuss potential limitations of this stabilization for drawing conclusions, in which case we may want to take the variability produced by the machine learning system into account in confidence intervals.
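One concrete instance of the third source, the non-deterministic order of examples, is floating-point accumulation: addition of floats is not associative, so the same set of updates arriving in a different order can produce a different model parameter. A self-contained demonstration:

```python
def sum_in_order(values, order):
    """Accumulate the same values in a given order; with floating point,
    the arrival order of (asynchronous) updates can change the result."""
    total = 0.0
    for i in order:
        total += values[i]
    return total

# The same four updates, two arrival orders, two different totals:
updates = [1e16, 1.0, -1e16, 1.0]
print(sum_in_order(updates, [0, 1, 2, 3]))  # → 1.0 (a 1.0 is absorbed by 1e16)
print(sum_in_order(updates, [0, 2, 1, 3]))  # → 2.0 (cancel first, keep both 1.0s)
```

Stabilizing a system against this kind of variability, e.g. by fixing the order or the reduction tree, is exactly what makes regression testing and debugging tractable again.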

NIPS paper on Copeland Dueling Bandits online

Our NIPS 2015 paper “Copeland Dueling Bandits” by Masrour Zoghi, Zohar Karnin, Shimon Whiteson and Maarten de Rijke is available online now:

  • Masrour Zoghi, Shimon Whiteson, Zohar Karnin, and Maarten de Rijke. Copeland dueling bandits. In NIPS 2015, pages 307-315, December 2015. Bibtex, PDF
    @inproceedings{zoghi-copeland-2015,
    Author = {Zoghi, Masrour and Whiteson, Shimon and Karnin, Zohar and de Rijke, Maarten},
    Booktitle = {NIPS 2015},
    Date-Added = {2015-09-04 16:45:01 +0000},
    Date-Modified = {2017-02-18 15:22:54 +0000},
    Month = {December},
    Pages = {307--315},
    Title = {Copeland dueling bandits},
    Year = {2015}}

In the paper we address a version of the dueling bandit problem in which a Condorcet winner may not exist. Two algorithms are proposed that instead seek to minimize regret with respect to the Copeland winner, which, unlike the Condorcet winner, is guaranteed to exist. The first, Copeland Confidence Bound (CCB), is designed for small numbers of arms, while the second, Scalable Copeland Bandits (SCB), works better for large-scale problems. We provide theoretical results bounding the regret accumulated by CCB and SCB, both substantially improving existing results. Such existing results either offer bounds of the form O(K log T) but require restrictive assumptions, or offer bounds of the form O(K^2 log T) without requiring such assumptions. Our results offer the best of both worlds: O(K log T) bounds without restrictive assumptions.
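The Copeland winner itself is easy to compute from a known preference matrix (the bandit setting, of course, must estimate these preferences from duels). A small sketch with a cyclic preference matrix that has no Condorcet winner:

```python
def copeland_winners(pref):
    """Given a preference matrix pref[i][j] = P(arm i beats arm j),
    return the arm(s) with the highest Copeland score, i.e. the largest
    number of pairwise victories (P > 0.5), plus all the scores."""
    k = len(pref)
    scores = [sum(1 for j in range(k) if j != i and pref[i][j] > 0.5)
              for i in range(k)]
    best = max(scores)
    return [i for i, s in enumerate(scores) if s == best], scores

# Cyclic preferences: 0 beats 1, 1 beats 2, 2 beats 0, and all beat arm 3,
# so no arm beats every other arm and no Condorcet winner exists.
pref = [
    [0.5, 0.6, 0.4, 0.7],
    [0.4, 0.5, 0.6, 0.7],
    [0.6, 0.4, 0.5, 0.7],
    [0.3, 0.3, 0.3, 0.5],
]
winners, scores = copeland_winners(pref)
print(winners, scores)  # → [0, 1, 2] [2, 2, 2, 0]
```

Every matrix has at least one Copeland winner, which is what makes Copeland regret a well-defined target when the Condorcet winner is absent.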

WSDM 2016 paper on Multileave Gradient Descent for Fast Online Learning to Rank online

One of our WSDM 2016 papers is online now:

  • Anne Schuth, Harrie Oosterhuis, Shimon Whiteson, and Maarten de Rijke. Multileave gradient descent for fast online learning to rank. In WSDM 2016: The 9th International Conference on Web Search and Data Mining, pages 457-466. ACM, February 2016. Bibtex, PDF
    @inproceedings{schuth-multileave-2016,
    Author = {Schuth, Anne and Oosterhuis, Harrie and Whiteson, Shimon and de Rijke, Maarten},
    Booktitle = {WSDM 2016: The 9th International Conference on Web Search and Data Mining},
    Date-Added = {2015-10-12 18:46:10 +0000},
    Date-Modified = {2016-05-22 17:59:17 +0000},
    Month = {February},
    Pages = {457--466},
    Publisher = {ACM},
    Title = {Multileave gradient descent for fast online learning to rank},
    Year = {2016}}

Modern search systems are based on dozens or even hundreds of ranking features. The dueling bandit gradient descent (DBGD) algorithm has been shown to effectively learn combinations of these features solely from user interactions. DBGD explores the search space by comparing a possibly improved ranker to the current production ranker. To this end, it uses interleaved comparison methods, which can infer with high sensitivity a preference between two rankings based only on interaction data. A limiting factor is that it can compare only to a single exploratory ranker.

We propose an online learning to rank algorithm called multileave gradient descent (MGD) that extends DBGD to learn from so-called multileaved comparison methods that can compare a set of rankings instead of merely a pair. We show experimentally that MGD allows for better selection of candidates than DBGD without the need for more comparisons involving users. An important implication of our results is that orders of magnitude less user interaction data is required to find good rankers when multileaved comparisons are used within online learning to rank. Hence, fewer users need to be exposed to possibly inferior rankers and our method allows search engines to adapt more quickly to changes in user preferences.
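The MGD update loop can be sketched as follows; the oracle that simulates multileaved feedback, the hidden target ranker, and all step sizes are illustrative assumptions standing in for real user interactions:

```python
import random

def mgd_update(weights, oracle, n_candidates=5, delta=0.2, alpha=0.1, rng=random):
    """One multileave gradient descent step: perturb the current ranker in
    several random unit directions at once, let a multileaved comparison
    (here: the `oracle` callback) pick the candidates users prefer, and
    move toward the mean of the winning directions."""
    dim = len(weights)
    dirs = []
    for _ in range(n_candidates):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = sum(x * x for x in u) ** 0.5
        dirs.append([x / norm for x in u])
    candidates = [[w + delta * d for w, d in zip(weights, u)] for u in dirs]
    winners = oracle(weights, candidates)   # indices preferred over current
    if not winners:
        return weights                      # production ranker won; no update
    mean_dir = [sum(dirs[i][k] for i in winners) / len(winners)
                for k in range(dim)]
    return [w + alpha * d for w, d in zip(weights, mean_dir)]

# Simulated preferences: a candidate "wins" if it is closer to a hidden
# target ranker, a stand-in for real multileaved click feedback.
target = [1.0, -0.5, 0.3]
def oracle(current, candidates):
    def dist(v): return sum((a - b) ** 2 for a, b in zip(v, target)) ** 0.5
    return [i for i, c in enumerate(candidates) if dist(c) < dist(current)]

rng = random.Random(7)
w = [0.0, 0.0, 0.0]
for _ in range(300):
    w = mgd_update(w, oracle, rng=rng)
print([round(x, 2) for x in w])  # approaches the hidden target
```

DBGD is the special case `n_candidates=1`; the point of multileaving is that each user interaction adjudicates several exploratory directions at once.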

CIKM 2015 papers online

Two contributions to CIKM 2015 are online now:

  • Tom Kenter and Maarten de Rijke. Short text similarity with word embeddings. In CIKM 2015: 24th ACM Conference on Information and Knowledge Management. ACM, October 2015. Bibtex, PDF
    @inproceedings{kenter-short-2015,
    Author = {Kenter, Tom and de Rijke, Maarten},
    Booktitle = {CIKM 2015: 24th ACM Conference on Information and Knowledge Management},
    Date-Added = {2015-07-03 21:25:41 +0000},
    Date-Modified = {2015-07-17 10:56:14 +0000},
    Month = {October},
    Publisher = {ACM},
    Title = {Short text similarity with word embeddings},
    Year = {2015}}

Determining semantic similarity between texts is important in many tasks in information retrieval such as search, query suggestion, automatic summarization and image finding. Many approaches have been suggested, based on lexical matching, handcrafted patterns, syntactic parse trees, external sources of structured semantic knowledge and distributional semantics. However, lexical features, like string matching, do not capture semantic similarity beyond a trivial level. Furthermore, handcrafted patterns and external sources of structured semantic knowledge cannot be assumed to be available in all circumstances and for all domains. Lastly, approaches depending on parse trees are restricted to syntactically well-formed texts, typically of one sentence in length. We investigate whether determining short text similarity is possible using only semantic features—where by semantic we mean, pertaining to a representation of meaning—rather than relying on similarity in lexical or syntactic representations. We use word embeddings, vector representations of terms, computed from unlabelled data, that represent terms in a semantic space in which proximity of vectors can be interpreted as semantic similarity. We propose to go from word-level to text-level semantics by combining insights from methods based on external sources of semantic knowledge with word embeddings. A novel feature of our approach is that an arbitrary number of word embedding sets can be incorporated. We derive multiple types of meta-features from the comparison of the word vectors for short text pairs, and from the vector means of their respective word embeddings. The features representing labelled short text pairs are used to train a supervised learning algorithm. We use the trained model at testing time to predict the semantic similarity of new, unlabelled pairs of short texts. 
We show on a publicly available evaluation set commonly used for the task of semantic similarity that our method outperforms baseline methods that work under the same conditions.
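The simplest of the ingredients above, comparing the vector means of two short texts, can be sketched directly; the tiny 2-d embeddings are made up, whereas real systems use vectors trained on large unlabelled corpora:

```python
import math

def mean_vector(text, embeddings):
    """Average the word vectors of the in-vocabulary terms of a short text."""
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

emb = {"cat": [1.0, 0.1], "feline": [0.9, 0.2],
       "stock": [0.1, 1.0], "market": [0.2, 0.9]}
sim_related = cosine(mean_vector("cat feline", emb), mean_vector("feline cat", emb))
sim_unrelated = cosine(mean_vector("cat feline", emb), mean_vector("stock market", emb))
print(round(sim_related, 2), round(sim_unrelated, 2))  # → 1.0 0.31
```

The paper goes beyond this single number by deriving many such meta-features, over multiple embedding sets, and feeding them to a supervised learner.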

  • Tom Kenter, Pim Huijnen, Melvin Wevers, and Maarten de Rijke. Ad hoc monitoring of vocabulary shifts over time. In CIKM 2015: 24th ACM Conference on Information and Knowledge Management. ACM, October 2015. Bibtex, PDF
    @inproceedings{kenter-ad-hoc-2015,
    Author = {Kenter, Tom and Huijnen, Pim and Wevers, Melvin and de Rijke, Maarten},
    Booktitle = {CIKM 2015: 24th ACM Conference on Information and Knowledge Management},
    Date-Added = {2015-07-03 21:23:10 +0000},
    Date-Modified = {2016-02-21 12:47:09 +0000},
    Month = {October},
    Publisher = {ACM},
    Title = {Ad hoc monitoring of vocabulary shifts over time},
    Year = {2015}}

Word meanings change over time. Detecting shifts in meaning for particular words has been the focus of much research recently. We address the complementary problem of monitoring shifts in vocabulary over time. That is, given a small seed set of words, we are interested in monitoring which terms are used over time to refer to the underlying concept denoted by the seed words. In this paper, we propose an algorithm for monitoring shifts in vocabulary over time, given a small set of seed terms. We use distributional semantic methods to infer a series of semantic spaces over time from a large body of time-stamped unstructured textual documents. We construct semantic networks of terms based on their representation in the semantic spaces and use graph-based measures to calculate saliency of terms. Based on the graph-based measures we produce ranked lists of terms that represent the concept underlying the initial seed terms over time as final output. As the task of monitoring shifting vocabularies over time for an ad hoc set of seed words is, to the best of our knowledge, a new one, we construct our own evaluation set. Our main contributions are the introduction of the task of ad hoc monitoring of vocabulary shifts over time, the description of an algorithm for tracking shifting vocabularies over time given a small set of seed words, and a systematic evaluation of results over a substantial period of time (over four decades). Additionally, we make our newly constructed evaluation set publicly available.
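As a rough sketch of the graph-based ranking step, here is one simple saliency proxy: score each non-seed term by its total similarity-graph edge weight to the seed set. The similarity numbers and the threshold are toy assumptions, and the paper's actual graph measures are richer than this:

```python
def salient_terms(similarity, seed_terms, threshold=0.5, k=3):
    """Rank non-seed terms by their total edge weight to the seed set in a
    term graph whose edges are embedding similarities above `threshold`,
    a simple stand-in for per-period graph-based saliency."""
    scores = {}
    for term, sims in similarity.items():
        if term in seed_terms:
            continue
        w = sum(s for u, s in sims.items() if u in seed_terms and s > threshold)
        if w > 0.0:
            scores[term] = w
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Toy pairwise similarities for one time slice (illustrative numbers).
sims = {
    "atomic":  {"nuclear": 0.9, "energy": 0.6},
    "reactor": {"nuclear": 0.7, "atomic": 0.6},
    "weather": {"nuclear": 0.1},
}
print(salient_terms(sims, seed_terms={"nuclear"}))  # → ['atomic', 'reactor']
```

Running this per time-sliced semantic space yields a ranked vocabulary for the seed concept at each point in time, which is the shape of the final output described above.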


© 2017 Maarten de Rijke
