Two contributions to CIKM 2015 are online now:
- Tom Kenter and Maarten de Rijke. Short text similarity with word embeddings. In CIKM 2015: 24th ACM Conference on Information and Knowledge Management. ACM, October 2015.
@inproceedings{kenter-short-2015,
  author = {Kenter, Tom and de Rijke, Maarten},
  booktitle = {CIKM 2015: 24th ACM Conference on Information and Knowledge Management},
  date-added = {2015-07-03 21:25:41 +0000},
  date-modified = {2015-07-17 10:56:14 +0000},
  month = {October},
  publisher = {ACM},
  title = {Short text similarity with word embeddings},
  year = {2015}
}
Determining semantic similarity between texts is important in many tasks in information retrieval such as search, query suggestion, automatic summarization and image finding. Many approaches have been suggested, based on lexical matching, handcrafted patterns, syntactic parse trees, external sources of structured semantic knowledge and distributional semantics. However, lexical features, like string matching, do not capture semantic similarity beyond a trivial level. Furthermore, handcrafted patterns and external sources of structured semantic knowledge cannot be assumed to be available in all circumstances and for all domains. Lastly, approaches depending on parse trees are restricted to syntactically well-formed texts, typically of one sentence in length. We investigate whether determining short text similarity is possible using only semantic features—where by semantic we mean, pertaining to a representation of meaning—rather than relying on similarity in lexical or syntactic representations. We use word embeddings, vector representations of terms, computed from unlabelled data, that represent terms in a semantic space in which proximity of vectors can be interpreted as semantic similarity. We propose to go from word-level to text-level semantics by combining insights from methods based on external sources of semantic knowledge with word embeddings. A novel feature of our approach is that an arbitrary number of word embedding sets can be incorporated. We derive multiple types of meta-features from the comparison of the word vectors for short text pairs, and from the vector means of their respective word embeddings. The features representing labelled short text pairs are used to train a supervised learning algorithm. We use the trained model at testing time to predict the semantic similarity of new, unlabelled pairs of short texts. 
We show on a publicly available evaluation set commonly used for the task of semantic similarity that our method outperforms baseline methods that work under the same conditions.
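To make the idea of pair-level features concrete, here is a minimal sketch in Python. It is not the authors' implementation: the toy `EMBEDDINGS` table, the whitespace tokenisation, the bin edges and the function names are all assumptions made for illustration. It computes one feature from the vector means of the two texts and a small histogram over word-to-word similarities; feature vectors of this kind could then be passed to any supervised learner.

```python
import numpy as np

# Hypothetical toy embeddings; a real setup would load one or more
# pretrained embedding sets instead.
EMBEDDINGS = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "kitten": np.array([0.8, 0.2, 0.1]),
    "sits": np.array([0.1, 0.9, 0.0]),
    "rests": np.array([0.2, 0.8, 0.1]),
    "stock": np.array([0.0, 0.1, 0.9]),
    "falls": np.array([0.1, 0.3, 0.8]),
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def vectors(text):
    """Look up vectors for the in-vocabulary tokens of a text (assumed tokenisation)."""
    return [EMBEDDINGS[t] for t in text.lower().split() if t in EMBEDDINGS]


def pair_features(text_a, text_b, bins=(-1.0, 0.5, 0.9, 1.0001)):
    """Build a small feature vector for a pair of short texts.

    Feature 0: cosine similarity of the two mean vectors.
    Features 1..: fraction of words in text_a whose best match in text_b
    falls into each similarity bin (bin edges are an arbitrary choice).
    """
    a, b = vectors(text_a), vectors(text_b)
    if not a or not b:
        raise ValueError("both texts need at least one in-vocabulary word")
    mean_similarity = cosine(np.mean(a, axis=0), np.mean(b, axis=0))
    best_matches = [max(cosine(u, v) for v in b) for u in a]
    histogram, _ = np.histogram(best_matches, bins=bins)
    return np.concatenate(([mean_similarity], histogram / len(a)))


similar = pair_features("cat sits", "kitten rests")
dissimilar = pair_features("cat sits", "stock falls")
```

In this toy example the mean-vector feature of `similar` comes out higher than that of `dissimilar`. Supporting several embedding sets, as the abstract describes, would amount to computing such features once per set and concatenating them.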
- Tom Kenter, Pim Huijnen, Melvin Wevers, and Maarten de Rijke. Ad hoc monitoring of vocabulary shifts over time. In CIKM 2015: 24th ACM Conference on Information and Knowledge Management. ACM, October 2015.
@inproceedings{kenter-ad-hoc-2015,
  author = {Kenter, Tom and Huijnen, Pim and Wevers, Melvin and de Rijke, Maarten},
  booktitle = {CIKM 2015: 24th ACM Conference on Information and Knowledge Management},
  date-added = {2015-07-03 21:23:10 +0000},
  date-modified = {2016-02-21 12:47:09 +0000},
  month = {October},
  publisher = {ACM},
  title = {Ad hoc monitoring of vocabulary shifts over time},
  year = {2015}
}
Word meanings change over time. Detecting shifts in meaning for particular words has been the focus of much research recently. We address the complementary problem of monitoring shifts in vocabulary over time. That is, given a small seed set of words, we are interested in monitoring which terms are used over time to refer to the underlying concept denoted by the seed words. In this paper, we propose an algorithm for monitoring shifts in vocabulary over time, given a small set of seed terms. We use distributional semantic methods to infer a series of semantic spaces over time from a large body of time-stamped unstructured textual documents. We construct semantic networks of terms based on their representation in the semantic spaces and use graph-based measures to calculate saliency of terms. Based on the graph-based measures we produce ranked lists of terms that represent the concept underlying the initial seed terms over time as final output. As the task of monitoring shifting vocabularies over time for an ad hoc set of seed words is, to the best of our knowledge, a new one, we construct our own evaluation set. Our main contributions are the introduction of the task of ad hoc monitoring of vocabulary shifts over time, the description of an algorithm for tracking shifting vocabularies over time given a small set of seed words, and a systematic evaluation of results over a substantial period of time (over four decades). Additionally, we make our newly constructed evaluation set publicly available.
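The pipeline outlined in the abstract (per-period semantic spaces, a semantic network, graph-based ranking) can be sketched roughly as follows. This is an illustration under stated assumptions, not the algorithm from the paper: the toy `SPACES` data, the similarity threshold and the use of a personalised PageRank as the graph measure are all choices made here for the example.

```python
import numpy as np

# Hypothetical toy semantic spaces, one per time period. In practice each
# space would be inferred from the documents time-stamped in that period.
SPACES = {
    1950: {"radio": [1.0, 0.0], "wireless": [0.9, 0.1],
           "broadcast": [0.8, 0.3], "bicycle": [0.0, 1.0]},
    1990: {"radio": [1.0, 0.0], "fm": [0.95, 0.1],
           "broadcast": [0.7, 0.4], "bicycle": [0.0, 1.0]},
}


def rank_terms(space, seeds, threshold=0.8, damping=0.85, iterations=50):
    """Rank non-seed terms by a personalised PageRank over a similarity graph.

    Nodes are terms; an edge links two terms whose cosine similarity is at
    least `threshold` (an assumed cut-off). The random walk restarts at the
    seed terms, so terms close to the seeds score highest.
    """
    terms = sorted(space)
    matrix = np.array([space[t] for t in terms], dtype=float)
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
    similarities = matrix @ matrix.T
    adjacency = np.where(similarities >= threshold, similarities, 0.0)
    np.fill_diagonal(adjacency, 0.0)

    # Column-normalise; isolated terms keep an all-zero column and end at 0.
    column_sums = adjacency.sum(axis=0)
    column_sums[column_sums == 0] = 1.0
    transition = adjacency / column_sums

    restart = np.array([1.0 if t in seeds else 0.0 for t in terms])
    if restart.sum() == 0:
        raise ValueError("no seed term occurs in this semantic space")
    restart /= restart.sum()

    scores = restart.copy()
    for _ in range(iterations):
        scores = (1 - damping) * restart + damping * transition @ scores

    order = np.argsort(-scores)
    return [(terms[i], float(scores[i])) for i in order if terms[i] not in seeds]


# One ranked list per period for the seed set {"radio"}.
rankings = {year: rank_terms(space, {"radio"}) for year, space in SPACES.items()}
```

Comparing `rankings[1950]` with `rankings[1990]` shows the kind of output the task asks for: the terms near the seed differ per period (here "wireless" versus "fm"), while an unrelated term such as "bicycle" stays at the bottom of both lists.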