Evaluation has always been the cornerstone of scientific development. Scientists come up with hypotheses (models) to explain physical phenomena, and validate these models by comparing their output to observations in nature. A scientific field then consists merely of a collection of hypotheses that have not (yet) been disproved when compared to nature. Evaluation plays the same key role in the field of information retrieval. Researchers and practitioners develop models to explain the relation between an information need expressed by a person and the information contained in available resources, and test these models by comparing their outcomes to collections of observations.
This article is a short survey of the methods, measures, and designs used in the field of Information Retrieval to evaluate the quality of search algorithms (i.e., the implementations of a model) against collections of observations. The phrase “search quality” has more than one interpretation; here I will only discuss one of them, the effectiveness of a search algorithm in finding the information requested by a user. There are two types of collections of observations used for the purpose of evaluation: (a) relevance annotations, and (b) observable user behaviour. I will call the evaluation framework based on the former collection-based evaluation, and the one based on the latter in-situ evaluation. This survey is far from complete; it only presents my personal viewpoint on the recent developments in the field.
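To make the collection-based setting concrete, the following is a minimal sketch of how a ranked result list can be scored against relevance annotations. The measure shown, precision at k, is only one of many effectiveness measures discussed later; the query and document identifiers are hypothetical, chosen for illustration.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are annotated as relevant."""
    top_k = ranking[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

# Hypothetical relevance annotations (qrels) for one query:
# the set of documents judged relevant to query "q1".
qrels = {"q1": {"d2", "d5", "d7"}}

# A search algorithm's ranked output for that query.
ranking = ["d2", "d9", "d5", "d1", "d3"]

# Two of the top five documents are relevant.
print(precision_at_k(ranking, qrels["q1"], 5))  # → 0.4
```

In-situ evaluation, by contrast, replaces the annotated set with signals derived from observed user behaviour, such as clicks, and is discussed separately below.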