Power Analysis for Interleaving Experiments by Means of Offline Evaluation

Evaluation is key to building effective, efficient, and usable information retrieval (IR) systems, as it enables their effectiveness to be quantified. For decades the primary approach to IR evaluation was the Cranfield approach, which makes use of test collections and allows systematic and repeatable evaluations to be carried out in a controlled manner. The Cranfield approach has often been criticized for not quantifying the actual user experience when users are served by the search algorithm under test. Online evaluation, on the other hand (that is, A/B testing and interleaving), is performed on the basis of user interactions with the ranked lists of results produced by the algorithms being tested.

Designing an experiment that enables the discovery of statistically significant effects is of paramount importance in both the online and the offline setting. Power analysis makes it possible to determine the sample size (i.e., the number of measurements collected) required to detect an effect of a given size with a given degree of confidence.
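As a minimal sketch of what such a calculation looks like, the snippet below computes the sample size for a two-sided, paired (one-sample) z-test from a standardized effect size, significance level, and desired power. The function name and default values are our own illustration, not anything prescribed in the paper.

```python
from math import ceil
from scipy.stats import norm

def required_sample_size(effect_size, alpha=0.05, power=0.8):
    """Measurements needed to detect a standardized effect (Cohen's d)
    with a two-sided, paired/one-sample z-test at the given alpha and power."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value of the two-sided test
    z_beta = norm.ppf(power)           # quantile matching the desired power
    return ceil(((z_alpha + z_beta) / effect_size) ** 2)

# e.g. required_sample_size(0.1) -> 785 measurements for a small effect
```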

Conducting a power analysis, however, requires setting the effect size to be detected prior to the experiment. This often leads to suboptimal results: the actual effect size observed after the experiment has been run may be larger than the one specified, in which case fewer measurements would have sufficed than the number dictated by the power analysis.

In this research we make use of the funnel approach often used in information retrieval evaluation, according to which an experimental algorithm is compared to a baseline/production algorithm in an online experiment only if it outperforms the production algorithm in an offline experiment. The question we answer in this work is whether the evaluation results of the offline experiment can inform the number of impressions required, and hence the duration, of an online interleaving experiment. In other words, given that two systems have ΔM = α in an offline experiment, with M being any evaluation measure, can we determine the number of impressions required for an interleaving experiment to conclude that the two systems are statistically significantly different?
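To make the quantity concrete, an interleaving experiment is often analyzed as a per-impression preference (a "win" for one system or the other) tested against the no-preference rate of 0.5; the sketch below estimates the impressions needed for such a one-sample proportion test. The function, its defaults, and the example win rate are hypothetical illustrations, not the procedure used in the paper.

```python
from math import ceil, sqrt
from scipy.stats import norm

def impressions_for_interleaving(p_win, alpha=0.05, power=0.8):
    """Impressions needed for a two-sided one-sample proportion test of the
    per-impression win rate p_win of the experimental system against the
    null value 0.5 (no preference between the interleaved systems)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    delta = abs(p_win - 0.5)
    n = (z_alpha * 0.5 + z_beta * sqrt(p_win * (1 - p_win))) ** 2 / delta ** 2
    return ceil(n)

# e.g. a system preferred on 52% of impressions:
# impressions_for_interleaving(0.52) -> roughly 4,900 impressions
```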

Our main findings are: (1) there is a strong correlation between the margin of the quality difference in offline evaluation and the required number of impressions in online evaluation, which is to be expected; but also (2) there is large variability in the number of required impressions for the same margin. This indicates that the results of an offline experiment can inform the design of an online experiment, and in particular its power: when evaluating a new experimental system, it is beneficial to first conduct an offline evaluation and then, based on the results achieved offline, estimate the duration of the online experiment.
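One hedged illustration of how such an estimate could be made, using entirely made-up numbers rather than the paper's data, is to regress the (log) impressions needed in past interleaving experiments on the offline margin of the corresponding system pairs and read off a rough point estimate for a new candidate; given finding (2), any such estimate should be accompanied by wide uncertainty bounds.

```python
import numpy as np

# Hypothetical historical record: offline margin (e.g. a nDCG difference) of
# past system pairs and the impressions their interleaving experiments needed
# to reach statistical significance. These values are illustrative only.
offline_margin = np.array([0.005, 0.01, 0.02, 0.04, 0.08])
impressions_needed = np.array([150_000, 60_000, 18_000, 5_000, 1_500])

# Fit log(impressions) as a linear function of the offline margin.
coeffs = np.polyfit(offline_margin, np.log(impressions_needed), deg=1)

def estimated_impressions(delta_m):
    """Rough point estimate of impressions for a new system with offline margin delta_m."""
    return float(np.exp(np.polyval(coeffs, delta_m)))

# e.g. estimated_impressions(0.03) -> a coarse estimate used to plan experiment duration
```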

For more information, take a look at the following paper:

Hosein Azarbonyad and Evangelos Kanoulas, “Power Analysis for Interleaving Experiments by Means of Offline Evaluation”, in Proceedings of the 2nd ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR ’16), 2016.