19/10/2020

What rankers can be statistically distinguished in multileaved comparisons?

Makoto P. Kato, Akiomi Nishida, Tomohiro Manabe, Sumio Fujita, Takehiro Yamamoto

Keywords: multileaving, online evaluation, interleaving

Abstract: This paper presents findings from an empirical study of multileaved comparisons, an efficient online evaluation methodology, in a commercial Web service. The most important difference from previous studies is the number of rankers involved in the online evaluation: we compared 30 rankers over roughly 90 days by multileaved comparisons. This relatively large number of rankers allowed us to address several questions that previous work could not answer with only a few rankers: How much ranking difference is required for rankers to be statistically distinguished? How many impressions are necessary to find statistically significant differences between correlated rankers? How large a difference in offline evaluation predicts a significant difference in a multileaved comparison? We answer these questions using the results of the multileaved comparisons and generalize some of the findings through simulation-based experiments.
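The paper does not specify in the abstract which multileaving algorithm was used, but team-draft multileaving is a common choice for comparing many rankers online. The sketch below is a minimal, hypothetical illustration of that idea: several rankings are merged into one result list, each slot is tagged with the ranker that contributed it, and clicks are credited back to those rankers. All names and the click handling are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import defaultdict

def team_draft_multileave(rankings, length):
    """Merge several rankings into one multileaved list (team-draft style).

    rankings: dict mapping ranker name -> ranked list of document ids.
    Returns the multileaved list and the ranker credited for each slot.
    """
    multileaved, teams, seen = [], [], set()
    while len(multileaved) < length:
        added = False
        # Each round, rankers take turns in a fresh random order.
        for ranker in random.sample(list(rankings), len(rankings)):
            if len(multileaved) >= length:
                break
            # Pick the ranker's highest-ranked document not yet shown.
            doc = next((d for d in rankings[ranker] if d not in seen), None)
            if doc is None:
                continue
            multileaved.append(doc)
            teams.append(ranker)
            seen.add(doc)
            added = True
        if not added:
            break  # all rankers exhausted

    return multileaved, teams

def credit_clicks(teams, clicked_positions):
    """Assign one credit to the ranker owning each clicked position."""
    credits = defaultdict(int)
    for pos in clicked_positions:
        credits[teams[pos]] += 1
    return credits

# Example: three rankers, a 5-slot result page, clicks on positions 0 and 2.
rankings = {
    "A": ["d1", "d2", "d3", "d4"],
    "B": ["d2", "d5", "d1", "d6"],
    "C": ["d7", "d1", "d8", "d2"],
}
docs, teams = team_draft_multileave(rankings, length=5)
print(docs, teams, dict(credit_clicks(teams, [0, 2])))
```

Aggregating such per-impression credits over many impressions is what lets the study test whether two rankers are statistically distinguishable.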

The video of this talk cannot be embedded. You can watch it here:
https://dl.acm.org/doi/10.1145/3340531.3412143#sec-supp
The talk and the respective paper were published at the CIKM 2020 virtual conference.
