N = 2,000            k=4 (V=4E-3)         k=8 (V=8E-3)         k=10 (V=1E-2)
P(H0) = 95%   P̂(H0)  0.962 (0.008)        0.953 (0.006)        0.953 (0.006)
              V̂      4.23E-3 (0.55E-3)    7.76E-3 (0.82E-3)    9.66E-3 (0.98E-3)
P(H0) = 90%   P̂(H0)  0.909 (0.012)        0.903 (0.009)        0.903 (0.008)
              V̂      3.81E-3 (0.31E-3)    7.42E-3 (0.48E-3)    9.28E-3 (0.58E-3)
P(H0) = 80%   P̂(H0)  0.798 (0.015)        0.802 (0.011)        0.802 (0.011)
              V̂      3.89E-3 (0.18E-3)    7.83E-3 (0.31E-3)    9.80E-3 (0.37E-3)
P(H0) = 50%   P̂(H0)  0.487 (0.020)        0.491 (0.015)        0.492 (0.014)
              V̂      3.88E-3 (0.11E-3)    7.77E-3 (0.19E-3)    9.73E-3 (0.23E-3)
Table 2: Mixture of H1 and H0. N = 2,000.
N = 200              k=4 (V=4E-3)         k=8 (V=8E-3)         k=10 (V=1E-2)
P(H0) = 95%   P̂(H0)  0.965 (0.019)        0.963 (0.016)        0.962 (0.016)
              V̂      4.44E-3 (1.04E-3)    8.67E-3 (2.04E-3)    1.07E-2 (0.25E-2)
P(H0) = 90%   P̂(H0)  0.925 (0.034)        0.907 (0.026)        0.908 (0.025)
              V̂      3.62E-3 (0.75E-3)    6.80E-3 (1.12E-3)    8.55E-3 (1.45E-3)
P(H0) = 80%   P̂(H0)  0.869 (0.042)        0.843 (0.033)        0.835 (0.033)
              V̂      3.94E-3 (0.69E-3)    7.40E-3 (1.10E-3)    9.05E-3 (1.33E-3)
P(H0) = 50%   P̂(H0)  0.594 (0.067)        0.518 (0.051)        0.506 (0.047)
              V̂      3.44E-3 (0.37E-3)    6.42E-3 (0.59E-3)    7.94E-3 (0.71E-3)
Table 3: Mixture of H1 and H0. N = 200.
P(H1) = p ranges from as much as 70% to less than 1%. The ordering of these p values across metrics aligns well with our perception of how frequently each metric truly moved; for example, metrics like page loading time moved much more often than user engagement metrics such as visits per user. For most metrics p is below 20%, because the scale of Bing experimentation allows us to test aggressively with ideas that have a low success rate. We also used P(Flat) in the P-Assessment, looking only at metrics with P(Flat) < 20%, and found it very effective in controlling the FDR. Compared with other FDR methods [1; 24], our method is the first that takes advantage of metric-specific prior information.
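As an illustration of how a null probability p and prior variance V can be learned from a pool of historical experiment results, here is a minimal sketch in the spirit of the EM algorithm [5]. It uses a simplified two-group model, x_i ~ p·N(0, σ²) + (1−p)·N(0, σ²+V) with known sampling variance σ², rather than the k-point mixtures behind Tables 2 and 3; the function name and simulation settings are our own, not the paper's.

```python
import numpy as np

def em_two_group(x, sigma2, n_iter=200, tol=1e-10):
    """EM for the two-group model x_i ~ p*N(0, sigma2) + (1-p)*N(0, sigma2+V).

    Returns estimates (p, V) of the null probability P(H0) and the
    prior variance V of the treatment effect under H1.
    """
    p, V = 0.5, np.var(x)  # crude initial values
    for _ in range(n_iter):
        # E-step: posterior probability that each observation is null
        f0 = np.exp(-0.5 * x**2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
        v1 = sigma2 + V
        f1 = np.exp(-0.5 * x**2 / v1) / np.sqrt(2 * np.pi * v1)
        r = p * f0 / (p * f0 + (1 - p) * f1)
        # M-step: update mixing weight and alternative-component variance
        p_new = r.mean()
        V_new = max((1 - r) @ x**2 / (1 - r).sum() - sigma2, 1e-12)
        if abs(p_new - p) < tol and abs(V_new - V) < tol:
            p, V = p_new, V_new
            break
        p, V = p_new, V_new
    return p, V

# Simulate 2,000 experiments: 95% null, V = 4e-3, sampling variance 1e-3,
# roughly matching the first row of Table 2.
rng = np.random.default_rng(0)
N, p_true, V_true, sigma2 = 2000, 0.95, 4e-3, 1e-3
null = rng.random(N) < p_true
mu = np.where(null, 0.0, rng.normal(0.0, np.sqrt(V_true), N))
x = rng.normal(mu, np.sqrt(sigma2))
p_hat, V_hat = em_two_group(x, sigma2)
```

The estimates typically land near the simulation truth, mirroring the pattern of small upward/downward deviations reported in the tables.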
6. CONCLUSION AND FUTURE WORK
In this paper we proposed an objective Bayesian A/B testing framework. The framework is applicable when hundreds or thousands of historical experiment results are available, which we hope will soon be common in this big-data era. A natural and important question is how to pick such a set of historical experiments. In principle, when analyzing a new experiment, we want to use only similar historical experiments for prior learning. Similarity can be judged by product area, feature team and other side information. However, if we impose too many selection criteria, we eventually face the problem of not having enough historical experiments for accurate prior estimation, similar to the cold-start problem. One solution is to use the prior learned from all experiments as a baseline global prior, so that the prior for a subtype of experiment is a weighted combination of this global prior and the prior learned from the (possibly small) restricted set of historical data. This can be done via hierarchical Bayes, i.e. by putting a prior on the prior. Other future work includes using a more general exponential family for π(µ), and using the one-group model as in [11].
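The global-prior-plus-restricted-set combination described above can be sketched as a simple conjugate shrinkage. This is our own minimal illustration, not the paper's method: the function name, the Beta parameterization and the strength parameter are assumptions.

```python
def shrunk_null_prob(n_null_sub, n_sub, p_global, strength=50.0):
    """Posterior mean of a subgroup's null rate p under a
    Beta(strength * p_global, strength * (1 - p_global)) prior,
    i.e. the subgroup estimate shrunk toward the global prior.
    `strength` plays the role of the weight on the global prior.
    """
    a = strength * p_global + n_null_sub                   # prior + observed nulls
    b = strength * (1 - p_global) + (n_sub - n_null_sub)   # prior + observed non-nulls
    return a / (a + b)

# A feature team with only 10 historical experiments, 9 of them null,
# against a global null rate of 80%: the combined estimate stays close
# to 0.8 instead of jumping to the noisy subgroup value 0.9.
p_sub = shrunk_null_prob(9, 10, 0.80)  # (40 + 9) / (50 + 10) ≈ 0.817
```

As the subgroup accumulates its own history, n_sub grows and the estimate is dominated by the subgroup data, which is exactly the cold-start behavior one wants.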
References
[1] Benjamini, Y. and Hochberg, Y. [1995], ‘Controlling the false discovery rate: a practical and powerful approach to multiple testing’, J. R. Stat. Soc. Ser. B pp. 289–300.
[2] Berger, J. [2006], ‘The case for objective Bayesian analysis’, Bayesian Anal. 1(3), 385–402.
[3] Berger, J. O. and Bayarri, M. J. [2004], ‘The Interplay of Bayesian and Frequentist Analysis’, Stat. Sci. 19(1), 58–80.
[4] Berger, J. O. and Wolpert, R. L. [1988], The Likelihood Principle.
[5] Dempster, A. P., Laird, N. M. and Rubin, D. B. [1977], ‘Maximum likelihood from incomplete data via the EM algorithm’, J. R. Stat. Soc. Ser. B 39(1), 1–38.
[6] Deng, A. and Hu, V. [2015], Diluted Treatment Effect Estimation for Trigger Analysis in Online Controlled Experiments, in ‘Proc. 8th ACM Int. Conf. Web Search Data Min.’.
[7] Deng, A., Li, T. and Guo, Y. [2014], Statistical Inference in Two-stage Online Controlled Experiments with Treatment Selection and Validation, in ‘Proc. 23rd Int. Conf. World Wide Web’, WWW ’14, pp. 609–618.
[8] Deng, A., Xu, Y., Kohavi, R. and Walker, T. [2013], Improving the sensitivity of online controlled experiments by utilizing pre-experiment data, in ‘Proc. 6th ACM Int. Conf. Web Search Data Min.’, ACM, pp. 123–132.
[9] Efron, B. [1986], ‘Why isn’t everyone a Bayesian?’, Am. Stat. 40(1), 1–5.
[10] Efron, B. [2010], Large-scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, Vol. 1, Cambridge University Press.
[11] Efron, B. [2011], ‘Tweedie’s formula and selection bias’, J. Am. Stat. Assoc. 106(496), 1602–1614.
[12] Efron, B. [2013a], ‘A 250-year argument: belief, behavior, and the bootstrap’, Bull. Am. Math. Soc. 50(1), 129–146.
[13] Efron, B. [2013b], Empirical Bayes modeling, computation, and accuracy, Technical report.
[14] Efron, B. [2014], ‘Frequentist accuracy of Bayesian estimates’, J. R. Stat. Soc. Ser. B.
[15] Gelman, A. and Hill, J. [2006], Data Analysis Using Regression and Multilevel/Hierarchical Models, Cambridge University Press.
[16] Johnson, V. E. [2013], ‘Revised standards for statistical evidence’, Proc. Natl. Acad. Sci. 110(48), 19313–19317.
[17] Kass, R. and Raftery, A. [1995], ‘Bayes factors’, J. Am. Stat. Assoc. 90(430), 773–795.
[18] Kohavi, R., Deng, A., Frasca, B., Longbotham, R., Walker, T. and Xu, Y. [2012], ‘Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained’, Proc. 18th Conf. Knowl. Discov. Data Min.
[19] Kohavi, R., Deng, A., Frasca, B., Xu, Y., Walker, T. and Pohlmann, N. [2013], ‘Online Controlled Experiments at Large Scale’, Proc. 19th Conf. Knowl. Discov. Data Min.
[20] Kohavi, R., Deng, A., Longbotham, R. and Xu, Y. [2014], Seven rules of thumb for web site experimenters, in ‘Proc. 20th Conf. Knowl. Discov. Data Min.’, KDD ’14, New York, USA, pp. 1857–1866.
[21] Kohavi, R., Longbotham, R., Sommerfield, D. and Henne, R. M. [2009], ‘Controlled Experiments on the Web: survey and practical guide’, Data Min. Knowl. Discov. 18, 140–181.
[22] Kruschke, J. K. [2013], ‘Bayesian estimation supersedes the t test’, J. Exp. Psychol. Gen. 142(2), 573.
[23] Masson, M. E. J. [2011], ‘A tutorial on a practical Bayesian alternative to null-hypothesis significance testing’, Behav. Res. Methods 43(3), 679–690.
[24] Muller, P., Parmigiani, G. and Rice, K. [2006], FDR and Bayesian multiple comparisons rules, in ‘8th World Meet. Bayesian Stat.’.
[25] Murphy, K. P. [2012], Machine Learning: A Probabilistic Perspective, MIT Press.
[26] Park, T. and Casella, G. [2008], ‘The Bayesian lasso’, J. Am. Stat. Assoc. 103(482), 681–686.
[27] Rouder, J. N. [2014], ‘Optional stopping: no problem for Bayesians’, Psychon. Bull. Rev. 21, 301–308.
[28] Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D. and Iverson, G. [2009], ‘Bayesian t tests for accepting and rejecting the null hypothesis’, Psychon. Bull. Rev. 16(2), 225–237.
[29] Scott, J. and Berger, J. [2006], ‘An exploration of aspects of Bayesian multiple testing’, J. Stat. Plan. Inference.
[30] Sellke, T., Bayarri, M. and Berger, J. [2001], ‘Calibration of p values for testing precise null hypotheses’, Am. Stat. 55(1), 62–71.
[31] Senn, S. [2008], ‘A note concerning a selection “paradox” of Dawid’s’, Am. Stat. 62(3), 206–210.
[32] Student [1908], ‘The probable error of a mean’, Biometrika 6, 1–25.
[33] Tang, D., Agarwal, A., O’Brien, D. and Meyer, M. [2010], ‘Overlapping Experiment Infrastructure: More, Better, Faster Experimentation’, Proc. 16th Conf. Knowl. Discov. Data Min.