A/B Testing Best Practices
When I started this blog, my primary objective was less about teaching others A/B testing and more about clarifying my own thoughts on it. I had been running A/B tests for about a year, and I was starting to feel uncomfortable with some of the standard methodologies. It’s pretty common to use Student’s t-test to analyze A/B tests, for example. One of the assumptions underlying that test is that the distributions are Gaussian. “What about A/B testing is Gaussian?” I wondered. I knew there was a big difference between one-sided and two-sided tests, but I didn’t feel confident in my ability to choose the right one. And the multiple comparisons problem seemed to rear its ugly head at every turn: what was the best way to handle this?
At the same time, I saw coworkers and data scientists at other companies neglecting what, even to an A/B testing neophyte, were clearly important procedures, like deciding on the sample size and test duration before beginning the test. I was not able to properly articulate why that was so important, so I started writing out, in a statistically rigorous way, the what and why of A/B testing, and hence “Adventures in Why” was born. Through this, and through much reading, thinking, and dozens of statistical experiments, I developed a much greater confidence in my ability to plan and analyze A/B tests. This post is the culmination of many years of iteration, honing not just the statistical aspects of testing but, just as importantly, the business aspects.
Make a Plan
The single most critical component of a successful test is a good plan. Virtually every problem I have encountered with A/B testing could have been addressed with the right plan. Often people seem hesitant to plan, or to adhere to a plan. I think this comes from not knowing what a good plan entails, or why a good plan is so important.
A good plan entails just three things:
What experiences are being provided?
Specifically, how many experiment groups are there (are we doing an A/B test, an A/B/C test, or…)? What will the people in each group actually experience? Can we verify these experiences before actually starting the test? What about during or after the test?
What is the success criterion (singular)?
What about the KPI(s)? Multiple KPIs are fine (encouraged, even), but all stakeholders need to agree in advance on how they will factor into the decision. Otherwise we wind up with the scenario where the new experience is good for engagement but bad for revenue, and people start disagreeing about which matters more. Again, I encourage people to have all the KPIs needed to thoroughly assess the impact of a change, but at the end of the day there needs to be a clear decision strategy, agreed upon before running the test. Without one, the test will be inconclusive, because people will disagree about the right decision.
How long will we run the test?
Pro tip: “Well…until we get a statistically significant result” is not the correct answer. Look at the KPIs. What would be a meaningful lift to the business? What would be cause for concern? What kind of sample size is needed to detect that kind of lift, or rule out that kind of harm? Everyone wants to be able to detect a 0.1% lift in retention until they find out how long they would have to run the test. I like to provide three sample size recommendations, each reasonable on its own, along with the corresponding test sensitivities, and let my business partners decide (though the middle option is always my personal recommendation).
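For readers who want to see what this kind of calculation looks like, here is a rough sketch in Python using statsmodels’ normal-approximation power analysis. The baseline rate, candidate lifts, power, and significance level below are illustrative assumptions, not numbers from any real test, and any particular calculator may use a somewhat different method.

```python
# Sketch of a minimum-sample-size calculation for a success-rate KPI,
# using a normal-approximation power analysis (statsmodels). All inputs
# here are illustrative assumptions.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline = 0.10                       # current open rate
relative_lifts = [0.05, 0.10, 0.15]   # candidate minimum detectable effects
alpha, power = 0.05, 0.80             # two-sided test, 80% power

analysis = NormalIndPower()
for lift in relative_lifts:
    target = baseline * (1 + lift)
    effect = proportion_effectsize(target, baseline)  # Cohen's h
    n_per_group = analysis.solve_power(
        effect_size=effect, alpha=alpha, power=power,
        ratio=1.0, alternative="two-sided",
    )
    print(f"{lift:.0%} relative lift -> ~{n_per_group:,.0f} recipients per group")

# For these inputs, detecting a 10% relative lift (a 10% -> 11% open rate)
# requires on the order of 14,000-15,000 recipients per group.
```

Presenting a small menu like this, with each sample size tied to the lift it can detect, is what makes the trade-off concrete for business partners.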
And once we have settled on a sample size and test duration, resist the urge to deviate from it. If you start the test and one of the options is catastrophically worse than the other(s), by all means end the test, but the appropriate conclusion in that case is “this option was catastrophically worse”. Don’t bother calculating statistical significance, because those calculations assume the sample size was fixed in advance. A valid test is not worth the risk to business metrics!
If you have a plan that addresses the three points above, then every single one of your A/B tests will be a success, because…
We Learn Something from Every Experiment
Too many people, including data scientists, seem to think that if we “do not reject the null hypothesis”, the test has somehow failed. Thinking that the test is a success if we do reject the null hypothesis is just as bad!
The goal of a test is not to reject the null hypothesis, it’s to learn something!
Let’s look at an example. We’re doing an A/B test comparing two subject lines. We want to find out which one does a better job of getting people to open the email. Before starting the test, we decided to include 10,000 people: 5,000 will receive subject line A and 5,000 will receive subject line B. Of the subject line A recipients, 500 opened the email. Of the subject line B recipients, 550 opened. The observed open rates in each group are 10% and 11%, respectively. We observe that the open rate in group B is 10% higher (in relative terms) than the open rate in group A. Using the score test, we compute a p-value of 0.103, which does not constitute statistically significant evidence of a difference in open rates at the 0.05 threshold. It is conceivable, by this metric, that the observed difference is just an artifact of the way people were assigned to groups.
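For anyone who wants to check that number, the pooled two-proportion z-test, which coincides with the score test for a difference between two binomial proportions, reproduces it. Here is a sketch using statsmodels (my choice of library purely for illustration):

```python
# Reproducing the example's p-value with the pooled two-proportion z-test,
# which is the score test for a difference between two binomial proportions.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

opens = np.array([500, 550])    # group A, group B
sends = np.array([5000, 5000])

z_stat, p_value = proportions_ztest(opens, sends, alternative="two-sided")
print(f"z = {z_stat:.2f}, two-sided p-value = {p_value:.3f}")  # p ≈ 0.103
```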
One can imagine the uncomfortable conversation between a data scientist and an executive:
Executive: What were the results of the A/B test?
Data Scientist: It looks like B is a bit better than A, but the results were not statistically significant.
Executive: So what do we do?
Data Scientist: (shrugs) Run the test for longer I guess?
It doesn’t have to be this way.
Instead, it’s best to report a confidence interval on the treatment effect. While we observe a 10% relative lift in open rate in group B as compared to group A, a 95% confidence interval has endpoints -2% and 23%. Since the result is not statistically significant, this confidence interval includes 0, but it also includes negative numbers, meaning that we cannot rule out the possibility that B is worse than A, even though it seems like it is better. But—importantly!—we can rule out the possibility that it is meaningfully worse. It isn’t 5% worse, or 20% worse. There would be very little risk (and clear upside) in moving forward with B. Even though the result is not statistically significant, we can still be confident in our recommendation. From a business perspective, it likely would not make sense to spend more time on a test when we have such a clear picture of the risk/reward trade-off.
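As a rough illustration of where those endpoints come from, here is a sketch that computes a Wald-style interval on the log relative risk (the delta method) using scipy. A score-based interval would not be identical, but for this example it lands on essentially the same endpoints. The function name below is just for illustration.

```python
# Sketch: 95% interval for the relative lift of B over A, via a Wald interval
# on the log relative risk (delta method). A score-based interval may differ
# slightly, but gives nearly the same endpoints for this example.
import numpy as np
from scipy import stats

def relative_lift_ci(opens_a, sends_a, opens_b, sends_b, level=0.95):
    p_a, p_b = opens_a / sends_a, opens_b / sends_b
    log_rr = np.log(p_b / p_a)
    se = np.sqrt((1 - p_a) / opens_a + (1 - p_b) / opens_b)  # SE of log(p_b/p_a)
    z = stats.norm.ppf(0.5 + level / 2)
    return np.exp(log_rr - z * se) - 1, np.exp(log_rr + z * se) - 1

lo, hi = relative_lift_ci(500, 5000, 550, 5000)
print(f"95% CI for relative lift: {lo:+.1%} to {hi:+.1%}")  # ≈ -2% to +23%
```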
It isn’t any better when results are statistically significant.
Now suppose we observed 575 opens in group B instead of 550. Now the observed open rate in group B is 11.5%, a 15% relative lift compared to group A. The p-value is now 0.015, comfortably below the 0.05 threshold needed for statistical significance. The data scientist enthusiastically recommends subject line B as the permanent replacement for A. Months pass before the inevitable conversation:
Executive: Hey, why is the open rate about the same as it was a few months ago? I thought you said our new subject line was 15% better than the old one, but our open rates are basically flat!
Data Scientist: Umm…maybe it’s the winner’s curse!
Executive: (facepalm)
Had the data scientist computed a confidence interval, they would have found endpoints 2.7% and 29%. So even though the observed treatment effect is a 15% lift in open rates, and even though the result was statistically significant, it is possible the real lift is as low as 2.7%! Combine that with the winner’s curse (which arises whenever we select for positive outcomes), and the real effect will tend to be lower than what we observe. In other words, the observed treatment effect will tend to be biased upward.
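The same kind of interval sketched above reproduces those endpoints for the second scenario:

```python
# Second scenario: 575 opens in group B. Statistically significant, yet the
# interval still reaches down to roughly +3%.
import numpy as np
from scipy import stats

p_a, p_b = 500 / 5000, 575 / 5000
se = np.sqrt((1 - p_a) / 500 + (1 - p_b) / 575)   # SE of log(p_b/p_a)
z = stats.norm.ppf(0.975)
lo = np.exp(np.log(p_b / p_a) - z * se) - 1
hi = np.exp(np.log(p_b / p_a) + z * se) - 1
print(f"95% CI for relative lift: {lo:+.1%} to {hi:+.1%}")  # ≈ +2.7% to +29%
```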
Too many people focus on statistical significance, either interpreting its presence as permission to take the observed result at face value, or its absence as a failure. The better path is to focus on confidence intervals and what they tell us about the range of possibilities: the possible risks and the likely rewards.
Planning and Analyzing Tests
Even when people are on board with doing things the right way, they still need to be able to calculate minimum sample sizes and confidence intervals. Once upon a time I googled “A/B testing calculator” and went through every result on the first page. Some of them could analyze tests but not plan them. Some could plan them but gave incorrect answers (the documentation on the website quoted different results than the calculator itself produced for the same inputs). None reported confidence intervals, though some did report “confidence levels”, which, confusingly, is what they called one minus the p-value. Overall, if you picked a calculator at random, you probably were not going to be very successful.
So I made my own calculator.
You can use it to plan and analyze a common type of A/B test, where the KPI is a success rate: a number of successes out of a certain number of trials or opportunities to succeed. Pretty much any KPI that is a numerator over a denominator, like an open rate, is a success rate. It is by far the most common type of A/B test I run. To be clear, I also often run A/B tests where the KPI does not fall into this paradigm (like average revenue per user), but the other online calculators don’t support that either.
It supports multiple KPIs, addressing the multiple comparisons problem automatically so users don’t even need to know about it. It uses the score test instead of Student’s t-test (no Gaussian assumption needed!), and reports two-sided p-values and confidence intervals (as it should in the majority of cases!). I think the FAQ section below the calculator covers the most common questions about how to use it, but feel free to contact me if anything is unclear.
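If you are analyzing several KPIs by hand rather than with a calculator, one standard way to handle the multiple comparisons problem is the Holm-Bonferroni correction. Here is a sketch using statsmodels, with made-up p-values for three KPIs, just to illustrate the idea (this is one common adjustment, not a description of the calculator’s internals).

```python
# Sketch: adjusting p-values for multiple KPIs with the Holm-Bonferroni
# correction. The p-values and KPI names below are made up for illustration.
from statsmodels.stats.multitest import multipletests

kpis = ["open rate", "click rate", "unsubscribe rate"]
p_values = [0.012, 0.047, 0.30]

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")

for kpi, p_raw, p_adj, significant in zip(kpis, p_values, p_adjusted, reject):
    print(f"{kpi}: raw p = {p_raw:.3f}, adjusted p = {p_adj:.3f}, "
          f"significant at 0.05: {significant}")
```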