Contingency Tables Part I: The Potential Outcomes Framework
(This is the first post in a planned series about the design and analysis of A/B tests. Subscribe to my newsletter at the bottom to be notified about future posts.)
Imagine we have run an A/B test like the one illustrated in Figure 1. A thousand people are exposed to some legacy experience, and one hundred of those people do something good, e.g. they buy something. Another thousand people see some new thing, and one hundred and thirty buy something. We can summarize these results in a contingency table, which makes it easy to compare the two groups and is especially helpful for analysis. But what do we need to analyze? It certainly seems like the new thing is better than the old thing! After all, you don’t need a PhD in statistics to know that 130 > 100.
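For concreteness, here’s a minimal sketch of that contingency table in code. The counts are the ones described above; the row and column labels are just my own choice of names.

```python
# The 2x2 contingency table for this A/B test, built with pandas.
# Counts come from the example above; labels are illustrative.
import pandas as pd

table = pd.DataFrame(
    {"purchased": [100, 130], "did not purchase": [900, 870]},
    index=["legacy", "new"],
)
print(table)
#         purchased  did not purchase
# legacy        100               900
# new           130               870
```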
I think there’s this resentment that sometimes builds up between data scientists and non-data scientists when looking at results like this. As if non-data scientists need permission to conclude that 130 > 100, and statistical significance is what gives them that permission. As if data scientists were some sort of statistics mafia who need to be “consulted” if you know what’s good for you.
I can assure you, 130 is greater than 100, and statistical significance doesn’t have anything to do with it. The problem is, this is the wrong comparison.
There were 2000 people in this test. What do we think would have happened if we had shown all of them the legacy experience? Of course, we can only speculate about this, because that’s not what happened. If the results of this A/B test are a reasonable proxy, we might think that around 10% of them (around 200) would have purchased. But it would be silly to pretend we know exactly what would have happened in that scenario. Nobody can do a perfect job of speculating.
What do we think would have happened if all 2000 had experienced the new thing? Again, we don’t know—we can’t know—exactly, but it seems reasonable to speculate that around 13% of them (around 260) would have purchased.
There are a lot of weasel words in the last two paragraphs, and that’s because we are speculating about two situations, neither of which actually happened. It’s like, a person can speculate about what would have happened if Hillary had won the election in 2016, or if Bernie had won the nomination that year. But anyone who says they know exactly what would have happened is mistaken.
Nevertheless, this is the comparison we actually care about: the outcome where everyone gets the legacy experience versus the outcome where everyone gets the new experience. Neither of those things actually happened, but it’s the comparison that matters. And this is the problem that statistics tries to solve: the comparison we care about, we can’t make; and the comparison we can make, we don’t care about!
In causal inference, the two scenarios we care about are called potential outcomes, and comparing potential outcomes is essential for smart decision making. But it’s easy to see that we can only ever observe at most one of the potential outcomes. If we decide to show the new experience to everyone, we give up any hope of learning what would have happened had we shown the old experience to everyone. An A/B test is a hybrid: we give up on observing either of the potential outcomes, but in exchange we get some insight into what both potential outcomes might look like.
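To make the bookkeeping concrete, here is a toy sketch of potential outcomes. The data below are entirely simulated, and drawing each person’s two potential outcomes independently is just a simplification for illustration; the point is that every person has an outcome under the legacy experience and an outcome under the new experience, but we only ever get to record the one matching the group they were actually assigned to.

```python
# Toy illustration of potential outcomes: two columns per person, only one observed.
# All numbers here are simulated for illustration, not real data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 8
y_legacy = rng.binomial(1, 0.10, size=n)  # would this person buy under the legacy experience?
y_new = rng.binomial(1, 0.13, size=n)     # would this person buy under the new experience?
group = rng.integers(0, 2, size=n)        # 0 = assigned legacy, 1 = assigned new

df = pd.DataFrame({
    "group": np.where(group == 1, "new", "legacy"),
    "Y_if_legacy": [y if g == 0 else "?" for y, g in zip(y_legacy, group)],
    "Y_if_new": [y if g == 1 else "?" for y, g in zip(y_new, group)],
})
print(df)  # every row has exactly one "?": the potential outcome we never get to see
```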
Here’s another complication: what if this A/B test isn’t a good proxy for what would have happened in those hypothetical situations? What if the sorts of people in the first group are just completely different than the sorts of people in the second group? Like let’s say it’s an ad for a remote control car, and the only people in the first group are people who hate fun, and the only people in the second group are children and engineers with disposable income? The fact that fewer people in the first group purchased says more about those people and less about the ad that was shown.
That’s why it is so important to verify “covariate balance”: checking that relevant demographic and behavioral characteristics are comparable in the two groups. When we randomly assign people to groups, the law of large numbers guarantees that all covariates will be approximately balanced, but because assignment is random we can still get some fluctuations. It’s always a good idea to check, and if there is a (small) imbalance, we can correct for it after the fact. Even better, if there are factors we think are especially important to balance, like gender or economic status, we can use a process called stratification to make sure they’re balanced when we design the test.
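As a rough sketch of what such a check might look like in practice, one common approach is to compute a standardized difference in means for each covariate and flag anything large. The per-visitor dataframe and column names below are hypothetical.

```python
# Sketch of a covariate balance check: for each covariate, compare the two groups
# using the standardized difference in means. Column names here are hypothetical.
import numpy as np
import pandas as pd

def standardized_difference(x_a, x_b):
    """Difference in means, scaled by the pooled standard deviation."""
    pooled_sd = np.sqrt((np.var(x_a, ddof=1) + np.var(x_b, ddof=1)) / 2)
    return (np.mean(x_a) - np.mean(x_b)) / pooled_sd

def balance_report(visitors, group_col, covariates):
    a = visitors[visitors[group_col] == "legacy"]
    b = visitors[visitors[group_col] == "new"]
    return pd.Series({c: standardized_difference(a[c], b[c]) for c in covariates})

# Hypothetical usage, assuming one row per visitor:
# balance_report(visitors, "group", ["age", "past_purchases", "is_mobile"])
# A common rule of thumb treats |standardized difference| > 0.1 as worth a closer look.
```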
Another problem that can arise is interaction between the two groups. For example, let’s say the first group doesn’t see an ad at all, and the second group does. Let’s say Alice is in the first group and Brian is in the second. Neither of them buys the remote control car. But had Alice seen the ad, not only would she have bought the car, she would have called her friend Brian and persuaded him to buy one too so they could race! In this scenario, what Brian does depends on what Alice experiences.
Humans being social creatures, it’s easy to imagine this happens pretty frequently, and researchers bend over backwards to avoid it. There just isn’t a great way of taking it into account. Instead, researchers try to separate people so that what one person experiences does not influence what another person does. This assumption is called the “stable unit treatment value assumption” or SUTVA, which is one of those terms that doesn’t begin to describe what it actually means, but is nonetheless an essential assumption in causal inference. (I wrote more about SUTVA here.)
Even when the covariates are balanced and SUTVA holds, the observed results are still not a perfect proxy for the two scenarios we care about, because each group is only a randomly chosen half of the audience. Is a number that’s around 260 greater than a number that’s around 200? Yeah, probably, depending on what we mean by “around” and “probably”. Again, these are weasel words, and we need statistics to explain exactly what we know and don’t know about the two hypothetical scenarios we’re trying to compare.
Without a doubt, 130 > 100, but do we think that a number around 260 is greater than a number around 200? And if so, how much bigger? The p-value answers the first question: the smaller the p-value, the more confident we are that a number around 260 really is greater than a number around 200, even though we don’t know exactly what those two numbers are; in other words, that the observed comparison is at least directionally correct. How confident is confident enough? That depends on context, like whether someone’s life is at stake, but typically if the p-value is less than 0.05, we say the result is statistically significant, and that threshold defines what counts as “confident enough”.
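Here is a quick sketch of a significance check on the contingency table above, using a chi-square test, which is one common choice for a 2x2 table like this (an equivalent two-proportion z-test would give a similar answer):

```python
# A quick significance check on the 2x2 contingency table above, using a chi-square
# test (one common choice for a table like this; other reasonable tests give similar results).
from scipy.stats import chi2_contingency

table = [[100, 900],   # legacy: purchased, did not purchase
         [130, 870]]   # new:    purchased, did not purchase
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"p-value ≈ {p_value:.3f}")   # roughly 0.04 for these counts, i.e. below 0.05
```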
(The proper interpretation of p-values is quite challenging because of the multiple comparisons problem. I created a number called the W statistic specifically for this reason. I could—and eventually will—fill an entire other post on this topic.)
The confidence interval answers the “how much bigger” question. Some numbers around 260 are quite a bit higher than some other numbers around 200. Some numbers around 260 are only slightly higher than some other numbers around 200. Even though this particular result is statistically significant, all that tells us is that we think the new experience is in fact better than the old one. It could be slightly better, or way better, or anywhere in between.
The confidence interval gives us a plausible range of values that are consistent with what we have observed. In this case, there is quite a bit of uncertainty! Even though we observed the new experience to be 30% better than the legacy one, that is only true for a randomly chosen half of the audience! We don’t know exactly how much better the new experience would have been, had we shown it to everyone, compared to the legacy experience, had we shown it to everyone, because neither of those things actually happened. But we can be reasonably confident that it would have been at least 1.79% better. And we can be confident that it wouldn’t have been more than 66.13% better. And yeah, it probably would have been around 30% better. But we just can’t know for sure, because the comparison we actually care about isn’t what actually happened.
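For the curious, here is a rough sketch of how an interval like that can be computed, using a normal approximation on the log of the ratio of the two conversion rates. The exact endpoints quoted above presumably come from a similar but not necessarily identical calculation, so the numbers here differ slightly.

```python
# Sketch of a 95% confidence interval for the relative lift (new vs. legacy),
# using a normal approximation on the log of the risk ratio.
import numpy as np

x_legacy, n_legacy = 100, 1000
x_new, n_new = 130, 1000

p_legacy = x_legacy / n_legacy
p_new = x_new / n_new

log_rr = np.log(p_new / p_legacy)
se = np.sqrt((1 - p_legacy) / x_legacy + (1 - p_new) / x_new)
lo, hi = np.exp(log_rr - 1.96 * se), np.exp(log_rr + 1.96 * se)

print(f"observed lift: {p_new / p_legacy - 1:+.1%}")       # about +30%
print(f"95% CI for lift: {lo - 1:+.1%} to {hi - 1:+.1%}")  # roughly +1.7% to +66%
```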
So that’s why data scientists care about sample sizes, and p-values, and confidence intervals. Because we’re trying to compare two hypothetical scenarios which by definition cannot both actually occur. Instead, we create a situation in which neither of these scenarios occur, so that we can learn about both of them. It actually works pretty well! In fact, it works better than literally anything else the human race has ever come up with. But it still doesn’t work perfectly, and that’s why if you look at a test like this and conclude that 130 > 100, your data scientist friend is going to die a little inside.
Like this post? Check out the next in the series here.