Bayesian A/B Testing Considered Harmful

Recently I’ve been studying Theoretical Statistics by Cox and Hinkley. The book rewards careful study. It’s full of gems I had missed when reading it casually. One particularly interesting chapter discusses certain principles or properties that are often desirable in a statistical analysis. In simple examples, we can find procedures that satisfy all of these properties, but there are examples where we need to think about what properties are most important to the case at hand.

This discussion clarified the conflict between Bayesian and Frequentist methods. While I find myself most often using Frequentist methods, I certainly don’t have anything against Bayesian methods, and I have used Bayesian methods successfully in some projects. I’m especially impressed with Bayesian hierarchical models, which seem like the logical conclusion of regularization-based approaches. But I had come away with the impression that Bayesian methods with an uninformative prior typically give the same answer as Frequentist methods, so I didn’t see much of a difference between them.

Perhaps the cornerstone of the Frequentist philosophy is the Strong Repeated Sampling Principle, which Cox and Hinkley describe as the importance of assessing statistical procedures by their performance under hypothetical repetitions. For reasons I will explain shortly, I associate this principle with the notion of repeatability. It’s not something I would give up lightly.

Another principle discussed is the Strong Likelihood Principle, which states that if two random systems have proportional likelihood functions, we should draw the same conclusions from both systems. Cox and Hinkley give the example of a sequence of Bernoulli trials (with which I am all too familiar). One random system samples a pre-determined number of points, $n$, and observes $r$ successes. A second system samples points one at a time, continuing until $r$ successes are observed. Suppose it takes $n$ trials to observe these successes. The first system is described by the binomial distribution; the second by the negative binomial distribution. The corresponding likelihood functions are proportional, and so the Strong Likelihood Principle states that we should draw the same conclusions about the probability of success regardless of which sampling mechanism is used.
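To make the proportionality concrete, here is a quick numerical check. It's only a sketch: the values $r = 7$ and $n = 20$ are made up for illustration, and it uses scipy's parameterization of the negative binomial (number of failures before the $r$-th success).

```python
import numpy as np
from scipy import stats

r, n = 7, 20  # successes observed, total trials (illustrative values)
p_grid = np.linspace(0.05, 0.95, 19)  # candidate success probabilities

# Fixed-n sampling: probability of r successes in n Bernoulli(p) trials.
lik_binom = stats.binom.pmf(r, n, p_grid)

# Sample-until-r-successes: the r-th success arrives on trial n,
# i.e. n - r failures occur before the r-th success.
lik_negbinom = stats.nbinom.pmf(n - r, r, p_grid)

# The ratio does not depend on p, so the two likelihoods are proportional.
print(lik_binom / lik_negbinom)  # the same constant (n / r) at every grid point
```

The ratio works out to $n/r$, a constant that carries no information about the success probability, which is exactly what "proportional likelihoods" means here.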

Before discussing the relationship between the Strong Likelihood Principle and the Strong Repeated Sampling Principle, I’ll mention the Bayesian Coherency Principle. Cox and Hinkley describe this as requiring that probability statements “should be such as to ensure self-consistent betting behavior”. This enforces a kind of internal consistency in any conclusions, and that certainly seems desirable. Cox and Hinkley do not go into detail, but argue that the Bayesian Coherency Principle requires we also accept the Strong Likelihood Principle.

Considered individually, the Repeated Sampling and Bayesian Coherency Principles both seem reasonable and desirable, but it turns out there are certain situations where we can’t have both. Consider an A/B test with early peeking. We have decided on a sample size to get a certain power against a certain alternative hypothesis of interest, and we start collecting observations. But half-way through we get curious and check the data. There is a statistically significant result, so we stop the test.

This is a pretty common occurrence, and IT IS BAD. Do not do this! Checking the data in the middle of the test can be thought of as a second comparison that would need to be corrected for, e.g. using the Bonferroni correction. But this “early peeking” means that the duration of the test depends on the data in the test, and so the duration of the test itself is random (since the data are random). The typical formulae we use to calculate p values and confidence intervals assume that the sample size is fixed, not random. If we use the typical formulae, early peeking has the result of inflating our Type-I error rate: we get the wrong answer more often than we planned for.
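Here is a rough simulation sketch of that inflation. All the parameters (sample size, conversion rate, number of simulations) are made up for illustration, and the test is a plain two-proportion z-test; both arms have the same true rate, so every rejection is a false positive.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_arm, p_true, alpha, n_sims = 2000, 0.10, 0.05, 20_000

def p_value(successes_a, successes_b, n):
    """Two-proportion z-test p-value, treating the sample size n as fixed."""
    p_pool = (successes_a + successes_b) / (2 * n)
    se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
    z = (successes_a / n - successes_b / n) / se
    return 2 * stats.norm.sf(abs(z))

reject_peek = reject_fixed = 0
for _ in range(n_sims):
    a = rng.binomial(1, p_true, n_per_arm)  # arm A, same true rate as arm B
    b = rng.binomial(1, p_true, n_per_arm)
    half = n_per_arm // 2

    # Peeking analyst: stop at the halfway look if it is already "significant",
    # otherwise take one more look at the planned sample size.
    if (p_value(a[:half].sum(), b[:half].sum(), half) < alpha
            or p_value(a.sum(), b.sum(), n_per_arm) < alpha):
        reject_peek += 1

    # Disciplined analyst: a single look at the planned sample size.
    if p_value(a.sum(), b.sum(), n_per_arm) < alpha:
        reject_fixed += 1

print("Type-I error with one peek:", reject_peek / n_sims)   # roughly 0.08
print("Type-I error with fixed n :", reject_fixed / n_sims)  # roughly 0.05
```

Even a single unplanned peek pushes the error rate well above the nominal 5%, and more frequent peeking makes it worse.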

(There are other procedures, known as sequential tests, that allow as much early peeking as you want. I’ve never been able to find easy-to-understand procedures for calculating confidence intervals for sequential tests, and in my view, confidence intervals are the most valuable thing to come out of a test. So until I figure out how to calculate confidence intervals for a sequential test, I have no intention of using them.)

If you want to be able to stop a test as soon as there is sufficient evidence, go for it! Just use a sequential test! If you want to use the simple formulae, just don’t stop the test early! Or if you’re in a dire situation and you really need to end a test early, by all means do so, but note that your statistical calculations may be a bit off. This is largely a question of understanding the assumptions that go along with the methods we’re using, and using the right tool for the job.

However, everything I just wrote goes against the Strong Likelihood Principle, and therefore against the Bayesian Coherency Principle. Since the likelihood function of a test with a fixed and pre-determined sample size is proportional to that of a test with potential early stopping, the Strong Likelihood Principle states that we should draw the same conclusions from either test.

I had read about this years ago and concluded that while this simplicity is attractive, there’s no such thing as a free lunch, and so we must be giving something up; I just didn’t know what that was. It turns out it’s the Strong Repeated Sampling Principle.

Bayesian books and papers like to dismiss the “hypothetical repetitions” mentioned in this principle as irrelevant to a data analysis, saying that conclusions should be based on the data at hand, not some other data that might hypothetically have been observed. However, if I am estimating some physically meaningful quantity, I’d really like to believe that some other data scientist operating independently from me would draw the same conclusions that I would draw, at least with high probability. So if I’m estimating the difference in conversion rates between two ads, I’d really like the confidence interval I calculate to pretty closely overlap with the confidence interval another data scientist would calculate. If I estimate that one ad has a conversion rate that is 10% higher than the other, I would hope another data scientist would draw basically the same conclusion. I don’t think of this as just a hypothetical repetition, I think of this as a completely essential, non-negotiable requirement if what I’m doing is to be considered scientific.

If my colleague and I run the test the same way, this requirement is satisfied. But if I run a test until I get a stat sig result (as judged using the simple, non-sequential formula), reporting the sample sizes and numbers of conversions associated with each ad, and then my colleague runs a test with exactly those sample sizes, she may or may not get a stat sig result. If 100 other data scientists do the same, perhaps only 5 will get a stat sig result (which is how many false positives we’d expect when there isn’t a meaningful difference between the two ads, and we’re using a significance threshold of 0.05). That’s because I’m running a sequential test, but using the non-sequential formula, and that inflates my Type-I error rate. My colleagues are running tests with fixed sample sizes, and using the correct formula, and therefore draw the correct conclusions.
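Here is a sketch of that scenario under the null hypothesis, again with made-up parameters: I peek after every batch and stop the moment the naive formula reports significance; a colleague then reruns the test once, at exactly the sample size where I stopped.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
p_true, alpha, batch, max_batches, n_sims = 0.10, 0.05, 200, 20, 5_000

def p_value(successes_a, successes_b, n):
    """Two-proportion z-test p-value, treating the sample size n as fixed."""
    p_pool = (successes_a + successes_b) / (2 * n)
    se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
    return 2 * stats.norm.sf(abs(successes_a / n - successes_b / n) / se)

my_rejections = colleague_rejections = 0
for _ in range(n_sims):
    sa = sb = n = 0
    for _ in range(max_batches):
        sa += rng.binomial(batch, p_true)  # conversions in arm A this batch
        sb += rng.binomial(batch, p_true)  # conversions in arm B this batch
        n += batch
        if p_value(sa, sb, n) < alpha:     # my rule: stop as soon as "significant"
            my_rejections += 1
            break

    # My colleague repeats the test once, at exactly the sample size I reported.
    ca = rng.binomial(n, p_true)
    cb = rng.binomial(n, p_true)
    if p_value(ca, cb, n) < alpha:
        colleague_rejections += 1

print("My false-positive rate (peeking):       ", my_rejections / n_sims)         # well above 0.05
print("Colleague false-positive rate (fixed n):", colleague_rejections / n_sims)  # about 0.05
```

With repeated peeking my false-positive rate climbs far above 5%, while the colleague who fixes the sample size in advance stays right around the nominal level, which is the replication failure described above.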

Again, all I am arguing is that if you use a sequential test, you should analyze it accordingly. You wouldn’t expect a colleague to run a different kind of test and get the same answer you did. But this violates the Strong Likelihood Principle, and therefore the Bayesian Coherency Principle. I can believe there are situations where the Strong Repeated Sampling Principle is not especially relevant. If I’m predicting the probability that Newsom survives the recall election, the concept of hypothetical repetitions is a bit ridiculous. This is a one time event! But in science we study physically meaningful quantities that have some kind of objective reality, and that means that multiple people should draw substantively equivalent conclusions.

What do Cox and Hinkley have to say? They too prefer the Strong Repeated Sampling Principle. Their reasons are: first, not all uncertainties are created equal. It doesn’t always make sense to put uncertainties deriving from physical systems on equal footing with uncertainties deriving from personal impressions. Second, betting games are “at best interesting models of learning in the face of uncertainty”. Finally, the Bayesian Coherency Principle is about internal consistency, whereas the Repeated Sampling Principle is about consistency with external reality. I think this last point is the most relevant to deciding which procedure to use: if hypothetical repetitions are sensible in the external reality you are studying, prefer the Strong Repeated Sampling Principle. Otherwise, you might as well be internally consistent!


