The Winner's Curse: Why it Happens and What to Do About It
Maximum Likelihood Estimates are not especially good point estimates
Introduction
Imagine we’ve run an A/B test to figure out which subject line is the best at making people open emails. This test involves ten different subject lines, each of which is sent to 1,000 people. The results are shown below.
Variant | Opens | People | Open Rate |
---|---|---|---|
A | 10 | 1000 | 1% |
B | 30 | 1000 | 3% |
C | 50 | 1000 | 5% |
D | 70 | 1000 | 7% |
E | 90 | 1000 | 9% |
F | 110 | 1000 | 11% |
G | 130 | 1000 | 13% |
H | 150 | 1000 | 15% |
I | 170 | 1000 | 17% |
J | 190 | 1000 | 19% |
Of course, if we actually observed results in such a clear pattern, we would be deeply suspicious. But you can imagine we’ve simply labeled the variants in ascending order of observed performance. The perfect spacing is irrelevant to the point of this post; all that matters is that there is some spread in observed performance.
When we plug these results into my A/B testing calculator, we see that most of these comparisons are statistically significant. For example, although there is no stat sig difference between variants I and J, or between H and J, there is between G and J, and between F and J, and so forth. Certainly J is one of the best variants, if not the best. Similarly, A is clearly one of the worst variants, if not the worst.
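For concreteness, here is a rough sketch of these pairwise comparisons using a two-proportion z-test from statsmodels. This is not necessarily the same calculation the calculator above performs, and it makes no correction for multiple comparisons, so its raw p-values may not line up exactly with the conclusions quoted above.

```python
# Pairwise two-proportion z-tests against the best-performing variant (J).
# A sketch of the comparisons described above; the linked calculator may use
# a different method, and this ignores multiple-comparison corrections.
from statsmodels.stats.proportion import proportions_ztest

opens = {"A": 10, "B": 30, "C": 50, "D": 70, "E": 90,
         "F": 110, "G": 130, "H": 150, "I": 170, "J": 190}
n = 1000  # emails sent per variant

for variant, y in opens.items():
    if variant == "J":
        continue
    stat, p_value = proportions_ztest([opens["J"], y], [n, n])
    print(f"J vs {variant}: z = {stat:.2f}, p = {p_value:.4f}")
```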
If we were to declare variant J “the winner”, and use it for all emails going forward, what kind of open rate might we expect? The maximum likelihood estimate is simply the observed open rate: 19%. Is that the best prediction of future performance? Well, I can virtually guarantee the performance will be worse than 19%, a phenomenon known as the Winner’s Curse. In this post, we discuss why the Maximum Likelihood Estimate actually does not work very well as a point estimate when we have several experiment groups.
Regression to the Mean
The Winner’s Curse is also known as “regression to the mean”, a phenomenon originally studied by Sir Francis Galton in the 1800s in the context of human stature. Galton observed that tall parents tended to have shorter children—not short, but shorter than the parents. And short parents tended to have taller children—not tall, but taller than the parents. He coined the term “regression to the mean” to describe this phenomenon.
And it’s a good thing this phenomenon exists! Imagine it were the opposite, that tall parents had taller children, and short parents had shorter children. The human race would bifurcate in a few generations, which clearly does not happen. Regression to the mean is a stabilizing force in evolution.
But why does it happen? It’s because height is a combination of genes and random chance, and these things balance out. But it’s easier to understand in a more familiar context: test scores.
Suppose a large group of students take a test. Maybe it’s an intro to statistics course with 1,000 students who are taking the midterm exam. Some of the students do very well, some of the students do pretty poorly, and many students are somewhere in the middle. We might decide the students who do very well are promising candidates for data science careers, and so we monitor their performance on the final. While some of the students still do well, most of the students do worse on the final than they had on the midterm. Meanwhile, many of the students who had done poorly on the midterm, do better on the final. The students we originally thought promising do worse, and the seemingly hopeless students do better: regression to the mean.
To understand this phenomenon, we need only consider why the students who did well on the midterm did so well. Certainly most of them have a pretty good understanding of the material, but many of them also just got lucky: they guessed correctly on the multiple choice questions, and coincidentally the problems asked were very similar to the problems they had studied. The students who did the best had a strong understanding of the material and got lucky. On the final, these students still had a strong understanding of the material, but were not, on average, as lucky as before. Most people don’t get lucky twice in a row. So the overall performance appeared to decrease.
Similarly, the students who did the worst on the midterm struggled with the material, but were also unlucky. They may have studied just as much as the other students, but happened to study problems that didn’t appear in the midterm. But it is just as unlikely to be unlucky twice in a row as it is to be lucky, so on average the performance improves.
We can isolate this phenomenon as: “if you find a group of lucky people, they will tend to be less lucky the next time”. And of course the statement works swapping lucky with unlucky. In contrast, if you start with just a group of people, with no preference towards the lucky or unlucky, we wouldn’t expect to see any particular shift in luck over time. It’s the initial selection of lucky people that creates this phenomenon, and whenever we grab the best (or worst) of a bunch, where “best” is at least partially influenced by luck, we are setting ourselves up for regression to the mean.
This shows up all over, and it goes by many names: regression to the mean, winner’s curse, sophomore slump—it’s the same phenomenon. Imagine there is some promising NBA player in his first year. Everyone thinks he’s the next Lebron James. But then the next season he doesn’t do as well. He still does well, but not as well as people expected, so everyone is disappointed. A musician comes out with an amazing album and becomes really popular, but then their second album is a flop. You find a new restaurant you love, but then the food isn’t as good the next time you go. In all of these cases, we’re selecting someone or something as being especially good, and in so doing we are condemning ourselves to disappointment.
And this phenomenon is exacerbated the more things are involved. How many new musicians release albums per year? For them to stand out, it isn’t enough for them to be excellent musicians; they also need to have the right sound at the right time. That’s luck! And the more musicians there are, the bigger the difference between the luckiest and unluckiest. The more monkeys typing randomly, the more likely one is to type Shakespeare. Mathematically, if the “luck” exhibited by person $i$ is $u_i \sim F$, then the luck associated with the luckiest person, $\xi_n = \max\{ u_1, \ldots, u_n \}$, will tend to increase with $n$.
The larger $n$ is, the larger the role luck plays in the top (or bottom) performers. In an A/B test, the more experiment groups we have, the more likely we are to observe the Winner’s Curse.
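A quick simulation makes this concrete. The standard normal “luck” distribution and these particular values of $n$ are arbitrary choices; the point is only that the expected maximum grows with $n$.

```python
# The expected maximum ("luckiest draw") of n i.i.d. luck values grows with n.
# Standard normal luck is an arbitrary illustrative choice.
import numpy as np

rng = np.random.default_rng(0)
for n in [1, 10, 100, 1000]:
    # 5000 replications: draw n luck values, keep the largest
    maxima = rng.standard_normal((5000, n)).max(axis=1)
    print(f"n = {n:>4}: average maximum luck = {maxima.mean():.2f}")
```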
Empirical Bayes and Hierarchical Models
Going back to our original example, it’s clear that while subject line J is probably pretty good, we should expect some kind of regression to the mean. When we communicate to the business what kind of performance they can expect, we should temper expectations. But how much?
The Empirical Bayes methodology is a simple and appealing approach. We postulate that the “true” open rates (which we do not observe) come from some distribution, such as a Beta distribution, with unknown shape parameters: $\pi^{(k)} \sim \mathrm{Beta}(\alpha, \beta)$. Then the observed number of opens in each group has a binomial distribution, conditional on the open rate: $y^{(k)} \; | \; \pi^{(k)} \sim \mathrm{Binom}(m^{(k)}, \pi^{(k)})$, where $m^{(k)}$ is the number of emails sent using subject line $k$, and $y^{(k)}$ is the corresponding number of opens. The observed open rate is $y^{(k)} / m^{(k)}$, which is probably close to but distinct from the true open rate $\pi^{(k)}$.
The Empirical Bayes method starts by estimating $\alpha$ and $\beta$ from the data, and then using the posterior distribution $\pi^{(k)} \; | \; y^{(k)}$ to predict the open rate going forward. It can be shown that the open rate so reported will be a weighted average of the observed performance for that subject line and the overall average across all subject lines. As a result, we will tend to shrink both the worst performances and the best performances towards the average, exactly agreeing with the regression to the mean we anticipate.
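To make the shrinkage explicit: with a $\mathrm{Beta}(\alpha, \beta)$ prior and a Binomial likelihood, the posterior mean is $(\alpha + y^{(k)}) / (\alpha + \beta + m^{(k)})$, a weighted average of the prior mean $\alpha / (\alpha + \beta)$ and the observed rate $y^{(k)} / m^{(k)}$. The sketch below uses made-up values of $\alpha$ and $\beta$ purely to illustrate the effect; estimating them from the data is discussed later in this post.

```python
# Posterior mean under a Beta(alpha, beta) prior and Binomial likelihood:
#   E[pi | y] = (alpha + y) / (alpha + beta + m)
# a weighted average of the prior mean alpha/(alpha+beta) and the observed
# rate y/m. The alpha, beta below are made-up placeholder values; in
# Empirical Bayes they would be estimated from the data.
alpha, beta = 5.0, 45.0   # hypothetical prior with mean 10%
m = 1000

for variant, y in [("A", 10), ("E", 90), ("J", 190)]:
    observed = y / m
    posterior_mean = (alpha + y) / (alpha + beta + m)
    print(f"{variant}: observed {observed:.1%}, shrunk estimate {posterior_mean:.1%}")
```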
The “Empirical” part is that we are estimating the parameters of the prior distribution, $\alpha$ and $\beta$, directly from the data. Bayesian folks don’t like this terminology because it makes it sound like other Bayesian approaches aren’t empirical. But I’d rather not get hung up on semantics. It’s a neat approach!
We can do even better. The subject lines might fall into categories, in which case it makes sense to shrink estimates based only on other subject lines in the same category, or to give more weight to those subject lines. More generally, if we have metadata on the different subject lines, we can model the prior distribution as depending on these features. In this approach, we can even predict how untested subject lines will perform! See Bayesian Data Analysis by Gelman et al. for details.
But What About Maximum Likelihood Estimates?
The narrative above is sensible enough, but it was shocking in the 1950s when Charles Stein first published an argument like the above. At the time, the statistics community was enamored with approaches based on maximum likelihood estimates (MLEs). In our subject line example, the MLEs are simply the observed open rates. MLEs are often simple and intuitive, and have many useful properties like consistency and asymptotic normality. But Stein constructed an estimator, in the context of estimating the mean of a multivariate normal distribution in three or more dimensions, that was uniformly better than the MLE under squared error loss. From a modern perspective, we would consider this an early form of regularization, and we can (loosely) summarize Stein’s conclusion as: regularization almost always improves point estimates.
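As a small illustration of Stein-style shrinkage (not a reproduction of Stein’s original construction), here is a simulation comparing the total squared error of the MLE with that of the positive-part James-Stein estimator for a multivariate normal mean with known unit variance. The dimension and the true means are arbitrary choices.

```python
# Positive-part James-Stein shrinkage toward zero for a p-dimensional normal
# mean with known unit variance (p >= 3), compared with the MLE over many
# simulated datasets. The true means here are an arbitrary choice.
import numpy as np

rng = np.random.default_rng(1)
p = 10
theta = rng.normal(0, 1, size=p)        # true means
trials = 10000
mle_mse, js_mse = 0.0, 0.0

for _ in range(trials):
    x = theta + rng.standard_normal(p)  # one noisy observation per coordinate
    shrink = max(0.0, 1 - (p - 2) / np.dot(x, x))
    js = shrink * x
    mle_mse += np.sum((x - theta) ** 2) / trials
    js_mse += np.sum((js - theta) ** 2) / trials

print(f"MLE total squared error:         {mle_mse:.2f}")
print(f"James-Stein total squared error: {js_mse:.2f}")
```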
Since then, regularization (and Bayesian approaches like Empirical Bayes and Hierarchical Modeling that accomplish the same result) has been embraced in statistics (e.g. ridge regression, the lasso) and especially in Machine Learning, with techniques such as dropout and early stopping in neural networks.
Okay, so regularization almost always improves point estimates. What about hypothesis tests? What about confidence intervals? I think here the answer is more complicated. In the context of A/B tests where the sample size is much larger than the number of parameters in our models, I would hesitate to change my approach for assessing statistical significance or reporting confidence intervals.
With regards to hypothesis testing, we need to ask if a regularization-based test statistic has better power than those we would normally use. Under the typical assumptions of “classical” statistics (e.g. the number of parameters is fixed as the sample size increases), we ask: do we have simple null and alternative hypotheses? If so, then the likelihood ratio test is most powerful. If the alternative hypothesis is compound, then the score test is most powerful against local alternatives. If the null hypothesis is compound, then the maximum likelihood ratio statistic is asymptotically sufficient and thus most powerful, and is also asymptotically equivalent to the score and Wald tests. These are classic results in statistical inference, and regularization plays no role here. But in a high-dimensional setting where we are looking for a few promising candidates in a sea of null effects, a lasso-based approach will likely work very well. That’s usually not the situation in A/B testing.
With regards to confidence intervals, it’s a similar story. A good confidence interval is characterized by having the correct coverage above all else. Confidence intervals based on inverting hypothesis tests automatically have the correct coverage (assuming the hypothesis test does). So I always make my confidence interval based on my hypothesis test, not based on my point estimate. (I certainly do not compute a confidence interval as the point estimate plus or minus two standard errors.)
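As one concrete example, the Wilson interval for a single proportion is obtained by inverting the score test. The sketch below compares it with the “point estimate plus or minus roughly two standard errors” (Wald) interval for variant J; with a sample this large the two happen to be close, but the constructions differ.

```python
# Wilson confidence interval for variant J's open rate, obtained by inverting
# the score test, compared with the Wald ("plus or minus ~2 SE") interval.
from statsmodels.stats.proportion import proportion_confint

y, m = 190, 1000
lo_wilson, hi_wilson = proportion_confint(y, m, alpha=0.05, method="wilson")
lo_wald, hi_wald = proportion_confint(y, m, alpha=0.05, method="normal")
print(f"Wilson (score-test inversion):  ({lo_wilson:.3f}, {hi_wilson:.3f})")
print(f"Wald (estimate +/- ~2 SE):      ({lo_wald:.3f}, {hi_wald:.3f})")
```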
I do worry that under certain circumstances we might end up adjusting the point estimate so that it is nowhere near the center of the confidence interval. It could conceivably even fall outside the interval! If this happened in practice, I would take it as an indication that we have data issues, or that our model assumptions (e.g. that all open rates are drawn from the same Beta distribution) are wrong. I just can’t imagine this happening under normal circumstances.
As we discussed above, this regression to the mean phenomenon is exacerbated the more experiment groups we have. I probably wouldn’t bother with two or three experiment groups, but I would definitely consider it for five or more. It certainly does no harm to calculate both sets of point estimates and see how different they are!
Estimating the Prior Parameters
Added February 20, 2022.
I haven’t really been able to find a detailed explanation of how to estimate the prior parameters $\alpha$ and $\beta$ from the data. I’ve seen some hand-wavy explanations. I’ve seen instances where the Binomial distribution is approximated as Gaussian. I’ll sketch out here what I know so far.
Recall: $$ \begin{align} y^{(k)} \; | \; \pi^{(k)} &\sim \mathrm{Binom}\left(m^{(k)}, \pi^{(k)}\right) \\\ \pi^{(k)} &\sim \mathrm{Beta}(\alpha, \beta), \quad k=1, \ldots, K \end{align} $$
As derived in Theory of Point Estimation by Erich Leo Lehmann and George Casella, section 4.6, the marginal likelihood of $\alpha$ and $\beta$ is: $$ L(\alpha, \beta) = \prod_{k=1}^K {m^{(k)} \choose y^{(k)}} \frac{\Gamma(\alpha + \beta) \Gamma(\alpha + y^{(k)}) \Gamma(\beta + m^{(k)} - y^{(k)})}{\Gamma(\alpha) \Gamma(\beta) \Gamma(\alpha + \beta + m^{(k)})}. $$ Taking the logarithm and omitting terms which do not depend on $\alpha$ or $\beta$, the log-likelihood is: $$ \begin{align} \ell(\alpha, \beta) &= \sum_{k=1}^K \left[ \log \Gamma(\alpha + \beta) + \log \Gamma(\alpha + y^{(k)}) + \log \Gamma(\beta + m^{(k)} - y^{(k)}) \right. \\\ &\phantom{=} \hspace{20pt} \left. - \log \Gamma(\alpha) - \log \Gamma(\beta) - \log \Gamma(\alpha + \beta + m^{(k)}) \right] \\\ &= K \left[ \log \Gamma(\alpha + \beta) - \log \Gamma(\alpha) - \log \Gamma(\beta) \right] \\\ &\phantom{=} + \sum_{k=1}^K \left[ \log \Gamma(\alpha + y^{(k)}) + \log \Gamma(\beta + m^{(k)} - y^{(k)} ) \right. \\\ &\phantom{=} \hspace{30pt} \left. - \log \Gamma(\alpha + \beta + m^{(k)}) \right]. \end{align} $$
Now, according to Convex Optimization by Stephen Boyd and Lieven Vandenberghe, section 3.5.1, $\Gamma(x)$ is log-convex for $x \geq 1$. That doesn’t seem to help us much, because we have differences of convex functions that we wish to maximize. This point seems to be completely ignored both by Lehmann and Casella, and in Empirical Bayes Methods by J.S. Maritz and T. Lwin.
Using the relationship between Beta and Gamma functions, we can re-write the log-likelihood as $$ \ell(\alpha, \beta) = -K \log B(\alpha, \beta) + \sum_{k=1}^K \log \left( B(\alpha + y^{(k)}, \beta + m^{(k)} - y^{(k)}) \right). $$ Although this isn’t in Convex Optimization, a little Googling suggests that the Beta function is also log-convex, so the first term of this is concave, but the other terms are convex. So this doesn’t help much either.
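For reference, here is a direct transcription of this log-likelihood using scipy.special.betaln (dropping the binomial coefficients, which do not depend on $\alpha$ or $\beta$). The trial values of $\alpha$ and $\beta$ at the end are arbitrary.

```python
# Beta-Binomial marginal log-likelihood for the prior parameters, written with
# betaln as in the expression above (binomial coefficients omitted since they
# do not depend on alpha or beta).
import numpy as np
from scipy.special import betaln

def log_likelihood(alpha, beta, y, m):
    y = np.asarray(y, dtype=float)
    m = np.asarray(m, dtype=float)
    K = len(y)
    return -K * betaln(alpha, beta) + np.sum(betaln(alpha + y, beta + m - y))

# Open counts from the table above, 1000 emails per variant.
y = [10, 30, 50, 70, 90, 110, 130, 150, 170, 190]
m = [1000] * 10
print(log_likelihood(2.0, 18.0, y, m))  # alpha, beta: arbitrary trial values
```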
Given a good initial guess of $\alpha$ and $\beta$, perhaps a local maximum would work well. That’s what the texts above do, without even acknowledging the issue. It is a little frustrating, though, when the function you’re trying to maximize is not concave.
We may be able to use a convex-concave procedure to improve on a simple local search. We write the problem as: $$ \begin{array}{cl} \mbox{minimize} & f(\alpha, \beta) - g(\alpha, \beta), \quad (1) \end{array} $$ where $$ \begin{align} f(\alpha, \beta) &= K \log B(\alpha, \beta), \\\ g(\alpha, \beta) &= \sum_{k=1}^K \log \left( B(\alpha + y^{(k)}, \beta + m^{(k)} - y^{(k)}) \right) \end{align} $$ are both convex. We form a linear approximation to $g$ and use that to approximately solve $(1)$. Given estimates $\alpha_0$ and $\beta_0$, we approximate $$ g(\alpha, \beta) \approx \hat{g}(\alpha, \beta) = g(\alpha_0, \beta_0) + \nabla g(\alpha_0, \beta_0)^T \left[ \begin{array}{c} \alpha - \alpha_0 \\\ \beta - \beta_0 \end{array} \right], $$ and then solve $$ \begin{array}{cl} \mbox{minimize} & f(\alpha, \beta) - \hat{g}(\alpha, \beta). \end{array} $$ The solution to this problem is used to create a new approximation to $g$, and we continue this process until convergence.
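Here is a sketch of that convex-concave procedure using scipy. The gradients use the digamma function; the starting point, bounds, and stopping rule are arbitrary choices meant only to keep the subproblems well behaved, and this is not meant as production-quality code.

```python
# Convex-concave procedure for maximizing the Beta-Binomial log-likelihood,
# i.e. minimizing f - g as in (1). A sketch only: starting point, bounds, and
# stopping tolerance are arbitrary choices.
import numpy as np
from scipy.special import betaln, digamma
from scipy.optimize import minimize

y = np.array([10, 30, 50, 70, 90, 110, 130, 150, 170, 190], dtype=float)
m = np.full_like(y, 1000.0)
K = len(y)

def f(x):                      # convex: K log B(alpha, beta)
    a, b = x
    return K * betaln(a, b)

def grad_f(x):
    a, b = x
    return K * np.array([digamma(a) - digamma(a + b),
                         digamma(b) - digamma(a + b)])

def grad_g(x):                 # gradient of the convex part entering with a minus sign
    a, b = x
    return np.array([np.sum(digamma(a + y) - digamma(a + b + m)),
                     np.sum(digamma(b + m - y) - digamma(a + b + m))])

x = np.array([2.0, 18.0])      # initial guess (see the moment-matching sketch below)
for _ in range(50):
    grad_g0 = grad_g(x)        # linearize g around the current iterate; the
                               # constant terms of g-hat do not affect the argmin
    res = minimize(lambda z: f(z) - grad_g0 @ z,
                   x,
                   jac=lambda z: grad_f(z) - grad_g0,
                   bounds=[(1e-3, 1e6)] * 2,
                   method="L-BFGS-B")
    if np.allclose(res.x, x, rtol=1e-8):
        break
    x = res.x

print("Estimated alpha, beta:", x)
```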
This method requires a good initial guess of $\alpha$ and $\beta$ to work well. We could use a Gaussian approximation to the Binomial distribution and fit an Empirical Bayes model. Based on the estimated (Gaussian) prior, we then find a Beta distribution that is a good fit, either by matching the mean and variance, or by sampling data from the Gaussian distribution and fitting a Beta distribution by maximum likelihood.
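One crude way to get such a starting point, in the spirit of the mean-and-variance matching just described, is to match moments of the observed open rates directly, after subtracting an estimate of the binomial sampling variance. A sketch, assuming equal group sizes:

```python
# Rough method-of-moments starting point for (alpha, beta): match the mean and
# variance of the observed open rates to a Beta distribution, after subtracting
# an estimate of the binomial sampling variance. Assumes equal group sizes and
# is intended only as an initial guess, not a final estimate.
import numpy as np

y = np.array([10, 30, 50, 70, 90, 110, 130, 150, 170, 190], dtype=float)
m = 1000.0

rates = y / m
mu = rates.mean()
v = rates.var(ddof=1) - mu * (1 - mu) / m   # crude prior-variance estimate
v = max(v, 1e-8)                            # guard against a negative value

common = mu * (1 - mu) / v - 1              # alpha + beta from Beta moments
alpha0, beta0 = mu * common, (1 - mu) * common
print(f"Initial guess: alpha = {alpha0:.2f}, beta = {beta0:.2f}")
```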
Is this better than local search? I’d like to simulate some data and calculate the Bayes error rate associated with (a) the true $\alpha$ and $\beta$ (which is the best the error rate could be), (b) the Gaussian approximation, (c) local search, and (d) the convex-concave procedure outlined above. It may be the case that although MLE for the Beta-Binomial distribution is not a convex optimization problem, and is therefore subject to non-global local optima, the local search still works just as well.