Testing with Many Variants

This is a Long Drive for Someone with Nothing to Think About

Introduction

We drove to Palm Springs recently for the long weekend. There was a sign reading “Be Alert. Drive sober. Not wrong way.” that I found really confusing. Is it saying that driving sober is not the wrong way, i.e., it’s the right way? Is it saying don’t drive the wrong way down the highway and driving sober is the only alternative? I mean, clearly it’s just trying to remind people not to drive drunk; I just found the wording really strange.

Then I started thinking: does this sign save any lives? Does it really dissuade people from drunk driving? And I started thinking how I would test that. Pick, say, the top N largest cities in the US, select N/2 of them at random and display this sign on major highways; display nothing in the other cities and compare incidents of drunk driving. We’d probably want to normalize based on city size, or do diff-in-diff or something.
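
If I were actually going to run this, the assignment step itself is simple. Here’s a minimal sketch, with a placeholder list standing in for the top N cities:

```python
import numpy as np

# Placeholder list standing in for the top N US cities by population.
cities = [f"city_{i}" for i in range(20)]

rng = np.random.default_rng(seed=0)
shuffled = rng.permutation(cities)
half = len(cities) // 2
treatment = set(shuffled[:half])   # cities that display the sign
control = set(shuffled[half:])     # cities that display nothing
```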

We might not see any effect, but technically that would just demonstrate that this particular slogan was ineffective at reducing drunk driving. Perhaps some other slogan would be better. So maybe we come up with ten to twenty different slogans (hopefully better than the one that inspired this post) and test all of them. We could test them one at a time, but that would take a long time. Or we could test all of them simultaneously, but then we would have reduced statistical power and would be unlikely to detect any but the largest effect sizes. But if we did this and somehow got good power, and none of the slogans really seemed to have much of an impact, we might conclude that these signs in general don’t have much of an impact. (The argument could be made that if they save even one life, it’s worth it. Fine, but I started thinking about similar scenarios with much lower stakes, like testing out different colors for buttons, another situation where I’m skeptical it makes much of a difference.)

When we’re testing lots of different variants, whether of subject lines in an email, settings in a recommendation algorithm, or slogans in an anti-drunk-driving campaign, we run into the multiple comparisons problem. With $n$ variants, we inevitably want to compare each variant to each other variant (perhaps including some control or legacy option), for $n \cdot (n-1)/2$ comparisons. For large $n$ we pay a heavy price in terms of statistical power, even when using more sophisticated approaches to multiple comparisons like Scheffé’s method or the Benjamini-Hochberg procedure.
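
To make that concrete, here’s a minimal sketch of the all-pairs approach, with made-up city-level counts and a plain two-sample t-test standing in for whatever test is actually appropriate; the Benjamini-Hochberg correction is applied across all the pairwise p-values.

```python
from itertools import combinations

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Hypothetical city-level incident counts for each variant.
counts = {
    "control":  np.array([52, 47, 61, 55]),
    "slogan_a": np.array([49, 50, 58, 44]),
    "slogan_b": np.array([46, 41, 57, 48]),
}

# All n * (n - 1) / 2 pairwise comparisons.
pairs = list(combinations(counts, 2))
pvals = [stats.ttest_ind(counts[a], counts[b]).pvalue for a, b in pairs]

# Benjamini-Hochberg correction across the pairwise p-values.
reject, adjusted, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
for (a, b), p_adj, rej in zip(pairs, adjusted, reject):
    print(f"{a} vs {b}: adjusted p = {p_adj:.3f}, reject = {rej}")
```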

One alternative is to cast a wide net and ask whether there is any evidence of a difference between the options. If we (at least temporarily) give up on finding the best slogan and just try to assess whether there is any difference between the slogans (and possibly a control), that could tell us whether there is any point in devoting further resources to identifying the best one.

Here’s what that would look like. We test $K$ variants (including a control with no slogan) in $N$ cities (for simplicity we’ll assume $K$ divides $N$). We assume the number of drunk driving incidents over, say, 6 months follows an over-dispersed Poisson distribution with mean $\beta_i$ corresponding to the slogan variant used, and overdispersion parameter $\phi$. The null hypothesis is that $\beta_1 = \beta_2 = \cdots = \beta_K$. If the null hypothesis is true, that means the choice of slogan, or even using a slogan at all, doesn’t make any difference. Rejecting the null hypothesis would correspond to evidence that at least one of the slogans made a difference to drunk driving incidents. Whichever variant (including the variant where we don’t show a slogan at all) was associated with the fewest drunk driving incidents would perhaps be worthy of further investigation, but our particular choice of null hypothesis would not permit us to state definitively that that variant was the best.
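
Here’s a small simulation sketch of that setup. The parameter values are made up, and I’m using a gamma-Poisson mixture as one convenient way to generate counts with mean $\beta$ and variance $\phi \beta$ (this parameterization assumes $\phi > 1$).

```python
import numpy as np

rng = np.random.default_rng(seed=1)

K = 5     # variants, including the no-slogan control (variant 0)
N = 50    # cities, N / K per variant
beta = np.array([60.0, 58.0, 55.0, 60.0, 59.0])  # hypothetical mean incidents per city
phi = 2.0                                        # overdispersion: variance = phi * mean

variant = np.repeat(np.arange(K), N // K)        # which slogan each city sees
mu = beta[variant]

# Gamma-Poisson mixture: lambda ~ Gamma(mu / (phi - 1), scale = phi - 1),
# y | lambda ~ Poisson(lambda), so E[y] = mu and Var[y] = phi * mu.
lam = rng.gamma(shape=mu / (phi - 1.0), scale=phi - 1.0)
y = rng.poisson(lam)
```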

Hypothesis Testing

We can use a technique described by Simon Wood in his book Generalized Additive Models to calculate a p-value against this null hypothesis. It’s basically the likelihood ratio test, with an adjustment since we don’t know the dispersion parameter, similar to how the $t$-test generalizes the $z$-test.

First we calculate the maximum likelihood estimates (MLE) of the parameters under the alternative hypothesis, which are simply the observed rates of drunk driving incidents corresponding to each variant. Call these $\hat{\beta}_1$. Next, compute the MLE of the parameters under the null hypothesis (more on this in a bit); call these $\hat{\beta}_0$. Then $$ \Upsilon := \frac{(\ell \ell(\hat{\beta}_1) - \ell \ell(\hat{\beta}_0)) / (K-1)}{\ell \ell(\hat{\beta}_1) / (N-K)} \sim F_{N-K}^{K-1}, $$ where $\ell \ell$ is the log-likelihood for the Poisson distribution. Technically the dispersion parameter plays a role in the log-likelihood, but we are assuming it is a constant, so it cancels out of the numerator and denominator and we can ignore it. Wood describes this as being more reliable than trying to estimate the dispersion parameter and then using the more common likelihood ratio test.

So we could compute the observed value of $\Upsilon$ and compare it to an upper tail value for the $F$ distribution to calculate a p-value and assess statistical significance. How do we calculate $\hat{\beta}_0$? By solving: $$ \begin{array}{cl} \mbox{maximize} & \ell \ell(\beta) \\\ \mbox{subject to} & \beta_i \in \mathbf{dom} \; \ell \ell, \; i=1, \ldots, K \\\ & \beta_1 = \beta_2 = \cdots = \beta_K. \end{array} $$ Since the Poisson likelihood is log-concave, this is a convex optimization problem. In many cases, this problem can be solved by inspection; for example $\hat{\beta}_0$ could simply be the overall rate of drunk driving.
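
Here’s a minimal sketch of the test on the simulated data above. Note that $\hat{\beta}_0$ really can be read off by inspection here: it’s just the pooled rate. I’ve written the statistic in the deviance form Wood uses (the deviance difference is twice the log-likelihood difference, and the residual deviance of the larger model, divided by $N-K$, plays the role of the denominator); `any_difference_test` is just my own helper name.

```python
import numpy as np
from scipy import stats

def poisson_ll(y, mu):
    """Poisson log-likelihood of counts y at means mu."""
    return stats.poisson.logpmf(y, mu).sum()

def any_difference_test(y, variant, K):
    """F-test of beta_1 = ... = beta_K, written in terms of Poisson deviances."""
    N = len(y)
    # MLE under the alternative: the observed mean count for each variant.
    beta_hat1 = np.array([y[variant == k].mean() for k in range(K)])
    # MLE under the null (by inspection): the pooled mean count.
    beta_hat0 = np.full(K, y.mean())

    ll_sat = poisson_ll(y, y)  # saturated model: fitted mean equals each observation
    D1 = 2.0 * (ll_sat - poisson_ll(y, beta_hat1[variant]))
    D0 = 2.0 * (ll_sat - poisson_ll(y, beta_hat0[variant]))

    upsilon = ((D0 - D1) / (K - 1)) / (D1 / (N - K))
    p_value = stats.f.sf(upsilon, K - 1, N - K)
    return upsilon, p_value

# Using the simulated y, variant, and K from the sketch above:
# upsilon, p = any_difference_test(y, variant, K)
```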

Opportunity Sizing

Suppose we reject the null hypothesis. That tells us at least one of the $\beta$’s is different from the others. It doesn’t tell us which variant is best. That is a natural question, but answering it would require correcting for the pairwise comparisons between variants, with the concomitant loss in statistical power that drove us to the approach above.

Of course we can look at whichever $\beta$ is observed to be the best (lowest in this case, since lower is better) and compare it to the others, and especially to the $\beta$ corresponding to no slogan at all. That would give us some indication of how big a difference the right slogan makes. If that difference is small, we are in the situation where we have statistical significance but not practical significance, as Ron Kohavi describes it in his book Trustworthy Online Controlled Experiments. (Even a small difference means lives saved, but in other situations, a small improvement may not be so important.) If the difference is large, it’s clear there is a real opportunity to explore further, perhaps doing a subsequent test focusing on just the observed-best slogans to definitively find the best one.

There is always the possibility the random assignment has exaggerated the opportunity size. So we would like some conservative estimate to be sure. If $\beta_1$ is the parameter corresponding to no slogan, we might be interested in the quantity $$ f(\beta) = \max\{\beta_1 - \beta_2, \beta_1 - \beta_3, \ldots, \beta_1 - \beta_K\}. $$ Since $f$ is a convex function of $\beta$, we can use Scheffé’s method to calculate a lower confidence bound on $f(\beta)$ by solving $$ \begin{array}{cl} \mbox{minimize} & f(\beta) \\\ \mbox{subject to} & \beta_i \in \mathbf{dom} \; \ell \ell, \; i=1, \ldots, K \\\ & \frac{(\ell \ell(\hat{\beta}_1) - \ell \ell(\beta)) / (K-1)}{\ell \ell(\hat{\beta}_1) / (N-K)} \leq F_{\alpha; N-K}^{K-1}, \end{array} $$ which in turn is equivalent to: $$ \begin{array}{cl} \mbox{minimize} & t \\\ \mbox{subject to} & \beta_i \in \mathbf{dom} \; \ell \ell, \; i=1, \ldots, K \\\ & \beta_1 - \beta_j \leq t, \; j=2, \ldots, K \\\ & \frac{(\ell \ell(\hat{\beta}_1) - \ell \ell(\beta)) / (K-1)}{\ell \ell(\hat{\beta}_1) / (N-K)} \leq F_{\alpha; N-K}^{K-1}. \end{array} $$ The value of this problem, $t^\star$, is a conservative estimate of the opportunity size. If $t^\star$ corresponds to a material impact, we can be confident that one of the options is at least this much better than the control.
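
Here’s a sketch of this computation using cvxpy, on the same simulated data as above. Two caveats: the log-likelihood is written up to its constant term (the constant cancels in the difference), and I’ve again used the residual deviance of the alternative model as the denominator of the $F$ constraint, consistent with the test sketch earlier. The `df_num` argument is there so the same helper can also produce the less conservative bound discussed below.

```python
import cvxpy as cp
import numpy as np
from scipy import stats

def lower_bound(y, variant, K, alpha=0.05, df_num=None):
    """Lower confidence bound on f(beta) = max_j (beta_1 - beta_j), variant 0 = control.

    df_num = K - 1 (the default) gives the Scheffe-style bound; df_num = 1 gives the
    non-simultaneous bound discussed in the next section.
    """
    N = len(y)
    df_num = K - 1 if df_num is None else df_num

    total = np.array([y[variant == k].sum() for k in range(K)])  # counts per variant
    m = np.array([(variant == k).sum() for k in range(K)])       # cities per variant

    # Poisson log-likelihood, dropping the log(y!) term (it cancels in differences).
    def ll(b):
        return cp.sum(cp.multiply(total, cp.log(b))) - cp.sum(cp.multiply(m, b))

    beta_hat1 = total / m                                        # MLE under the alternative
    ll_hat1 = float(total @ np.log(beta_hat1) - m @ beta_hat1)
    mu1 = beta_hat1[variant]
    # Residual deviance of the alternative model (saturated model fits mu = y).
    D1 = 2.0 * float(np.sum(y * np.log(np.where(y > 0, y / mu1, 1.0)) - (y - mu1)))

    f_crit = stats.f.ppf(1 - alpha, df_num, N - K)
    beta, t = cp.Variable(K, pos=True), cp.Variable()
    constraints = [
        beta[0] - beta[1:] <= t,  # epigraph form of f(beta) <= t
        # Stay inside the F-based confidence region around beta_hat1.
        ll(beta) >= ll_hat1 - df_num * f_crit * D1 / (2 * (N - K)),
    ]
    cp.Problem(cp.Minimize(t), constraints).solve()
    return t.value
```

On the simulated data, `lower_bound(y, variant, K)` returns the conservative estimate $t^\star$.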

(It might seem like an upper bound is also worth examining. Unfortunately, computing such an upper bound is NP-Hard, corresponding to maximizing a convex function over a convex region. The observed value $f(\hat{\beta}_1)$ is a more practical upper bound on the opportunity size. If this is small, there really is no reason to continue exploring.)

A Less Conservative Lower Bound

It may be the case that $t^\star < 0$, even when we reject the null hypothesis of no effect, since Scheffé’s method can be overly conservative. It provides simultaneous intervals for however many functions we want, and we really only care about the one. We can compute a non-simultaneous interval as follows.

Consider calculating a p-value against the null hypothesis that $f(\beta) = d$. To do so, we would need to solve: $$ \begin{array}{cl} \mbox{maximize} & \ell \ell(\beta) \\\ \mbox{subject to} & \beta_i \in \mathbf{dom} \; \ell \ell, \; i=1, \ldots, K \\\ & f(\beta) = d. \end{array} $$ Since the feasible region is not convex, solving this problem is NP-Hard. But consider a convex relaxation: $$ \begin{array}{cl} \mbox{maximize} & \ell \ell(\beta) \\\ \mbox{subject to} & \beta_i \in \mathbf{dom} \; \ell \ell, \; i=1, \ldots, K \\\ & f(\beta) \leq d, \end{array} $$ which is equivalent to: $$ \begin{array}{cl} \mbox{maximize} & \ell \ell(\beta) \\\ \mbox{subject to} & \beta_i \in \mathbf{dom} \; \ell \ell, \; i=1, \ldots, K \\\ & \beta_1 - \beta_j \leq d, \; j=2,\ldots,K \end{array} $$ If the solution $\beta^\star$ satisfies $f(\beta^\star) = d$, then clearly it is also optimal for the original problem. Suppose $f(\hat{\beta}_1) > d$; that is, the MLE of the parameters under the alternative hypothesis are infeasible under $H_0$. Then we will argue that the constraint is active, that there must be at least one $j$ with $\beta_1^\star - \beta_j^\star = d$, and thus $f(\beta^\star) = d$.

Consider the optimality conditions: $$ \begin{align*} \beta_1^\star - \beta_j^\star - d &\leq 0, \; j=2,\ldots,K \\\ \lambda_j^\star &\geq 0, \; j=2,\ldots,K \\\ \lambda_j^\star \cdot (\beta_1^\star - \beta_j^\star - d) &= 0, \; j=2,\ldots,K \\\ \nabla \ell \ell(\beta^\star) - \sum_{j=2}^K \lambda_j^\star \nabla(\beta_1 - \beta_j - d) &= 0. \end{align*} $$

Suppose none of the constraints are active, so that $\beta_1^\star - \beta_j^\star - d < 0$ for all $j$. Complementary slackness then requires that $\lambda_j^\star = 0$ for all $j$, and the last optimality condition reduces to $\nabla \ell \ell (\beta^\star) = 0$. The solution to this equation is $\hat{\beta}_1$, which is infeasible by assumption, while $\beta^\star$ must be feasible, a contradiction. So at least one $\lambda_j^\star$ must be non-zero, at least one constraint must be active, and thus $f(\beta^\star) = d$.

Thus, provided we are interested in null hypotheses where the MLE under the alternative hypothesis is infeasible, we can efficiently compute the MLE under $H_0$ and use the $F$ test to compute a p-value. Writing $\hat{\beta}_0$ for the MLE under this null, the test statistic is: $$ \Upsilon_d := \frac{\ell \ell(\hat{\beta}_1) - \ell \ell(\hat{\beta}_0)}{\ell \ell(\hat{\beta}_1) / (N-K)} \sim F_{N-K}^{1}, $$ since the null hypothesis only imposes a single constraint. The notation is intended to make it clear that the test statistic is indexed by $d$. A lower bound on $f(\beta)$ is the smallest $d$ for which the test statistic is not in the rejection region of the test. That is, the lower bound $d^\star$ is the solution of: $$ \begin{array}{cl} \mbox{minimize} & d \\\ \mbox{subject to} & \Upsilon_d \leq F_{\alpha; N-K}^1, \end{array} $$ which is equivalent to $$ \begin{array}{cl} \mbox{minimize} & d \\\ \mbox{subject to} & \beta_i \in \mathbf{dom} \; \ell \ell, \; i=1,\ldots,K \\\ & \beta_1 - \beta_j \leq d, \; j=2,\ldots,K \\\ & \frac{\ell \ell(\hat{\beta}_1) - \ell \ell(\beta)}{\ell \ell(\hat{\beta}_1) / (N-K)} \leq F_{\alpha; N-K}^1. \end{array} $$ Comparing to the last problem in the previous section, this bound will be tighter if $(K-1) F_{\alpha; N-K}^{K-1} > F_{\alpha; N-K}^1$, which seems like it would always be the case.
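
In code, this only changes the critical value and drops the $(K-1)$ multiplier, so the `lower_bound` helper sketched in the previous section covers it; here’s the usage, plus a quick check of the tightness condition just mentioned.

```python
from scipy import stats

# Non-simultaneous bound: a single constraint, so 1 numerator degree of freedom.
# Uses y, variant, K, and lower_bound from the sketches above.
# d_star = lower_bound(y, variant, K, alpha=0.05, df_num=1)

# Check the tightness condition: (K - 1) * F^{K-1}_{alpha; N-K} vs F^1_{alpha; N-K}.
K, N, alpha = 5, 50, 0.05
lhs = (K - 1) * stats.f.ppf(1 - alpha, K - 1, N - K)
rhs = stats.f.ppf(1 - alpha, 1, N - K)
print(lhs, rhs)  # lhs comfortably exceeds rhs for these values
```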

Conclusions

When facing myriad variations, it can be quite challenging to identify the best option because any available sample size gets split into many small groups. This effect is then exacerbated by having to correct for the multiple comparisons corresponding to comparing each variant to each other variant.

It can be helpful first to assess whether there is evidence of any difference at all between the variants. We can do this with a smaller sample size than would be required to identify the best variant. The observed differences between variants provide some clue as to how much opportunity there is in further exploration. A conservative estimate of the opportunity size can be calculated efficiently, and guards against the scenario in which random assignment exaggerates the opportunity.

P.S. The subtitle of this post is the name of an album by my all-time favorite band, Modest Mouse.

Bob Wilson
Data Scientist

The views expressed on this blog are Bob’s alone and do not necessarily reflect the positions of current or previous employers.