Contingency Tables Part IV: The Score Test
(This is the fourth post in our series on contingency table-based A/B tests. Read the first post here.)
In our last post, we showed how to compute the Maximum Likelihood Estimates (MLE) of the model parameters under the null and alternative hypotheses. These allow us to calculate a p-value against the null hypothesis, and more importantly a confidence interval on the treatment effects consistent with the data.
There are three common techniques for hypothesis testing based on the maximum likelihood estimates: the Likelihood Ratio Test, the Wald Test, and the Score Test (also known as the Lagrange Multiplier Test).
The likelihood ratio test compares the values of the likelihood function at the MLEs under the null and alternative hypotheses. If the likelihood associated with the alternative is meaningfully higher than that associated with the null hypothesis, that constitutes evidence against the null.
The Wald test compares the parameter values under the null and alternative hypotheses. If they are meaningfully far apart, that constitutes evidence against the null.
The score test considers the slope of the likelihood function at the parameter value associated with the null hypothesis. If the null hypothesis were true, we would expect the MLEs to be close to the maximum of the likelihood function, where the slope is zero. If the slope is large, that constitutes evidence against the null. In his book, “Categorical Data Analysis”, Alan Agresti reports that the score test works better than the other methods for medium sample sizes, even though all three tests are asymptotically equivalent.
The Test Statistic
Recall that the contingency table we are analyzing looks like this:
Experiment Group | Successes | Failures | Trials |
---|---|---|---|
$A$ | $s_A$ | $f_A$ | $n_A$ |
$B$ | $s_B$ | $f_B$ | $n_B$ |
TOTAL | $s$ | $f$ | $n$ |
Like in the last post, we assume
- $s_A \; | \; n_A \sim \textrm{Binom}(n_A, \pi_A)$,
- $s_B \; | \; n_B \sim \textrm{Binom}(n_B, \pi_B)$, and
- $s_A$ and $s_B$ are independent given $n_A$ and $n_B$.
Further, let $\hat{\pi}_A$ and $\hat{\pi}_B$ be the maximum likelihood estimates of $\pi_A$ and $\pi_B$ under the null hypothesis. Recall that these are calculated by solving a convex optimization problem and need not have a closed-form expression.
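As a concrete illustration, here is a minimal sketch of that optimization for the relative-lift null $\pi_B = \pi_A (1 + d)$, maximizing the constrained log-likelihood numerically with scipy. The function name and its arguments are hypothetical; this is a sketch of the idea, not the calculator's implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def null_mles_relative_lift(s_A, n_A, s_B, n_B, d):
    """MLEs of (pi_A, pi_B) under the null pi_B = pi_A * (1 + d), with d > -1.

    Maximizes the constrained log-likelihood over pi_A numerically.
    """
    # Keep both pi_A and pi_B = pi_A * (1 + d) strictly inside (0, 1).
    upper = min(1.0, 1.0 / (1.0 + d))

    def neg_loglik(pi_A):
        pi_B = pi_A * (1.0 + d)
        return -(
            s_A * np.log(pi_A) + (n_A - s_A) * np.log(1.0 - pi_A)
            + s_B * np.log(pi_B) + (n_B - s_B) * np.log(1.0 - pi_B)
        )

    res = minimize_scalar(neg_loglik, bounds=(1e-9, upper - 1e-9), method="bounded")
    return res.x, res.x * (1.0 + d)
```

Setting $d = 0$ recovers the equality null, in which case the optimizer lands (up to numerical tolerance) on the pooled success rate $s / n$.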
Since the contingency table consists of two independent binomial distributions, the likelihood for the entire table is simply the product of terms corresponding to the rows: $$ \begin{align} L(\vec{\pi}; \vec{s}, \vec{n}) &= {n_A \choose s_A} \cdot \pi_A^{s_A} \cdot (1 - \pi_A)^{n_A-s_A} \\ &\phantom{=} \cdot {n_B \choose s_B} \cdot \pi_B^{s_B} \cdot (1 - \pi_B)^{n_B-s_B}. \end{align} $$
The score, $u$, is the derivative of the log-likelihood, evaluated at $\hat{\pi}_A$ and $\hat{\pi}_B$. This is: $$ u = \left(\begin{array}{c} \frac{s_A}{\hat{\pi}_A} - \frac{n_A - s_A}{1 - \hat{\pi}_A} \\ \frac{s_B}{\hat{\pi}_B} - \frac{n_B - s_B}{1 - \hat{\pi}_B} \end{array}\right) = \left(\begin{array}{c} \frac{s_A - n_A \hat{\pi}_A}{\hat{\pi}_A (1 - \hat{\pi}_A)} \\ \frac{s_B - n_B \hat{\pi}_B}{\hat{\pi}_B (1 - \hat{\pi}_B)} \end{array}\right). $$
The Fisher information is the negative expected value of the matrix of second derivatives of the log-likelihood function, evaluated at $\hat{\pi}_A$ and $\hat{\pi}_B$. That’s quite a mouthful, so let’s take that one step at a time. The matrix of second derivatives is: $$ \left(\begin{array}{cc} -\frac{s_A}{\pi_A^2} - \frac{n_A - s_A}{(1 - \pi_A)^2} & 0 \\ 0 & -\frac{s_B}{\pi_B^2} - \frac{n_B - s_B}{(1 - \pi_B)^2} \end{array}\right). $$
In this expression, $s_A$ and $s_B$ are considered random, with $s_A \; | \; n_A \sim \textrm{Binom}(n_A, \pi_A)$, so $E[s_A] = n_A \pi_A$ and $E[s_B] = n_B \pi_B$. Therefore, the negative expected value of the matrix of second derivatives is $$ \left(\begin{array}{cc} \frac{n_A \pi_A}{\pi_A^2} + \frac{n_A - n_A \pi_A}{(1 - \pi_A)^2} & 0 \\ 0 & \frac{n_B \pi_B}{\pi_B^2} + \frac{n_B - n_B \pi_B}{(1 - \pi_B)^2} \end{array}\right) = \left(\begin{array}{cc} \frac{n_A}{\pi_A (1 - \pi_A)} & 0 \\ 0 & \frac{n_B}{\pi_B (1 - \pi_B)} \end{array}\right). $$
Finally, the Fisher information, $I$, is this quantity evaluated at $\hat{\pi}_A$ and $\hat{\pi}_B$: $$ \left(\begin{array}{cc} \frac{n_A}{\hat{\pi}_A (1 - \hat{\pi}_A)} & 0 \\ 0 & \frac{n_B}{\hat{\pi}_B (1 - \hat{\pi}_B)} \end{array}\right). $$
The test statistic, $\Upsilon$, of the score test is $$ \begin{align} \Upsilon &= u^T I^{-1} u \\ &= \frac{(s_A - n_A \hat{\pi}_A)^2}{n_A \hat{\pi}_A (1 - \hat{\pi}_A)} + \frac{(s_B - n_B \hat{\pi}_B)^2}{n_B \hat{\pi}_B (1 - \hat{\pi}_B)}. \end{align} $$
It can be shown (see “Testing Statistical Hypotheses” by Erich Lehmann and Joseph Romano) that this test statistic is equal to Pearson’s $\chi^2$ test statistic and has a $\chi_1^2$ distribution under the null hypothesis. We can use this fact to determine that $1 - \Phi_{\chi_1^2}(\Upsilon)$ is a p-value against the null hypothesis, where $\Phi_{\chi_1^2}$ is the cumulative distribution function.
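Putting the pieces together, a minimal sketch of the statistic and p-value calculation might look like the following; the function name is hypothetical, and the null MLEs are assumed to come from the constrained optimization above.

```python
from scipy.stats import chi2

def score_test_pvalue(s_A, n_A, s_B, n_B, pi_A_hat, pi_B_hat):
    """Score (Pearson chi-square) statistic and p-value, given the observed
    counts and the MLEs computed under the null hypothesis."""
    upsilon = (
        (s_A - n_A * pi_A_hat) ** 2 / (n_A * pi_A_hat * (1 - pi_A_hat))
        + (s_B - n_B * pi_B_hat) ** 2 / (n_B * pi_B_hat * (1 - pi_B_hat))
    )
    p_value = chi2.sf(upsilon, df=1)  # 1 - CDF of the chi-square with 1 df
    return upsilon, p_value
```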
Confidence Intervals
Recall that our null hypothesis can come in multiple flavors. We have considered null hypotheses like:
- Equality: $\pi_B = \pi_A$
- Relative lift: $\pi_B = \pi_A \cdot (1 + d)$
- Absolute lift: $\pi_B = \pi_A + d$
Equality is clearly a special case of either the relative or absolute lift with $d=0$. In any given situation, we need to decide whether it makes more sense to think in relative or absolute terms. Typically I think it makes more sense to say something like “B is 10% better than A”, but I have definitely encountered situations where the absolute lift was appropriate.
Either way, a confidence interval is defined relative to the p-value calculated in the previous section. When the p-value is less than some threshold $\alpha$, we say the result is statistically significant $(\alpha = 0.05$ is conventional but arbitrary). The p-value we calculate depends on the flavor of the null hypothesis (relative or absolute lift) and the quantity $d$. A $100(1-\alpha)\%$ confidence interval on the lift is the set of values $d$ such that we do not reject the null hypothesis using a threshold $\alpha$.
When we reject a null hypothesis, we are saying that it is incompatible with the data; when we do not reject it, we are saying that it is plausibly consistent with the data. So a confidence interval is a range of possibilities that are consistent with the data.
To put this mathematically, suppose $\Upsilon(d)$ is the test statistic calculated using the null hypothesis corresponding to $d$. And for definiteness, let’s assume we’re working with the relative lift, but this would be just as true for the absolute lift. Then a 95% confidence interval on the relative lift is the set $\{d : 1 - \Phi_{\chi_1^2}(\Upsilon(d)) \geq 0.05 \}$.
This region is an interval, so there are three possible kinds of values for $d$:
- There are values of $d$ that are too small to be plausible
- There are values of $d$ that are too large to be plausible
- There are values that are just right.
A confidence interval is typically expressed in terms of its endpoints: the two values $d$ such that $1 - \Phi_{\chi_1^2}(\Upsilon(d)) = \alpha$. These points can be found via bisection. First, find a point $Q_i$ within the confidence interval and a point $Q_o$ outside of the interval (say, on the bigger side). The observed lift is always within the interval, because it is completely compatible with the data, and a suitably large lift is outside of the interval. Then check the p-value associated with the midpoint: if it is greater than $\alpha$, that midpoint is part of the interval and replaces $Q_i$; otherwise it replaces $Q_o$. Repeat until $Q_o$ and $Q_i$ are suitably close together. Repeat for the other side.
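Here is a minimal sketch of that bisection for one endpoint. It assumes a hypothetical helper `p_value_at(d)` that computes the score-test p-value for the null hypothesis corresponding to a lift of $d$.

```python
def ci_endpoint(p_value_at, d_inside, d_outside, alpha=0.05, tol=1e-6):
    """Bisection for one endpoint of the confidence interval.

    d_inside:  a lift known to be inside the interval (e.g. the observed lift).
    d_outside: a lift known to be outside the interval (its p-value is below alpha).
    """
    while abs(d_outside - d_inside) > tol:
        mid = 0.5 * (d_inside + d_outside)
        if p_value_at(mid) > alpha:
            d_inside = mid   # mid is still compatible with the data
        else:
            d_outside = mid  # mid is already rejected
    return 0.5 * (d_inside + d_outside)
```

The same function finds the lower endpoint when `d_outside` is chosen below the observed lift.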
The Power of the Score Test
Statistical power is the probability of rejecting the null hypothesis given a particular alternative hypothesis. The idea here is that if the null hypothesis is in fact false, we would like to reject it with high probability. But if the null hypothesis is close to true, it might be very difficult to reject, requiring larger sample sizes.
The hypotheses we have considered so far are composite hypotheses. For example, we might have a null hypothesis that $B$ is 10% better than $A$, but there are an infinite number of success rates having this relationship. While we can calculate p-values and confidence intervals for composite hypotheses, power is only defined with respect to simple hypotheses of the form $\pi_A = p_A$ and $\pi_B = p_B$. Instead of expressing a relationship between $\pi_A$ and $\pi_B$, we have to specify precisely what they are. This is much less useful, since if we knew these values, we wouldn’t need to run the test in the first place.
Put another way, power for composite hypotheses is ambiguous, changing depending on the actual parameter values. Suppose our null hypothesis is that $\pi_A = p_A$ and $\pi_B = p_B$, and that our alternative hypothesis is that $\pi_A = p_A^\prime$ and $\pi_B = p_B^\prime$. Then under the alternative hypothesis, the test statistic $\Upsilon$ has a noncentral $\chi_{1,\lambda}^2$ distribution, with noncentrality parameter $$ \lambda = \frac{n_A (p_A - p_A^\prime)^2}{p_A (1 - p_A)} + \frac{n_B (p_B - p_B^\prime)^2}{p_B (1 - p_B)}. \quad (1) $$ (See “Testing Statistical Hypotheses” by Erich Lehmann and Joseph Romano.) We reject the null hypothesis if $\Upsilon \geq \Upsilon_{\textrm{crit}} = \Phi_{\chi_1^2}^{-1}(1-\alpha)$, which happens with probability $1 - \Phi_{\chi_{1, \lambda}^2}(\Upsilon_{\textrm{crit}})$. This is the power of the test.
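A minimal sketch of this power calculation, using scipy's noncentral chi-square distribution; the function and argument names are hypothetical.

```python
from scipy.stats import chi2, ncx2

def score_test_power(n_A, n_B, p_A, p_B, p_A_alt, p_B_alt, alpha=0.05):
    """Power of the score test: (p_A, p_B) is the simple null,
    (p_A_alt, p_B_alt) the simple alternative, per Equation (1)."""
    lam = (
        n_A * (p_A - p_A_alt) ** 2 / (p_A * (1 - p_A))
        + n_B * (p_B - p_B_alt) ** 2 / (p_B * (1 - p_B))
    )
    upsilon_crit = chi2.ppf(1 - alpha, df=1)    # rejection threshold
    return ncx2.sf(upsilon_crit, df=1, nc=lam)  # P(reject | alternative)
```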
Through the noncentrality parameter $\lambda$, the power depends on the sample sizes ($n_A$ and $n_B$), the null hypothesis ($p_A$ and $p_B$), and the alternative hypothesis ($p_A^\prime$ and $p_B^\prime$). A larger value of $\lambda$ leads to higher power. We get a larger value of $\lambda$ if the sample sizes are larger, or if the null and alternative hypotheses are more different. The sample sizes are what we, the experimenters, control, so we can increase them to get the power as high as we want (80% power is standard but arbitrary).
As we mentioned, the power is only defined for simple hypotheses (e.g. $\pi_A = 0.10$ and $\pi_B = 0.12$), but we often frame business objectives in terms of composite hypotheses (e.g. $\pi_B$ is 20% higher than $\pi_A$). Thus we have to translate composite hypotheses into simple hypotheses somehow. For example, suppose that the legacy experience is $A$ and $B$ is a new experience we’ve never tried before. Based on historic data, we think that $\pi_A$ is around 10%. In order to justify switching to $B$, we would need the new experience to be at least 20% better than the legacy experience (technically this is a one-sided hypothesis, but we’ll skip that nuance for now). Then our null hypothesis is that $\pi_B = \pi_A$ and our alternative hypothesis is that $\pi_B$ is 20% higher than $\pi_A$.
An intuitive but flawed way of translating this into simple null and alternative hypotheses is to set $p_A = p_B = p_A^\prime = 10\%$, and $p_B^\prime = 12\%$. The idea here is that we really do think the success rate associated with the legacy experience is 10%, based on historic data. The success rate associated with the new experience is unknown: we hope it’s at least 12% but it might still be 10%. Examining the formula for $\lambda$, we see the problem with this approach: since $p_A = p_A^\prime$, the first term is zero, and it doesn’t matter how large a sample size we give to the legacy experience. Whether it’s zero or infinity, the power is the same. This doesn’t make any practical sense, so this isn’t a great way of translating our composite hypotheses into the simple hypotheses required.
So instead let’s start with the alternative hypothesis that $p_A^\prime = 10\%$ and $p_B^\prime = 12\%$. If this were true, and we were testing the composite null hypothesis that $\pi_A = \pi_B$, then the maximum likelihood estimates of the success rates under that null would land somewhere between 10% and 12%. (If the sample sizes were equal, we would wind up with $p_A = p_B = 11\%$.) Plugging these values into Equation (1) then gives something reasonable.
To make this intuition a little more general, we start off with a composite alternative hypothesis like “the success rate associated with option B is 20% higher than for A”. Then we translate this into a simple alternative hypothesis, by setting $p_A^\prime$ based on historic data, and then setting $p_B^\prime$ to be 20% higher.
Next, based on given sample sizes $n_{A/B}$, we calculate the expected number of successes in each group: $s_{A/B} = p_{A/B}^\prime \cdot n_{A/B}$. Then we calculate $p_{A/B}$ as the maximum likelihood estimates corresponding to this scenario. That is, as the solution to: $$ \begin{array}{cl} \mbox{maximize} & \ell\ell(\vec{p}; \vec{s}, \vec{n}) \\ \mbox{subject to} & H_0, \end{array} $$ where $H_0$ could represent the null hypothesis of equality, or of any particular lift. The solution to this optimization problem gives $p_{A/B}$, which can be plugged into Equation (1) alongside $p_{A/B}^\prime$ and $n_{A/B}$ to calculate the noncentrality parameter $\lambda$ and thus the power of the test against the alternative.
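For the null hypothesis of equal success rates, the constrained MLE at the expected counts is just the pooled success rate, so the whole procedure reduces to a few lines. This is a hedged sketch that reuses the hypothetical `score_test_power` function from above; for relative- or absolute-lift nulls, the numerical optimization sketched earlier would take the place of the pooled estimate.

```python
def power_for_equality_null(n_A, n_B, p_A_alt, p_B_alt, alpha=0.05):
    """Power against the simple alternative (p_A_alt, p_B_alt) when testing
    the composite null pi_A = pi_B, via the MLE-based translation above."""
    # Expected successes under the alternative hypothesis.
    s_A = n_A * p_A_alt
    s_B = n_B * p_B_alt
    # MLE under the equality null: the pooled success rate of the expected counts.
    p_pooled = (s_A + s_B) / (n_A + n_B)
    # Plug the null and alternative rates into Equation (1).
    return score_test_power(n_A, n_B, p_pooled, p_pooled, p_A_alt, p_B_alt, alpha)
```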
We emphasize again that power is ambiguous for composite hypotheses, but our approach seems to work well in simulation studies. It is also flexible: it works with both composite null and alternative hypotheses (although the very first step involves applying domain expertise to translate the composite alternative into a simple alternative), and it works for null hypotheses of equal success rates or of nonzero relative or absolute lifts. This is the approach currently implemented in my A/B testing calculator. Most other calculators only support calculating power for the null hypothesis of equal success rates! The original implementation used the naive approach, so if you have been using my calculator for a while and see different power calculations than you used to, this is why!