# Sprinkle some Maximum Likelihood Estimation on that Contingency Table!

*(This is the third post in our series on contingency table-based A/B tests.
Read the first post here.)*

In our last post, we showed that the contingency table used to summarize the results in an A/B test is associated (at least approximately) with the binomial distribution.

Experiment Group | Successes | Failures | Trials |
---|---|---|---|

$A$ | $s_A$ | $f_A$ | $n_A$ |

$B$ | $s_B$ | $f_B$ | $n_B$ |

TOTAL | $s$ | $f$ | $n$ |

For the contingency table above, we assume

- $s_A \; | \; n_A \sim \textrm{Binom}(n_A, \pi_A)$,
- $s_B \; | \: n_B \sim \textrm{Binom}(n_B, \pi_B)$, and
- $s_A$ and $s_B$ are independent given $n_A$ and $n_B$.

These statements are only approximately true, as discussed in the previous post, but when the sample sizes are large it is a good approximation that allows us to calculate p-values and confidence intervals very quickly. (Note also that we are conditioning on the sample sizes in each group, $n_A$ and $n_B$. In a completely randomized test, these are fixed by design, but in a Bernoulli trial, these are random, but not of interest to the experimenter. The binomial proportions $\pi_A$ and $\pi_B$ are considered fixed but unknown.)

This post is concerned with estimation of $\pi_A$ and $\pi_B$, both in general and in the context of a particular null hypothesis. We will use Maximum Likelihood Estimation (MLE). We will see that MLE results in asymptotically unbiased, consistent estimators of $\pi_A$ and $\pi_B$. These estimates can then be used to calculate p-values.

As the name suggests, we will need the likelihood function for the contingency table. The likelihood function is simply the probability mass function, intepreted as a function of $\pi$. We will let $\pi$ refer to the true (unknown) value, and use $p$ as a placeholder for candidate estimates. For the binomial distribution, the likelihood function is:

$$ L(p; s, n) = {n \choose s} \cdot p^s \cdot (1 - p)^{n-s}. $$

Since the contingency table consists of two independent binomial distributions, the likelihood for the entire table is simply the product of terms corresponding to the rows: $$ \begin{align} L(\vec{p}; \vec{s}, \vec{n}) &= {n_A \choose s_A} \cdot p_A^{s_A} \cdot (1 - p_A)^{n_A-s_A} \\ &\phantom{=} \cdot {n_B \choose s_B} \cdot p_B^{s_B} \cdot (1 - p_B)^{n_B-s_B}. \end{align} $$

Without any kind of hypothesis on the values of $\pi_A$ and $\pi_B$, the Maximum Likelihood Estimates (MLE; note the “E” in “MLE” stands either for “estimate” or “estimation” as context should make clear) are whatever values $p_A$ and $p_B$ that maximize $L$.

### Brief Review of Optimization Terminology

Mathematical optimization is concerned with problems of the form:

$$ \begin{array}{cl} \mbox{minimize} & f_0(x) \\\ \mbox{subject to} & f_i(x) \leq 0 \\\ & h_j(x) = 0, \end{array} $$

where $f_0$ is called the *objective function*, $f_i$ the *inequality
constraint functions* and $h_j$ the *equality constraint functions*. The
quantity or quantities $x$ are called the *optimization variables*. If $x$
satisfies the constraints it is said to be *feasible*. Whatever optimization
variable results in the objective being minimized over all feasible options, is
called a *solution*. (Solutions are not necessarily unique.) If $x^\star$ is a
solution to the problem, then $f_0(x^\star)$ is called the *value* of the
problem.

If $f( \epsilon \cdot x + (1 - \epsilon) \cdot y) \leq \epsilon \cdot f(x) + (1 - \epsilon) \cdot f(y)$
for all $x$ and $y$ in the domain of $f$ and for all $0 \leq \epsilon \leq 1$,
then $f$ is said to be *convex*. If equality holds, then $f$ is said to be
*linear*. If the objective function and inequality constraints are all convex
functions, and the equality constraints are all linear functions, then the
problem is said to be a *convex optimization problem*.

Many optimization problems encountered in practice can be expressed as convex optimization problems, and this is important because efficient algorithms exist for solving convex optimization problems, whereas optimization in general is NP Hard. Boyd and Vandenberghe is a great reference on Convex Optimization. The MLE problems we discuss here are equivalent to convex optimization problems.

### Back to Maximum Likelihood Estimation

Maximizing $L$ has the same solution as maximizing $\log L$ (the log-likelihood). The log-likelihood is: $$ \begin{align} & \log {n_A \choose s_A} + \log {n_B \choose s_B} \\ & + s_A \log p_A + (n_A - s_A) \log (1 - p_A) \\ & + s_B \log p_B + (n_B - s_B) \log (1 - p_B) \end{align} $$

The terms on the first line do not involve the optimization variables $p_A$ and $p_B$. Maximizing $f_0(x)$ has the same solution as maximizing $f_0(x) - c$, with $c$ some constant, so we can just ignore those constant terms on the first line. We therefore define: $$ \begin{align} \ell \ell (\vec{p}; \vec{s}, \vec{n}) &{:=} s_A \log p_A + (n_A - s_A) \log (1 - p_A) \\ &\phantom{:=} + s_B \log p_B + (n_B - s_B) \log (1 - p_B). \end{align} $$

Notably, $\ell \ell$ is not quite the log-likelihood because we got rid of the constant terms. Nevertheless, maximizing $\ell \ell$ has the same solution as maximizing the likelihood. Moreover, minimizing $-\ell \ell$ has the same solution as maximizing $\ell \ell$, and it can be shown that $-\ell \ell$ is a convex function of $p_A$ and $p_B$.

The way that we minimize a function (in the absence of any constraints) simply involves taking the derivative with respect to each optimization variable and setting them equal to zero. $$ \begin{align} \frac{\partial (-\ell \ell)}{\partial p_A} &= -\frac{s_A}{p_A} + \frac{n_A - s_A}{1 - p_A} = 0 \\ \frac{\partial (-\ell \ell)}{\partial p_B} &= -\frac{s_B}{p_B} + \frac{n_B - s_B}{1 - p_B} = 0 \end{align} $$ These two equations are uncoupled: the first involves only $p_A$ and the second only $p_B$. It is straightforward to solve these equations, giving $$ \begin{align} p_A^\star &= \frac{s_A}{n_A}, \\ p_B^\star &= \frac{s_B}{n_B}. \end{align} $$ These are the maximum likelihood estimates of $\pi_A$ and $\pi_B$ without any kind of constraints or hypotheses in place. (Notably, these are simply the observed success rates in the two groups, which are the obvious estimates. It is often the case that MLE has an intuitive interpretation.)

### Specifying a Null Hypothesis

We are typically interested in exploring certain relationships between $\pi_A$ and $\pi_B$. For example, we might want to understand whether $\pi_A$ and $\pi_B$ are plausibly equal. This null hypothesis can be incorporated into the MLE as follows: $$ \begin{array}{cl} \mbox{minimize} & -\ell \ell(\vec{p}; \vec{s}, \vec{n}) \\\ \mbox{subject to} & p_B - p_A = 0. \end{array} $$ This is an optimization problem with convex objective, no inequality constraints, and a single linear equality constraint, so this is still a convex optimization problem. In fact, it’s such a simple optimization problem that it can be solved by hand: $p_A^\star = p_B^\star = s / n$, where $s = s_A + s_B$ and $n = n_A + n_B$. (The derivation is left as an exercise to the reader.)

In full generality, we might be interested in null hypotheses of the form $A(\vec{p} + a) = 0$, where $A \in \mathbb{R}^{1 \times 2}$ and $a \in \mathbb{R}^{2 \times 1}$. When $a = [0 \;\; 0]^T$ and $A = [-1 \;\; 1]$, this is equivalent to $p_B - p_A = 0$. We might be interested in null hypotheses of the form $p_B - p_A = d$, corresponding to $a = [d \;\; 0]^T$ and $A = [-1 \;\; 1]$. This would be a hypothesis regarding the absolute difference in success rates between the two groups. We might be more interested in the relative lift: $p_B = p_A \cdot (1 + d)$, corresponding to $a = [0 \;\; 0]^T$ and $A = [-(1 + d) \;\; 1]$. In all of these cases, this is a linear function of $\vec{p}$ so the problem is still convex.

I have focused here on *modeling* the problem as being a convex optimization
problem, since once we have done so, computers can efficiently solve the
problem. There isn’t always a closed form solution!

For example, here is some code for solving the MLE when using a null hypothesis regarding the relative lift:

```
import cvxpy as cp
na, nb = 1000, 1000
sa, sb = 100, 110
null_lift = 0.10 # pb is 10% higher than pa
p = cp.Variable(2)
objective = cp.Maximize(
sa * cp.log(p[0]) + (na - sa) * cp.log(1 - p[0])
+ sb * cp.log(p[1]) + (nb - sb) * cp.log(1 - p[1])
)
constraints = [0 <= p, p <= 1, p[1] == p[0] * (1 + null_lift)]
prob = cp.Problem(objective, constraints)
prob.solve()
print(p.value)
```

This code clearly maps to the problem we are trying to solve, and is efficient. Nevertheless, I specialized in convex optimization in my MS program and dag nabbit! I don’t get enough opportunities to do convex optimization, so I have implemented more efficient solutions, using the above code to verify my implementation. But pride aside, really there is no reason to do so.

### Maximum Likelihood Estimates are Consistent

In future posts, we will rely on the fact that maximum likelihood estimates are consistent estimators of the parameters $\pi_A$ and $\pi_B$ under the null hypothesis. Meaning, $A (\vec{\pi} + a)$ really does equal zero. It suffices to show that $\vec{\pi}$ satisfies the optimality conditions in the limit of infinite sample size.

The optimality conditions are based on the Lagrangian function: $$ \mathcal{L}(\vec{p}, \nu) = - \ell \ell(\vec{p}; \vec{s}, \vec{n}) + \nu \cdot A(\vec{p} + a), $$ where $\nu$ is the Lagrange multiplier associated with the equality constraint. The optimality conditions say that $(\vec{p}^\star, \nu^\star)$ are the solution if: $$ \begin{align} A(\vec{p}^\star + a) &= 0 \\ \left. \frac{\partial \mathcal{L}}{\partial \vec{p}} \right|_{(\vec{p}^\star, \nu^\star)} &= 0. \end{align} $$

We will show that $(\vec{\pi}, 0)$ satisfies these conditions. The first condition is satisfied automatically since we have already assumed the null hypothesis is correct. Then all that remains is to show that the derivative of $-\ell \ell$ is zero (the second term doesn’t matter because we are claiming $\nu^\star = 0$). As derived above, these components are:

$$ \begin{align} -\frac{s_A}{\pi_A} + \frac{n_A - s_A}{1 - \pi_A} &= \frac{n_A \pi_A - s_A}{\pi_A ( 1 - \pi_A)}, \textrm{and} \\ -\frac{s_B}{\pi_B} + \frac{n_B - s_B}{1 - \pi_B} &= \frac{n_B \pi_B - s_B}{\pi_B ( 1 - \pi_B)}. \end{align} $$ Recall that $s_A \; | \; n_A \sim \textrm{Binom}(n_A, \pi_A)$, so in the limit $n_A \to \infty$, $s_A \to n_A \pi_A$, and similarly for $\pi_B$.

*Like this post? Check out the next in the series here.*