# Contingency Tables Part II: The Binomial Distribution

In our last post, we introduced the potential outcomes framework as the foundation for causal inference. In this framework, each unit (e.g. each person) is represented by a pair of outcomes, corresponding to the result of the experience provided to them (treatment or control, A or B, etc.).

Person | Control Outcome | Treatment Outcome |
---|---|---|
Alice | No Purchase | Purchase |
Brian | Purchase | Purchase |
Charlotte | Purchase | No Purchase |
David | No Purchase | No Purchase |

Table 1: Potential Outcomes

For example, in the table above, we see that Alice’s potential outcomes are:
Purchase if exposed to Treatment and No Purchase if exposed to Control. Causal
inference is fundamentally a comparison of potential outcomes: we say that the
Treatment *causes* Alice to make the purchase, because if she were to be
exposed to Control, she would not purchase. Of course, we have to make a
decision: we either expose Alice to Control or to Treatment, but we cannot do
both. If we expose Alice to Treatment, we would say the Treatment caused Alice
to purchase; if to Control, we would say the Control caused her not to
purchase.

Contrast this with Brian, whose potential outcomes are Purchase if exposed to Treatment and Purchase if exposed to Control. Since both of Brian’s potential outcomes are the same, we say there is no treatment effect.

Quickly examining the other people in the table, we see that the Treatment
*prevents* Charlotte from purchasing (and the Control causes her to purchase),
and there is no treatment effect for David. The analysis of any kind of A/B
test effectively boils down to figuring out how many people are like Alice; how
many like Brian; and so forth.

The challenge of causal inference is we can only observe what a person does in
response to the experience they receive, not what they would have done had they
received the other experience. If we expose Alice to Treatment, we see that she
purchases, but we can only speculate about what she would have done had she
been exposed to Control. We never *actually* get to see a table like the above!
Instead, we see a table like this one:

Person | Control Outcome | Treatment Outcome |
---|---|---|
Alice | ???? | Purchase |
Brian | ???? | Purchase |
Charlotte | Purchase | ???? |
David | No Purchase | ???? |

Here, Alice and Brian have been selected (perhaps randomly, perhaps not) for
Treatment, and Charlotte and David for Control. We see the corresponding
outcomes, but we do not observe the *counterfactual* outcome.

We can summarize these results in a contingency table like the one below. A contingency table is an effective way of summarizing simple experiments where the outcome is binary (e.g. purchase vs no purchase).

Experiment Group | Successes | Failures | Trials | Success Rate |
---|---|---|---|---|
Control | 1 | 1 | 2 | 50% |
Treatment | 2 | 0 | 2 | 100% |
TOTAL | 3 | 1 | 4 | 75% |

Ignoring how small the numbers are, it certainly looks like the Treatment is
better than Control: it has a 100% success rate! But when we look at the full
set of potential outcomes (which we would never be able to see in real life),
we see there is actually no real difference between Treatment and Control.
There are two people for whom there is no treatment effect; one person for whom
Treatment causes purchase; and one person for whom Control causes purchase. The
*average treatment effect* is zero! The division of the four people into two
groups has created the *illusion* of a treatment effect where really there is
none.
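The bookkeeping above can be sketched in a few lines of Python (the dictionary layout and the 0/1 coding of outcomes are my own conventions for this sketch, not anything from a particular library):

```python
# Outcomes coded 1 for Purchase, 0 for No Purchase.
potential_outcomes = {
    "Alice":     {"control": 0, "treatment": 1},
    "Brian":     {"control": 1, "treatment": 1},
    "Charlotte": {"control": 1, "treatment": 0},
    "David":     {"control": 0, "treatment": 0},
}

# Individual treatment effect: treatment outcome minus control outcome.
effects = {name: po["treatment"] - po["control"]
           for name, po in potential_outcomes.items()}

# Average treatment effect across all four units.
ate = sum(effects.values()) / len(effects)  # 0.0: Alice's +1 and Charlotte's -1 cancel
```

Alice's effect is +1, Charlotte's is -1, and Brian's and David's are 0, so the average is exactly zero even though the observed contingency table suggests otherwise.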

As a result, whenever analyzing an A/B test, we need to ask ourselves whether the data are plausibly consistent with zero treatment effect. The way that we quantify this is with a p-value. When the p-value is close to zero, that constitutes evidence that the treatment effect is not zero. A confidence interval is even more helpful: it tells us a range of treatment effects consistent with the data. Over this and the next few posts, I will teach you how to calculate a p-value and a confidence interval for this type of scenario.

In the last post, we talked about the Stable Unit Treatment Value Assumption
(SUTVA), which states that the potential outcomes of one unit do not depend on
the treatment assignment of any other unit. This assumption is often violated
in social networks, where what one person *experiences* influences what a
different person *does*. But we will assume there is no interference and SUTVA
is valid.

We will also assume that treatment assignment is *individualistic*,
*probabilistic*, and *unconfounded*, following the nomenclature of Imbens and
Rubin. “Individualistic” means units are assigned to treatment or control on
the basis of their own characteristics, not on the characteristics of any other
units. “Probabilistic” means that each unit has non-zero probability of being
assigned Treatment, and non-zero probability of being assigned Control.
“Unconfounded” means the probability of a unit being assigned Treatment (or
Control) does not depend on that unit’s potential outcomes.

One simple mechanism that guarantees these three assumptions is the Bernoulli
Trial, wherein we flip a coin for each unit in turn, and assign Treatment or
Control according to the results. Since the tosses are independent, an
individual’s assignment does not depend on anyone else’s characteristics. The
assignment mechanism is probabilistic since the coin has non-zero probability
of heads, and non-zero probability of tails (note we do not require a *fair*
coin; the probability does not have to be 50/50). Finally, by construction the
assignment mechanism is unconfounded: the probabilities do not depend on the
potential outcomes *or* the characteristics of the unit (actually, we could use
different probabilities based on observed characteristics and we could still
perform valid causal inference, but that would make this needlessly
complicated). Unfortunately, the Bernoulli Trial does *not* guarantee SUTVA,
which must typically be assessed on the basis of domain knowledge.
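A Bernoulli Trial assignment can be sketched as follows (the function name and the 50/50 default are illustrative choices, not a standard API):

```python
import random

def bernoulli_assignment(units, p_treatment=0.5, seed=None):
    """Flip an independent (not necessarily fair) coin for each unit."""
    rng = random.Random(seed)
    return {unit: ("Treatment" if rng.random() < p_treatment else "Control")
            for unit in units}

assignment = bernoulli_assignment(["Alice", "Brian", "Charlotte", "David"],
                                  p_treatment=0.5, seed=42)
```

Because each unit's coin flip consults only `p_treatment`, the assignment is individualistic, probabilistic (as long as `0 < p_treatment < 1`), and unconfounded by construction.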

One disadvantage of the Bernoulli trial is the possibility that *all* units
will be assigned Treatment (or that all units be assigned Control). This is
especially problematic with small sample sizes: it is not unheard of to get
four heads in a row!
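With four units and a fair coin, the chance of this degenerate split is easy to compute:

```python
# Probability that a fair-coin Bernoulli Trial puts all n units
# in the same group: all heads plus all tails.
n, p = 4, 0.5
p_degenerate = p**n + (1 - p)**n  # 0.0625 + 0.0625 = 0.125
```

A 12.5% chance of an experiment with no comparison group at all is far too high to ignore.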

In the Completely Randomized Experiment, we decide in advance how many units will be assigned Treatment, and then select that number of units at random, as if we had written names on slips of paper and drawn them out of a hat. For large sample sizes, there is no meaningful difference between the two designs. In a Bernoulli Trial, the number of units exposed to Treatment is random, but we will consider the analysis conditional on this number since it is not of interest. In what follows, we will assume we are using the Completely Randomized Experiment (even though the Bernoulli Trial is much more common in practice).
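Drawing names from a hat is just sampling without replacement; a minimal sketch, with the function name my own:

```python
import random

def completely_randomized(units, n_treatment, seed=None):
    """Draw exactly n_treatment units for Treatment, without replacement."""
    rng = random.Random(seed)
    treated = set(rng.sample(units, n_treatment))
    return {unit: ("Treatment" if unit in treated else "Control")
            for unit in units}

assignment = completely_randomized(["Alice", "Brian", "Charlotte", "David"],
                                   n_treatment=2, seed=42)
# Exactly two units are assigned Treatment, regardless of the seed.
```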

Next, we will assume the *sharp null hypothesis* of no treatment effect for any
unit. This is a much stronger assumption than that the average treatment effect
is zero. Going back to Table 1, as long as the number of people like Alice is
exactly equal to the number of people like Charlotte, the average treatment
effect is zero. The sharp null hypothesis is that *there are no people* like
Alice and Charlotte; there are only people like Brian and David.

In this case, we can simplify the pair of potential outcomes to a single
outcome, which is the same regardless of treatment assignment. If there are $N$
units total, $K$ of whom have the (potential/actual) outcome “Purchase”, and
$n$ of whom are selected for treatment, then the number of successes
(purchases) in the treatment group has a hypergeometric
distribution with parameters $N$, $K$, and $n$. There is no approximation
or assumption here (beyond what we have already discussed); indeed, Ronald
Fisher built his *exact test* on the hypergeometric distribution specifically
to handle this scenario.
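With `math.comb` (Python 3.8+), the hypergeometric pmf is a one-liner; here is a sketch checked against the small table above:

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(k successes among the n treated units, when K of the N units
    have outcome 'Purchase' under the sharp null)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Our table: N = 4 units, K = 3 purchasers, n = 2 treated.
# P(both treated units purchase) = C(3,2) * C(1,0) / C(4,2) = 3/6
prob = hypergeom_pmf(2, 4, 3, 2)  # 0.5
```

Even under the sharp null, there is a 50% chance the treatment group looks perfect, which is exactly the illusion we saw earlier.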

This observation can be used as the basis of a highly accurate (but computationally intensive) methodology called Fisher’s exact test, or a simulation-based alternative, both of which I have written about before. Unfortunately, the hypergeometric distribution is a little unwieldy; I am unaware of any computationally efficient methodology that uses the hypergeometric distribution directly.
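As a concrete illustration of the idea behind the exact test (a sketch, not the author's or any library's implementation), a one-sided p-value simply sums the upper tail of the hypergeometric pmf:

```python
from math import comb

def exact_upper_p(s_t, N, K, n):
    """One-sided p-value: P(X >= s_t) under the hypergeometric null."""
    pmf = lambda k: comb(K, k) * comb(N - K, n - k) / comb(N, n)
    hi = min(n, K)  # largest feasible number of treated successes
    return sum(pmf(k) for k in range(s_t, hi + 1))

# Our table: observed s_T = 2 out of n = 2 treated, with N = 4, K = 3.
p_value = exact_upper_p(2, 4, 3, 2)  # 0.5 -- no evidence against the null
```

The computational burden comes from the binomial coefficients, which become expensive (and numerically delicate in non-arbitrary-precision languages) as $N$ grows.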

Instead we often approximate the hypergeometric distribution using either the Binomial or Normal distributions. Assuming $p := K/N$ is not close to zero or one, and that the sample sizes are large, the Binomial approximation is pretty good. (If you are concerned about these assumptions, the simulation approach is your best bet. If you have been using a simple t-test all along, hopefully you now know the assumptions you’ve been making.)
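We can sanity-check the Binomial approximation numerically; the population and sample sizes below are arbitrary illustrative values, chosen so the treatment group is much smaller than the population:

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Illustrative sizes: large population, moderate success rate, n << N.
N, K, n = 10_000, 3_000, 100
p = K / N
k = 30  # near the expected count n * p

exact = hypergeom_pmf(k, N, K, n)
approx = binom_pmf(k, n, p)
rel_error = abs(exact - approx) / exact  # small when n << N
```

Intuitively, when we sample only a small fraction of the population, drawing without replacement barely differs from drawing with replacement, which is what the Binomial model describes.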

Experiment Group | Successes | Failures | Trials | Success Rate |
---|---|---|---|---|
Control | $s_C$ | $f_C$ | $n_C$ | $\hat{p}_C$ |
Treatment | $s_T$ | $f_T$ | $n_T$ | $\hat{p}_T$ |
TOTAL | $K$ | $f$ | $N$ | $p$ |

When making the Binomial approximation, the corresponding assumptions are that the number of successes in the Control group, $s_C$, has a $\textrm{Binom}(n_C, p_C)$ distribution; that $s_T \sim \textrm{Binom}(n_T, p_T)$; and that $s_C$ and $s_T$ are independent. (Notably, $p_C$ is different from $\hat{p}_C$; $p_C$ is assumed non-random, but $\hat{p}_C := s_C / n_C$ is a function of $s_C$ and is therefore random.) The independence assumption should give you a little heartburn: under the sharp null hypothesis, $s_T$ is deterministically connected to $s_C$ since $s_C + s_T = K$, with $K$ being a fixed (non-random) number. It’s just an approximation that lets us quickly calculate a p-value (and a confidence interval). If it bothers you, the simulation-based approach works well. But with large sample sizes, it’s a fine approximation. The null hypothesis of no treatment effect becomes the null hypothesis that $p_C = p_T$.
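To see this model in action, here is a small simulation under the null $p_C = p_T = p$ (the sample sizes, $p$, and function name are arbitrary illustrative choices):

```python
import random

def simulate_null_diffs(n_c, n_t, p, reps=2_000, seed=0):
    """Draw s_C ~ Binom(n_C, p) and s_T ~ Binom(n_T, p) independently
    (the approximation described above) and return p_hat_T - p_hat_C
    for each replication."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        s_c = sum(rng.random() < p for _ in range(n_c))
        s_t = sum(rng.random() < p for _ in range(n_t))
        diffs.append(s_t / n_t - s_c / n_c)
    return diffs

diffs = simulate_null_diffs(n_c=500, n_t=500, p=0.1)
mean_diff = sum(diffs) / len(diffs)  # close to zero under the null
```

Under the null, the observed differences $\hat{p}_T - \hat{p}_C$ scatter around zero; quantifying how far into the tails an observed difference falls is precisely what the tests in the next post will do.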

Knowing the (approximate) distribution of the entries of the contingency table is the first step towards calculating p-values. It will also enable us to calculate the sample sizes required to achieve a desired sensitivity. In our next post, we will apply Maximum Likelihood Estimation to estimate $p_C$ and $p_T$ under the conditions of the null and alternative hypotheses. These estimates form the basis of three approaches for calculating p-values and confidence intervals: the Likelihood Ratio Test, the Wald Test, and the Score Test (my preferred option). Subscribe to my newsletter to be alerted when I publish these posts!

*Like this post? Check out the next in the series here.*

## References

Guido W. Imbens and Donald B. Rubin, *Causal Inference for Statistics, Social,
and Biomedical Sciences*. Cambridge University Press, 2015.