Contingency Tables Part II: The Binomial Distribution
In our last post, we introduced the potential outcomes framework, the foundation of causal inference. In this framework, each unit (e.g. each person) is represented by a pair of outcomes, corresponding to the result of the experience provided to them (treatment or control, A or B, etc.).
Person | Control Outcome | Treatment Outcome |
---|---|---|
Alice | No Purchase | Purchase |
Brian | Purchase | Purchase |
Charlotte | Purchase | No Purchase |
David | No Purchase | No Purchase |
Table 1: Potential Outcomes
For example, in the table above, we see that Alice’s potential outcomes are: Purchase if exposed to Treatment and No Purchase if exposed to Control. Causal inference is fundamentally a comparison of potential outcomes: we say that the Treatment causes Alice to make the purchase, because if she were to be exposed to Control, she would not purchase. Of course, we have to make a decision: we either expose Alice to Control or to Treatment, but we cannot do both. If we expose Alice to Treatment, we would say the Treatment caused Alice to purchase; if to Control, we would say the Control caused her not to purchase.
Contrast this with Brian, whose potential outcomes are Purchase if exposed to Treatment and Purchase if exposed to Control. Since both of Brian’s potential outcomes are the same, we say there is no treatment effect.
Quickly examining the other people in the table, we see that the Treatment prevents Charlotte from purchasing (and the Control causes her to purchase), and there is no treatment effect for David. The analysis of any kind of A/B test effectively boils down to figuring out how many people are like Alice; how many like Brian; and so forth.
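To make this concrete, here is a minimal Python sketch of Table 1 and the treatment effects it implies. The data and variable names are purely illustrative, with outcomes coded as 1 = Purchase and 0 = No Purchase.

```python
# Hypothetical potential outcomes matching Table 1 (1 = Purchase, 0 = No Purchase).
potential_outcomes = {
    "Alice":     {"control": 0, "treatment": 1},
    "Brian":     {"control": 1, "treatment": 1},
    "Charlotte": {"control": 1, "treatment": 0},
    "David":     {"control": 0, "treatment": 0},
}

# The individual treatment effect is the difference between the two potential outcomes.
effects = {
    person: outcomes["treatment"] - outcomes["control"]
    for person, outcomes in potential_outcomes.items()
}
print(effects)  # {'Alice': 1, 'Brian': 0, 'Charlotte': -1, 'David': 0}

# The average treatment effect is the mean of the individual effects.
ate = sum(effects.values()) / len(effects)
print(ate)  # 0.0: Alice and Charlotte cancel out; Brian and David contribute nothing
```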
The challenge of causal inference is that we can only observe what a person does in response to the experience they receive, not what they would have done had they received the other experience. If we expose Alice to Treatment, we see that she purchases, but we can only speculate about what she would have done had she been exposed to Control. We never actually get to see a table like the above! Instead, we see a table like this one:
Person | Control Outcome | Treatment Outcome |
---|---|---|
Alice | ???? | Purchase |
Brian | ???? | Purchase |
Charlotte | Purchase | ???? |
David | No Purchase | ???? |
Here, Alice and Brian have been selected (perhaps randomly, perhaps not) for Treatment, and Charlotte and David for Control. We see the corresponding outcomes, but we do not observe the counterfactual outcome.
We can summarize these results in a contingency table like the one below. A contingency table is an effective way of summarizing simple experiments where the outcome is binary (e.g. purchase vs no purchase).
Experiment Group | Successes | Failures | Trials | Success Rate |
---|---|---|---|---|
Control | 1 | 1 | 2 | 50% |
Treatment | 2 | 0 | 2 | 100% |
TOTAL | 3 | 1 | 4 | 75% |
Ignoring how small the numbers are, it certainly looks like the Treatment is better than Control: it has a 100% success rate! But when we look at the full set of potential outcomes (which we would never be able to see in real life), we see there is actually no real difference between Treatment and Control. There are two people for whom there is no treatment effect; one person for whom Treatment causes purchase; and one person for whom Control causes purchase. The average treatment effect is zero! The division of the four people into two groups has created the illusion of a treatment effect where really there is none.
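Here is a similar sketch of how the observed data collapse into the contingency table above: we see only the outcome under the assigned experience, and tally successes, failures, trials, and success rate by group. Again, the data and names are purely illustrative.

```python
# Hypothetical observed data: (assigned group, observed outcome), 1 = Purchase.
observed = {
    "Alice":     ("treatment", 1),
    "Brian":     ("treatment", 1),
    "Charlotte": ("control",   1),
    "David":     ("control",   0),
}

# Tally successes, failures, trials, and success rate for each experiment group.
for arm in ("control", "treatment"):
    outcomes = [y for group, y in observed.values() if group == arm]
    successes, trials = sum(outcomes), len(outcomes)
    print(arm, successes, trials - successes, trials, successes / trials)
# control 1 1 2 0.5
# treatment 2 0 2 1.0
```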
As a result, whenever analyzing an A/B test, we need to ask ourselves whether the data are plausibly consistent with zero treatment effect. The way that we quantify this is with a p-value. When the p-value is close to zero, that constitutes evidence that the treatment effect is not zero. A confidence interval is even more helpful: it tells us a range of treatment effects consistent with the data. Over this and the next few posts, I will teach you how to calculate a p-value and a confidence interval for this type of scenario.
In the last post, we talked about the Stable Unit Treatment Value Assumption (SUTVA), which states that the potential outcomes of one unit do not depend on the treatment assignment of any other unit. This assumption is often violated in social networks, when what one person experiences influences what a different person does. But we will assume there is no interference and SUTVA is valid.
We will also assume that treatment assignment is individualistic, probabilistic, and unconfounded, following the nomenclature of Imbens and Rubin. “Individualistic” means units are assigned to treatment or control on the basis of their own characteristics, not on the characteristics of any other units. “Probabilistic” means that each unit has non-zero probability of being assigned Treatment, and non-zero probability of being assigned Control. “Unconfounded” means the probability of a unit being assigned Treatment (or Control) does not depend on that unit’s potential outcomes.
One simple mechanism that guarantees these three assumptions is the Bernoulli Trial, wherein we flip a coin for each unit in turn, and assign Treatment or Control according to the results. Since the tosses are independent, an individual’s assignment does not depend on anyone else’s characteristics. The assignment mechanism is probabilistic since the coin has non-zero probability of heads, and non-zero probability of tails (note we do not require a fair coin; the probability does not have to be 50/50). Finally, by construction the assignment mechanism is unconfounded: the probabilities do not depend on the potential outcomes or the characteristics of the unit (actually, we could use different probabilities based on observed characteristics and we could still perform valid causal inference, but that would make this needlessly complicated). Unfortunately, the Bernoulli Trial does not guarantee SUTVA, which must typically be assessed on the basis of domain knowledge.
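As a sketch, a Bernoulli Trial is just a sequence of independent coin flips, one per unit. The probability of 0.5 below is only an example; as noted, the coin need not be fair.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seed chosen arbitrarily, for reproducibility

units = ["Alice", "Brian", "Charlotte", "David"]
p_treat = 0.5  # any probability strictly between 0 and 1 keeps the mechanism probabilistic

# Flip an independent (possibly biased) coin for each unit in turn.
heads = rng.random(len(units)) < p_treat
assignment = {u: ("treatment" if h else "control") for u, h in zip(units, heads)}
print(assignment)
```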
One disadvantage of the Bernoulli trial is the possibility that all units will be assigned Treatment (or that all units be assigned Control). This is especially problematic with small sample sizes: it is not unheard of to get four heads in a row!
In the Completely Randomized Test, we decide in advance how many units will be assigned Treatment, and then select the appropriate number at random, as if we had written names on slips of paper and drawn them out of a hat. In a Bernoulli Trial, the number of units exposed to Treatment is random, but we will consider the analysis conditional on this number since it is not of interest; for large sample sizes, there is no meaningful difference between the two designs. In what follows, we will assume we are using the Completely Randomized Test (even though the Bernoulli Trial is much more common in practice).
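Here is the corresponding sketch of a completely randomized assignment, where the number of treated units is fixed in advance and drawn without replacement, like names from a hat. The choice of two treated units out of four is mine, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=7)  # arbitrary seed, for reproducibility

units = ["Alice", "Brian", "Charlotte", "David"]
n_treat = 2  # fixed in advance, unlike in a Bernoulli Trial

# Draw exactly n_treat units for Treatment, without replacement.
treated = set(rng.choice(units, size=n_treat, replace=False))
assignment = {u: ("treatment" if u in treated else "control") for u in units}
print(assignment)

# For contrast, with a fair-coin Bernoulli Trial on four units, the probability
# that everyone lands in the same group is 2 * 0.5**4 = 12.5%.
```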
Next, we will assume the sharp null hypothesis of no treatment effect for any unit. This is a much stronger assumption than merely assuming the average treatment effect is zero. Going back to Table 1, as long as the number of people like Alice is exactly equal to the number of people like Charlotte, the average treatment effect is zero. The sharp null hypothesis is that there are no people like Alice or Charlotte; there are only people like Brian and David.
In this case, we can simplify the pair of potential outcomes to a single outcome, which is the same regardless of treatment assignment. If there are $N$ units total, $K$ of whom have the (potential/actual) outcome “Purchase”, and $n$ of whom are selected for treatment, then the number of successes (purchases) in the treatment group has a hypergeometric distribution with parameters $N$, $K$, and $n$. There is no approximation or assumption here (beyond what we have already discussed); indeed, this is exactly the distribution Ronald Fisher built his exact test around for this scenario.
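For our toy example under the sharp null ($N = 4$, $K = 3$, $n = 2$), the exact distribution is available off the shelf. Here is a sketch using scipy; note that scipy's ordering of the hypergeometric parameters differs from the one above.

```python
from scipy.stats import hypergeom

N, K, n = 4, 3, 2  # total units, total purchasers under the sharp null, treated units

# scipy parameterizes hypergeom(M, n, N) as (population size, successes, draws),
# so our (N, K, n) maps to scipy's (M, n, N).
dist = hypergeom(N, K, n)

# Probability of each possible number of successes in the treatment group:
# k = 0 is impossible (only one non-purchaser to go around); k = 1 and k = 2 each have probability 1/2.
for k in range(n + 1):
    print(k, round(dist.pmf(k), 3))

# Chance of two or more treated purchases under the sharp null:
print(round(dist.sf(1), 3))  # 0.5 -- the "perfect" treatment result above is entirely unremarkable
```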
This observation can be used as the basis of a highly accurate (but computationally intensive) methodology called Fisher’s exact test, or a simulation-based alternative, both of which I have written about before. Unfortunately, the hypergeometric distribution is a little unwieldy; I am unaware of any computationally efficient methodology that uses the hypergeometric distribution directly.
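For reference, here is a sketch of both routes on a larger, made-up contingency table: scipy's implementation of Fisher's exact test, and a simple re-randomization simulation under the sharp null. The counts are invented for illustration, and this is just one way to set up the simulation, not necessarily the implementation from those earlier posts.

```python
import numpy as np
from scipy.stats import fisher_exact

# Made-up contingency table: rows are Control / Treatment, columns are successes / failures.
table = np.array([[40, 160],   # Control:   40 successes out of 200
                  [55, 145]])  # Treatment: 55 successes out of 200

_, p_exact = fisher_exact(table, alternative="two-sided")
print(p_exact)

# Simulation-based alternative: re-randomize the assignment under the sharp null,
# mimicking a completely randomized assignment of 200 of the 400 units to Treatment.
rng = np.random.default_rng(seed=0)
pooled = np.repeat([1, 0, 1, 0], [40, 160, 55, 145])  # all 400 observed outcomes
observed_diff = 55 / 200 - 40 / 200

n_sims = 10_000
diffs = np.empty(n_sims)
for i in range(n_sims):
    perm = rng.permutation(pooled)
    diffs[i] = perm[:200].mean() - perm[200:].mean()

p_sim = np.mean(np.abs(diffs) >= abs(observed_diff))
print(p_sim)  # should be in the same ballpark as the exact p-value
```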
Instead we often approximate the hypergeometric distribution using either the Binomial or Normal distributions. Assuming $p := K/N$ is not close to zero or one, and that the sample sizes are large, the Binomial approximation is pretty good. (If you are concerned about these assumptions, the simulation approach is your best bet. If you have been using a simple t-test all along, hopefully you now know the assumptions you’ve been making.)
Experiment Group | Successes | Failures | Trials | Success Rate |
---|---|---|---|---|
Control | $s_C$ | $f_C$ | $n_C$ | $\hat{p}_C$ |
Treatment | $s_T$ | $f_T$ | $n_T$ | $\hat{p}_T$ |
TOTAL | $K$ | $f$ | $N$ | $p$ |
When making the Binomial approximation, the corresponding assumptions are that the number of successes in the Control group, $s_C$, has a $\textrm{Binom}(n_C, p_C)$ distribution; that $s_T \sim \textrm{Binom}(n_T, p_T)$; and that $s_C$ and $s_T$ are independent. (Notably, $p_C$ is different from $\hat{p}_C$: $p_C$ is assumed non-random, but $\hat{p}_C := s_C / n_C$ is a function of $s_C$ and is therefore random.) The independence assumption should give you a little heartburn: under the sharp null hypothesis, $s_T$ is deterministically connected to $s_C$, since $s_C + s_T = K$ with $K$ a fixed (non-random) number. It’s just an approximation that enables us to quickly calculate a p-value (and a confidence interval). If it bothers you, the simulation-based approach works well. But with large sample sizes, it’s a fine approximation. The null hypothesis of no treatment effect becomes the null hypothesis that $p_C = p_T$.
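To get a feel for the quality of the approximation, here is a sketch (with made-up totals) comparing the exact hypergeometric null distribution of $s_T$ to its Binomial counterpart. It also shows the price of the independence assumption: under the approximate model, $s_C + s_T$ is random rather than fixed at $K$.

```python
import numpy as np
from scipy.stats import binom, hypergeom

# Made-up totals: N units, K total successes, n_T assigned to Treatment.
N, K, n_T = 400, 95, 200
p = K / N

exact = hypergeom(N, K, n_T)  # scipy order: population size, successes, draws
approx = binom(n_T, p)        # Binomial approximation under the null p_C = p_T = p

# Compare the two pmfs near the center of the distribution.
for k in (40, 44, 48, 52, 56):
    print(k, round(exact.pmf(k), 4), round(approx.pmf(k), 4))

# Under the independent-Binomial model, s_C + s_T varies around K instead of equaling it.
rng = np.random.default_rng(seed=1)
s_T = rng.binomial(n_T, p, size=5)
s_C = rng.binomial(N - n_T, p, size=5)
print(s_T + s_C)  # hovers around K = 95, but is no longer exactly 95
```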
Knowing the (approximate) distribution of the entries of the contingency table is the first step towards calculating p-values. It will also enable us to calculate the sample sizes required to achieve a desired sensitivity. In our next post, we will apply Maximum Likelihood Estimation to estimate $p_C$ and $p_T$ under the conditions of the null and alternative hypotheses. These estimates form the basis of three approaches for calculating p-values and confidence intervals: the Likelihood Ratio Test, the Wald Test, and the Score Test (my preferred option). Subscribe to my newsletter to be alerted when I publish these posts!
References
Guido W. Imbens and Donald B. Rubin, Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.