Fisher's Exact Test
A/B Testing Series
- Random Sampling
- Statistical Significance
- Fisher's Exact Test
- Counterfactuals and Causal Reasoning
- Statistical Power
- Confidence Intervals
Introduction
In the last post, we explored simulation as a way of computing p-values and assessing statistical significance. Once we have decided on a Null Hypothesis and a definition of outcome extremity, we repeatedly simulate the sampling procedure under the conditions of the Null Hypothesis, keeping track of the fraction of simulations leading to an outcome at least as extreme as what was actually observed. That fraction is the p-value, which quantifies how incompatible the observed result is with the Null Hypothesis. The smaller the p-value, the more inclined we are to reject the Null Hypothesis.
The method, although intuitive, may seem ad hoc. In this post, we will consider an alternative approach that justifies simulation as a reliable procedure. The method, devised by Ronald Fisher (considered by many to be the father of modern statistics), gives exact results where applicable. It is therefore appropriately named Fisher’s Exact Test.
Let’s continue with our email example. We have a thousand email recipients. We randomly chose five hundred of them to receive one version of a subject line (subject line A); the remaining five hundred received a second version (subject line B). Out of the five hundred people who received subject line A, 100 opened the email. Out of the five hundred people who received subject line B, 105 opened the email. At face value, it appears that subject line B is slightly better than subject line A since more people opened that email. It is possible, however, that the subject line is a red herring. Across both groups, 205 people opened emails altogether; perhaps those 205 people would have opened whatever email they received. It just so happened that 105 of those people wound up in the second group because of the way we assigned subject lines to people.
When it concerns human motivations, it can be a little hard to accept that we aren’t influencing behavior, so let’s consider an analogous situation involving marbles. Imagine we have one thousand marbles in an urn. 205 of the marbles are blue; the remaining 795 are red. The blue marbles correspond to people who love opening emails; the red marbles, to people who hate opening emails. We pick 500 marbles out of the urn, and 100 of them are blue. The remaining 105 blue marbles are, of course, still in the urn. The fact that there are more blue marbles in the urn than in our sample isn’t due to anything other than random chance. Similarly, under the Null Hypothesis, we assume the only reason more people opened the email in the second group is random chance.
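To make the null model concrete, here is a minimal sketch of this marble-drawing experiment (the use of numpy and the variable names are our own; the simulation code in the last post may differ):

import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible

# One thousand marbles: 205 blue (email openers, coded 1) and 795 red (coded 0).
marbles = np.array([1] * 205 + [0] * 795)

# Draw 500 marbles without replacement, mimicking the random assignment
# of subject lines to recipients under the Null Hypothesis.
sample = rng.choice(marbles, size=500, replace=False)
print(sample.sum(), "blue marbles drawn;", 205 - sample.sum(), "left in the urn")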
We will explore Fisher’s Exact Test in the context of the marble example, but the results apply identically to the motivating email example. We need to consider the possible outcomes and assess their likelihood. When we pick 500 marbles out of the urn, it is possible that all of them are red; in that case, all of the blue marbles are still in the urn. It is also possible that all 205 blue marbles are removed from the urn; in that case, there would be zero blue marbles remaining in the urn. These are the most extreme outcomes possible, and they seem pretty unlikely. (To be clear, if we pick 500 marbles out of the urn, it is not possible that they would all be blue, since there weren’t 500 blue marbles in the urn to begin with.)
The hypergeometric distribution allows us to quantify the likelihood of any particular outcome. The probability of picking 500 marbles from the urn, all of them red, is about one in $10^{73}$. The probability of picking 500 marbles, 205 of them blue, is (perhaps unsurprisingly) just as unlikely. These probabilities are easily calculated in Python using a function from scipy.stats:
>>> from scipy.stats import hypergeom
>>> M = 1000  # total number of marbles in the urn
>>> n = 205   # number of blue marbles
>>> N = 500   # number of marbles drawn
>>> hypergeom.pmf(0, M, n, N)    # probability of drawing zero blue marbles
5.547423571847365e-74
>>> hypergeom.pmf(205, M, n, N)  # probability of drawing all 205 blue marbles
5.547423571847365e-74
>>> hypergeom.pmf(100, M, n, N)  # probability of drawing exactly 100 blue marbles
0.05782914223328722
The last line shows the probability of drawing exactly 100 blue marbles, which is what we actually observed. Now that we can compute the probability of any outcome, we can determine the probability of an outcome at least as extreme as what was observed. Drawing zero blue marbles is definitely “at least as extreme” as drawing 100 blue marbles; so is drawing 205 blue marbles. Any outcome in which the difference in blue marble counts between the group removed from the urn and the group left in the urn is greater than or equal to the observed difference ($105 - 100 = 5$) qualifies as “at least as extreme”. Drawing 99 blue marbles (leaving 106 in the urn) qualifies, as does drawing 105 blue marbles (leaving 100 in the urn). Drawing 102 blue marbles (leaving 103 in the urn) is less extreme.
Fisher’s Exact Test adds up the probabilities of each outcome at least as extreme as what was observed to get the exact p-value. This is easily accomplished in Python with the code below.
>>> pval = 0.
>>> for i in range(0, 101):  # outcomes with 100 or fewer blue marbles drawn
...     pval += hypergeom.pmf(i, M, n, N)
...
>>> for i in range(105, 206):  # outcomes with 105 or more blue marbles drawn
...     pval += hypergeom.pmf(i, M, n, N)
...
>>> print(pval)
0.754088021026
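The loop is not strictly necessary. Distributions in scipy.stats also expose cumulative functions, where cdf(k) gives $P(X \leq k)$ and sf(k) gives $P(X > k)$, so the same p-value can be computed in one line:

>>> # P(X <= 100) + P(X >= 105); note sf(104) = P(X > 104) = P(X >= 105)
>>> hypergeom.cdf(100, M, n, N) + hypergeom.sf(104, M, n, N)
0.754088021026...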
Our simulation results from the last post match the exact p-value computed above pretty closely, and that is no coincidence. In fact, the simulation approach described in the last post is an approximation to Fisher’s Exact Test. Instead of adding up the exact probabilities, the simulations draw randomly from the sampling distribution, giving a result that converges to the correct value as the number of simulations grows to infinity. This allows us to state precisely how well the simulation approach works. Perhaps more importantly, it allows us to determine how many simulations are required to guarantee a desired accuracy.
Why the Simulation Approach Works
Suppose we are repeatedly simulating the sampling mechanism under the conditions of the Null Hypothesis. Let $p^{(i)}$ equal $1$ if the $i$th simulation has an outcome at least as extreme as what was actually observed, and $0$ otherwise. With $B$ simulations, the p-value is estimated as $\hat{p} = \frac{1}{B} \sum_{i=1}^B p^{(i)}$. Since each simulation is random, $p^{(i)}$ is a random variable; $\hat{p}$ is also random since it is a sum of random variables. By design, the simulations are independent and identically distributed. By the definition of the p-value, the probability that $p^{(i)} = 1$ in any simulation is equal to the p-value itself. Each $p^{(i)}$ is thus a Bernoulli random variable, with $\operatorname{E}[p^{(i)}] = p$ and $\operatorname{Var}(p^{(i)}) = p\cdot(1-p)$, where $p$ is the (true) p-value.
According to the Law of Large Numbers, $\hat{p}$ converges almost surely to the true p-value as the number of simulations tends to infinity. We can also see this by applying the formulae for the expected value and variance of sums of independent random variables, giving $\operatorname{E}[ \hat{p} ] = p$ and $\operatorname{Var}(\hat{p}) = \frac{p \cdot (1-p)}{B}$. In words, the simulated p-value is an unbiased estimate of the true p-value, and as the number of simulations increases, its variance decreases to 0.
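To see this concretely, here is a sketch of the estimator for our marble example, using numpy’s hypergeometric sampler to play the role of the simulated experiment (the variable names and the choice of $B$ are illustrative):

import numpy as np

rng = np.random.default_rng(0)
B = 100_000  # number of simulations

# Each simulation draws 500 marbles from an urn with 205 blue and 795 red,
# recording the number of blue marbles in the sample.
draws = rng.hypergeometric(ngood=205, nbad=795, nsample=500, size=B)

# p_i = 1 when the outcome is at least as extreme as the observed 100 blue marbles.
extreme = (draws <= 100) | (draws >= 105)

p_hat = extreme.mean()  # the estimate of the p-value
print(p_hat)            # should land close to the exact value, 0.754...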
Moreover, since $\hat{p}$ is the average of independent and identically distributed (IID) random variables, the Central Limit Theorem states that its limiting distribution is Gaussian. Specifically:
$$\sqrt{B} \cdot \frac{\hat{p} - p}{\sqrt{p \cdot (1 - p)}} \xrightarrow[]{d} N(0, 1).$$
Since a standard Gaussian random variable falls between $-1.96$ and $+1.96$ with probability $0.95$, we can use the distributional result above to give the endpoints of a $95\%$ confidence interval on the true p-value: $$\hat{p} \pm z_{0.975} \cdot \sqrt{\frac{p \cdot (1 - p)}{B}},$$
where $z_{0.975} \approx 1.96$ is the $97.5$th percentile of the standard Gaussian. The astute reader will notice that we need the true p-value to compute this interval; in practice, we can plug in the estimate $\hat{p}$ for $p$. There are better approaches to confidence intervals, but the point stands: we can achieve any accuracy we want simply by making the number of simulations large enough.
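As a concrete illustration of that last point: since $p \cdot (1 - p) \leq 1/4$ for any $p$, the half-width of the interval above is at most $1.96 \cdot \sqrt{1 / (4B)}$. To guarantee a half-width of, say, $0.005$, it therefore suffices to take $$B \geq \left( \frac{1.96}{2 \times 0.005} \right)^2 = 38{,}416$$ simulations, no matter the true p-value.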
We now see that the simulation procedure sits on a firm statistical foundation as an approximation to Fisher’s Exact Test. The simulation approach applies to a much wider range of experiments, however: for situations where the sampling procedure is more exotic than can be described by the hypergeometric or binomial distributions, we can still compute (approximate) p-values through simulation.