Confidence Intervals
A/B Testing Series
- Random Sampling
- Statistical Significance
- Fisher's Exact Test
- Counterfactuals and Causal Reasoning
- Statistical Power
- Confidence Intervals
Introduction
In the 21st century, Data Science has emerged as a powerful tool for decision making, often enabling us to predict the consequences of any particular course of action. Data Science builds on statistical techniques developed in the 20th century, yet accomplishes what was not previously possible, through two key advancements. Firstly, data are now being collected with unprecedented scale, reliability, and detail. Secondly, as incredible computational power becomes readily and cheaply available, those data can be analyzed more quickly and more thoroughly than ever before.
Examining the claims made by many analytics companies, it is tempting to believe we live in an age where uncertainty has been eliminated. Yet this is both an exaggeration and an incomplete picture of the goal of Data Science, which might more aptly be described as the management of uncertainty, rather than the elimination thereof. Statistics does indeed allow us to reduce uncertainty, but just as importantly it allows us to quantify residual uncertainty so that we may mitigate its effect on decision making. Through these techniques, we can often make good decisions even when considerable uncertainty remains. In this article, we will consider an important technique illustrating these principles, called a confidence interval.
Confidence Intervals for Treatment Effects
Let’s continue analyzing our email example. We randomly split an audience of $N=1000$ in half. Five hundred people receive version $A$ of a subject line; five hundred receive version $B$. Suppose 100 people who receive subject line $A$ open the email and 130 people who receive subject line $B$ open the email. Taking these numbers at face value, it would appear that subject line $B$ is $30\%$ more successful at getting people to open the email, but from our previous discussion we know that random segmentation can create the appearance of a treatment effect where there is none.
Using the techniques discussed in our article on Statistical Significance, we can compute the p-value, the probability of an outcome at least as extreme as what we observed assuming there is no treatment effect. There is more than one reasonable definition of “at least as extreme”, and different definitions will generally lead to different p-values. For this discussion we will use the absolute difference in open rates, which has the advantage of treating each subject line interchangeably. With this definition, the observed outcome extremity is $|20\% - 26\%| = 6\%$. Assuming there is no treatment effect, 230 people were going to open the email regardless of how we assigned subject lines to recipients. Depending on how we segment these users, we might wind up with 115 openers in each group ($0\%$ outcome extremity), or we might wind up with all 230 in one group ($46\%$ outcome extremity). The probability of observing at least a $6\%$ outcome extremity in this scenario is about $2.9\%$, so the p-value is $0.029$.
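If you would like to verify this number yourself, here is a minimal simulation sketch (not part of the original analysis) of that null hypothesis: 230 of the 1,000 recipients open the email no matter which subject line they receive, and the 500 recipients of subject line $A$ are chosen at random.

import numpy as np

# Simulate the no-treatment-effect null hypothesis: 230 of 1000 recipients open
# the email regardless of subject line, and group A is a random half of the audience.
rng = np.random.default_rng(0)
n_sims = 1_000_000
openers_in_a = rng.hypergeometric(230, 770, 500, size=n_sims)   # openers who land in group A
rate_a = openers_in_a / 500
rate_b = (230 - openers_in_a) / 500
extremity = np.abs(rate_b - rate_a)      # simulated outcome extremities
print(np.mean(extremity >= 0.06))        # fraction at least as extreme as observed; roughly 0.029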
The p-value is a way of quantifying the strength of evidence against the scenario (called the Null Hypothesis) used to compute it. In this case, the Null Hypothesis is that there is no treatment effect. The smaller the p-value, the less consistent the data are with the Null Hypothesis. At some point, we decide the data are too inconsistent with the Null Hypothesis for it to be plausible, and at that point we reject it. A common threshold is $0.05$. Our p-value of $0.029$ is less than this threshold, so we would conclude the data provide statistically significant evidence of some kind of treatment effect. That doesn’t mean we then accept the observed treatment effect as the only remaining possibility! The randomization procedure can still exaggerate or conceal the actual treatment effect. That is, the true impact of subject line $B$ might be higher or lower than the observed $30\%$.
To see how this can happen, consider the causal model we have discussed previously. Suppose our audience of one thousand recipients can be divided into four categories:
- Those who will not open the email regardless of whether they receive subject lines $A$ or $B$.
- Those who will open the email regardless of the subject line they receive.
- Those who will open the email if they receive subject line $A$, but not if they receive subject line $B$.
- Those who will open the email if they receive subject line $B$, but not if they receive subject line $A$.
The behaviors of the recipients in the first two categories are not influenced by the subject line they receive. The response for people in the third category is affected positively by subject line $A$, or, equivalently, negatively by subject line $B$. Likewise, the response for people in the fourth category is affected negatively by subject line $A$ and positively by subject line $B$. If there are more recipients in the fourth category than in the third category, then subject line $B$ has a positive impact on the overall response. Let $Q_i$ be the number of people in category $i$ above. For example, suppose there are 700 people in the first category, 150 people in the second category, 50 people in the third category, and 100 people in the fourth. Then $Q_1 = 700$, $Q_2 = 150$, $Q_3 = 50$, and $Q_4 = 100$. In this example, there are more people positively affected by subject line $B$ than subject line $A$.
If we sent subject line $A$ to everyone, $Q_2 + Q_3 = 200$ people would open the email (but only $Q_3 = 50$ would open it because they received subject line $A$, considering our discussion of what “why” means). If we sent subject line $B$ to everyone, $Q_2 + Q_4 = 250$ people would open the email. In this scenario, the impact of subject line $B$ is a $25\%$ increase in email opens. We call this the population treatment effect; it is the true measure of the relative impact of subject lines $A$ and $B$. In general, the population treatment effect is
$$T_P = \frac{Q_4 - Q_3}{Q_2 + Q_3}.$$
But in order to calculate this impact, we first need to know how many people are in each category above. That requires knowing not only how each person reacts to the subject line they receive, but also how they would have reacted to the subject line they did not receive, which is typically not possible. That is precisely why we use random segmentation to decide who receives which subject line.
Suppose a random sample of $n = 500$ recipients consists of 346 people from the first category, 79 from the second, 31 from the third, and 44 from the fourth. We will refer to these sample demographics as $q_i^A$, so for example $q_1^A = 346$. Suppose we send subject line $A$ to this group, which we will call group $A$. We send subject line $B$ to everyone else (group $B$). How many people open emails in each group? In group $A$, there are $q_2^A + q_3^A = 110$ users who open the email, 31 of whom open it because they received subject line $A$, and 79 who would have opened regardless of the subject line they received.
The people in the second category not in group $A$ are necessarily in group $B$, so there are $Q_2 - q_2^A = 71$ people from the second category who receive subject line $B$. Of course, all of them open the email. Likewise, the $Q_4 - q_4^A = 56$ people in the fourth category who were not in group $A$ are necessarily in group $B$. These 56 people open the email because they received subject line $B$. In total, 127 people from group $B$ open the email, compared to 110 in group $A$, about a $15\%$ increase in email opens relative to group $A$. This is called the sample or observed treatment effect. The observed treatment effect is:
$$T_{\mathcal{O}} = \frac{\frac{Q_2 - q_2^A + Q_4 - q_4^A}{N - n} - \frac{q_2^A + q_3^A}{n}}{\frac{q_2^A + q_3^A}{n}}.$$
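As a quick numerical check of these two formulas (an illustrative snippet, not part of the article's own code below), we can plug in the counts from the example above:

# Population and observed treatment effects for the worked example.
Q = [700, 150, 50, 100]        # category sizes Q_1..Q_4 in the full audience
qA = [346, 79, 31, 44]         # category counts that happened to land in group A
N, n = sum(Q), sum(qA)         # 1000 recipients total, 500 in group A

T_P = (Q[3] - Q[2]) / (Q[1] + Q[2])                     # population treatment effect
rate_a = (qA[1] + qA[2]) / n                            # open rate observed in group A
rate_b = ((Q[1] - qA[1]) + (Q[3] - qA[3])) / (N - n)    # open rate observed in group B
T_O = (rate_b - rate_a) / rate_a                        # observed treatment effect

print(T_P)   # 0.25
print(T_O)   # roughly 0.1545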
The observed treatment effect ($15\%$) is quite a bit lower than the population treatment effect ($25\%$). That’s because, by chance, more of the people from categories 2 and 3 wound up in group $A$ than in group $B$, so the performance of subject line $A$ was inflated. The observed treatment effect is partially influenced by the numbers of people in the different categories, but it is also influenced by the random segmentation. The observed treatment effect is sometimes lower than the population treatment effect, and sometimes higher. But the observed treatment effect always depends partly on the luck of the random split. It is only useful to the extent that it allows us to estimate the population treatment effect, which is unobservable in practice.
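To see how much the observed treatment effect can wander from the population treatment effect, we can repeatedly re-run the random segmentation for the same category sizes. The snippet below is an illustrative simulation (an addition for this discussion, not from the original analysis); the observed effects it produces are centered near the $25\%$ population effect but vary considerably from split to split.

import numpy as np

# Repeatedly re-segment the same audience and record the observed treatment effect.
rng = np.random.default_rng(1)
Q = [700, 150, 50, 100]                      # category sizes Q_1..Q_4
categories = np.repeat(np.arange(1, 5), Q)   # one category label per recipient
observed_effects = []
for _ in range(10_000):
    rng.shuffle(categories)
    group_a, group_b = categories[:500], categories[500:]
    opens_a = np.sum(group_a == 2) + np.sum(group_a == 3)   # would open given subject line A
    opens_b = np.sum(group_b == 2) + np.sum(group_b == 4)   # would open given subject line B
    observed_effects.append((opens_b - opens_a) / opens_a)
print(np.mean(observed_effects))   # close to the 25% population treatment effect
print(np.std(observed_effects))    # but individual splits can land well above or below it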
Specifying the numbers of people in each category, $\{Q_i\}$, is a hypothesis against which a p-value may be calculated for any observed outcome extremity. Concretely, suppose we treat $Q_1=700$, $Q_2=150$, $Q_3=50$, and $Q_4=100$ as a null hypothesis. The observed outcome is 100 email opens in group $A$ and 130 in group $B$, just like before; however, for this null hypothesis a different definition of outcome extremity is warranted. We expect a difference in open rates of about $5\%$, since in this null hypothesis, the open rates are $20\%$ and $25\%$ for subject lines $A$ and $B$, respectively. So the most natural definition of outcome extremity is $|r_B - r_A - 5\%|$, where $r_A$ is the open rate associated with subject line $A$ and similarly for $r_B$. Thus the observed outcome extremity is $|26\% - 20\% - 5\%| = 1\%$. Moreover, because the null hypothesis is different, the way we compute the p-value is a little different. By repeatedly simulating the sampling procedure in the context of this more exotic null hypothesis, we can keep track of the fraction of simulations leading to a result at least as extreme as what was actually observed. This is an estimate of the p-value.
Using the Python code below (or see the accompanying script on GitHub), we compute a p-value of 0.68. Since this is above our threshold of $0.05$, the data are consistent with this null hypothesis. We could repeat this process for any null hypothesis of the form considered, computing p-values for each against the same observed outcome, and deciding whether the data are consistent or inconsistent with each null hypothesis using a particular p-value threshold.
import numpy as np

def multivariate_hypergeometric(Q, n, B=1):
    # Draw B random samples of size n, without replacement, from a population
    # whose category sizes are given by Q. Returns a len(Q)-by-B array of counts.
    # Strategy from https://stackoverflow.com/questions/35734026/numpy-drawing-from-urn
    q = np.zeros((len(Q), B))
    for j in range(B):
        for i in range(len(Q) - 1):
            if sum(q[:, j]) == n:
                break
            # Treat category i as the "good" items and all later categories as "bad".
            q[i, j] = np.random.hypergeometric(Q[i], sum(Q[(i+1):]), n - sum(q[:, j]))
        else:
            # The loop never filled the sample early, so the remainder goes to the last category.
            q[len(Q) - 1, j] = n - sum(q[:, j])
    return q

def stat_sig_multihypergeometric(na, nb, ea, eb, null_hypothesis, B=1000000):
    # Estimate the p-value of the observed outcome (ea opens out of na recipients in
    # group A, eb opens out of nb in group B) against a configuration null hypothesis
    # [Q_1, Q_2, Q_3, Q_4], using B simulated random segmentations.
    N = sum(null_hypothesis)
    # Expected difference in open rates under the null: (Q_2 + Q_4) / N - (Q_2 + Q_3) / N.
    expected = (null_hypothesis[1] + null_hypothesis[3]) / N
    expected -= (null_hypothesis[1] + null_hypothesis[2]) / N
    # Observed outcome extremity: |r_B - r_A - expected difference|.
    oe_obs = np.abs(eb / nb - ea / na - expected)
    # Simulate how many people from each category land in group A.
    q = multivariate_hypergeometric(null_hypothesis, na, B=B)
    # Opens in group A (categories 2 and 3) and in group B (remainder of categories 2 and 4).
    sa = q[1, :] + q[2, :]
    sb = (null_hypothesis[1] - q[1, :]) + (null_hypothesis[3] - q[3, :])
    # Fraction of simulated segmentations at least as extreme as the observed outcome.
    z = np.abs(sb / nb - sa / na - expected)
    pval = np.mean(z >= oe_obs)
    return pval

null_hypothesis = [700, 150, 50, 100]
na = 500   # recipients of subject line A
nb = 500   # recipients of subject line B
ea = 100   # opens in group A
eb = 130   # opens in group B
pval = stat_sig_multihypergeometric(na, nb, ea, eb, null_hypothesis)
print(pval)
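As a sanity check (an illustrative use of the same function, not something shown in the original analysis), we can encode the earlier no-treatment-effect null hypothesis as $Q_1 = 770$, $Q_2 = 230$, $Q_3 = 0$, $Q_4 = 0$: 230 people open regardless of the subject line, and nobody is swayed in either direction. Running the same procedure against the observed data should reproduce a p-value of roughly $0.029$.

# Sanity check: the no-treatment-effect null hypothesis from earlier in the article.
no_effect_null = [770, 230, 0, 0]
pval_no_effect = stat_sig_multihypergeometric(na, nb, ea, eb, no_effect_null, B=100000)
print(pval_no_effect)   # roughly 0.029, matching the earlier calculation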
Different null hypotheses of the form considered above can have the same corresponding population treatment effect. Typically we are more concerned with the treatment effect than with the exact details of the null hypothesis that gives rise to it. For example, we would typically specify the null hypothesis in terms of the population treatment effect, even knowing that there is more than one configuration, $\{Q_i\}$, represented by this null hypothesis. The p-value of any data set against this “treatment effect null hypothesis” is the maximum over all p-values computed for “configuration null hypotheses” having the same treatment effect. In our example, the population treatment effect is $25\%$. Imagine computing p-values for each null hypothesis $\{Q_i\}$ having $\frac{Q_4 - Q_3}{Q_2 + Q_3} = 25\%$. The largest p-value is the one we use for deciding whether the hypothesized treatment effect is consistent with the data.
If all p-values corresponding to a particular treatment effect are below the threshold used for significance, we can rule out that treatment effect as being inconsistent with the observed data. If any p-values are above that threshold, we deem the corresponding treatment effect to be consistent with the data. The set of all treatment effects consistent with the data is called a confidence interval on the true (unknown) treatment effect. When using a p-value threshold of $\alpha$, the resulting confidence interval is said to have coverage $100(1-\alpha)\%$. For example, when using $\alpha=0.05$, we say the confidence interval has coverage $95\%$. We describe this as a $95\%$ confidence interval. The larger the p-value threshold we use for deciding statistical significance, the fewer null hypotheses remain consistent with the data, and the narrower the resulting confidence interval; a smaller threshold (higher coverage) leads to a wider interval.
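To make the test-inversion idea concrete, here is a hypothetical sketch (not part of the original analysis) that reuses stat_sig_multihypergeometric on a handful of candidate null hypotheses. For simplicity it fixes $Q_2 = 150$ and $Q_3 = 50$ and varies only $Q_4$, so it examines a single configuration per treatment effect rather than taking the maximum p-value over all configurations with that effect; a full confidence interval calculation would require the complete scan.

# Hypothetical sketch: test a few candidate treatment effects against the observed data.
# Each configuration keeps Q_2 + Q_3 = 200, so the treatment effect is (Q_4 - Q_3) / 200.
candidates = [
    [720, 150, 50, 80],     # population treatment effect of 15%
    [710, 150, 50, 90],     # 20%
    [700, 150, 50, 100],    # 25%
    [690, 150, 50, 110],    # 30%
    [680, 150, 50, 120],    # 35%
]
for Q in candidates:
    effect = (Q[3] - Q[2]) / (Q[1] + Q[2])
    p = stat_sig_multihypergeometric(na, nb, ea, eb, Q, B=100000)
    verdict = "consistent with the data" if p >= 0.05 else "ruled out"
    print(f"treatment effect {effect:.0%}: p-value {p:.3f} ({verdict})")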
In general, computing confidence intervals requires computing p-values for every set of $\{Q_i\}$ corresponding to a range of treatment effects. I don’t think anyone actually does this; even with modern computers that would take too long! Instead, we use approximations based on assumptions of independence or normal distributions. These give a fast way of calculating intervals that are typically valid provided sample sizes are large enough. These approximations often conceal what we are really trying to calculate: a set of treatment effects consistent with the data! When analyzing any A/B test or other experiment, we need to understand what conclusions are not ruled out by the data, and confidence intervals are one way of doing that.
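As one example of these fast approximations, the sketch below computes a normal-approximation (“Wald”) $95\%$ confidence interval for the absolute difference in open rates. This is a standard shortcut rather than the exact test-inversion procedure described above, and note that it targets the difference in open rates rather than the relative treatment effect we have been discussing.

# Normal-approximation ("Wald") 95% confidence interval for the difference in open rates.
# This is a common shortcut, not the exact test-inversion procedure described above.
ra, rb = ea / na, eb / nb
se = (ra * (1 - ra) / na + rb * (1 - rb) / nb) ** 0.5   # standard error of rb - ra
z = 1.96                                                # normal quantile for 95% coverage
print((rb - ra) - z * se, (rb - ra) + z * se)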