# Confidence Intervals

## A/B Testing Series

- Random Sampling
- Statistical Significance
- Fisher's Exact Test
- Counterfactuals and Causal Reasoning
- Statistical Power
- Confidence Intervals

## Introduction

In the 21st century, Data Science has emerged as a powerful tool for decision making, often enabling us to predict the consequences of any particular course of action. Data Science builds on statistical techniques developed in the 20th century, yet accomplishes what was not previously possible, through two key advancements. Firstly, data are now being collected with unprecedented scale, reliability, and detail. Secondly, as incredible computational power becomes readily and cheaply available, those data can be analyzed more quickly and more thoroughly than ever before.

Examining the claims made by many analytics companies, it is tempting
to believe we live in an age where uncertainty has been
eliminated. Yet this is both an exaggeration and an incomplete picture
of the goal of Data Science, which might more aptly be described as
the *management* of uncertainty, rather than the elimination
thereof. Statistics does indeed allow us to *reduce* uncertainty, but
just as importantly it allows us to *quantify* residual uncertainty so
that we may *mitigate* its effect on decision making. Through these
techniques, we can often make good decisions even when considerable
uncertainty remains. In this article, we will consider an important
technique illustrating these principles, called a *confidence interval*.

# Confidence Intervals for Treatment Effects

Let’s continue analyzing our email example. We randomly split an audience of $N=1000$ in half. Five hundred people receive version $A$ of a subject line; five hundred receive version $B$. Suppose 100 people who receive subject line $A$ open the email and 130 people who receive subject line $B$ open the email. Taking these numbers at face value, it would appear that subject line $B$ is $30\%$ more successful at getting people to open the email, but from our previous discussion we know that random segmentation can create the appearance of a treatment effect where there is none.

Using the techniques discussed in our article on Statistical Significance, we can compute the p-value, the probability of an outcome at least as extreme as what we observed assuming there is no treatment effect. There is more than one reasonable definition of “at least as extreme”, and different definitions will generally lead to different p-values. For this discussion we will use the absolute difference in open rates, which has the advantage of treating each subject line interchangeably. With this definition, the observed outcome extremity is $|20\% - 26\%| = 6\%$. Assuming there is no treatment effect, 230 people were going to open the email regardless of how we assigned subject lines to recipients. Depending on how we segment these users, we might wind up with 115 openers in each group ($0\%$ outcome extremity), or we might wind up with all 230 in one group ($46\%$ outcome extremity). The probability of observing at least a $6\%$ outcome extremity in this scenario is about $2.9\%$, so the p-value is $0.029$.
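
This p-value can be estimated with a short simulation. The sketch below (using NumPy; the seed and simulation count are arbitrary choices of mine, not from the accompanying script) randomly splits the 230 guaranteed openers between the two groups and tallies how often the absolute difference in open rates reaches $6\%$:

```python
import numpy as np

rng = np.random.default_rng(0)
sims = 200_000

# Under the null hypothesis, 230 of the 1000 recipients open the email
# regardless of subject line. Randomly assigning 500 recipients to group A
# puts a hypergeometric number of those 230 openers into group A.
opens_a = rng.hypergeometric(ngood=230, nbad=770, nsample=500, size=sims)
opens_b = 230 - opens_a

# Outcome extremity: absolute difference in open rates between the groups.
extremity = np.abs(opens_a / 500 - opens_b / 500)
p_value = np.mean(extremity >= 0.06)
print(p_value)  # close to 0.029
```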

The p-value is a way of quantifying the strength of evidence against the scenario (called the Null Hypothesis) used to compute it. In this case, the Null Hypothesis is that there is no treatment effect. The smaller the p-value, the less consistent the data are with the Null Hypothesis. At some point, we decide the data are too inconsistent with the Null Hypothesis for it to be plausible, and at that point we reject it. A common threshold is $0.05$. Our p-value of $0.029$ is less than this threshold, so we would conclude the data provide statistically significant evidence of some kind of treatment effect. That doesn’t mean we then accept the observed treatment effect as the only remaining possibility! The randomization procedure can still exaggerate or conceal the actual treatment effect. That is, the true impact of subject line $B$ might be higher or lower than the observed $30\%$.

To see how this can happen, consider the causal model we have discussed previously. Suppose our audience of one thousand recipients can be divided into four categories:

- Those who will not open the email regardless of whether they receive subject lines $A$ or $B$.
- Those who will open the email regardless of the subject line they receive.
- Those who will open the email if they receive subject line $A$, but not if they receive subject line $B$.
- Those who will open the email if they receive subject line $B$, but not if they receive subject line $A$.

The behaviors of the recipients in the first two categories are not influenced by the subject line they receive. The response for people in the third category is affected positively by subject line $A$, or, equivalently, negatively by subject line $B$. Likewise, the response for people in the fourth category is affected negatively by subject line $A$ and positively by subject line $B$. If there are more recipients in the fourth category than in the third category, then subject line $B$ has a positive impact on the overall response. Let $Q_i$ be the number of people in category $i$ above. For example, suppose there are 700 people in the first category, 150 people in the second category, 50 people in the third category, and 100 people in the fourth. Then $Q_1 = 700$, $Q_2 = 150$, $Q_3 = 50$, and $Q_4 = 100$. In this example, there are more people positively affected by subject line $B$ than subject line $A$.

If we sent subject line $A$ to everyone, $Q_2 + Q_3 = 200$ people
would open the email (but only $Q_3 = 50$ would open it *because* they
received subject line $A$, considering our discussion of
what “why” means).
If we sent subject line $B$ to everyone, $Q_2 + Q_4 = 250$ people
would open the email. In this scenario, the impact of subject line $B$
is a $25\%$ increase in email opens. We call this the *population
treatment effect*; it is the true measure of the relative impact of
subject lines $A$ and $B$. In general, the population treatment effect is

$$T_P = \frac{Q_4 - Q_3}{Q_2 + Q_3}.$$
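
For instance, plugging in the category counts from the example above:

```python
Q = {1: 700, 2: 150, 3: 50, 4: 100}  # category counts from the example

# Population treatment effect: the relative lift of subject line B over
# subject line A if each were sent to the entire audience.
T_P = (Q[4] - Q[3]) / (Q[2] + Q[3])
print(T_P)  # 0.25
```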

But in order to calculate this impact, we first need to know how many people are in each category above. We would need to know not only how each person reacts to the subject line they receive, but also how they would have reacted to the subject line they did not receive, which is typically not possible. That is precisely why we use random segmentation to decide who receives which subject line.

Suppose a random sample of $n = 500$ recipients consists of 346 people
from the first category, 79 from the second, 31 from the third, and 44
from the fourth. We will refer to these sample demographics as
$q_i^A$, so for example $q_1^A = 346$. Suppose we send subject line
$A$ to this group, which we will call group $A$. We send subject line
$B$ to everyone else (group $B$). How many people open emails in each
group? In group $A$, there are $q_2^A + q_3^A = 110$ users who open
the email, 31 of whom open it *because* they received subject line
$A$, and 79 who would have opened regardless of the subject line they
received.

The people in the second category not in group $A$ are necessarily in
group $B$, so there are $Q_2 - q_2^A = 71$ people from the second
category who receive subject line $B$. Of course, all of them open the
email. Likewise, the $Q_4 - q_4^A = 56$ people in the fourth category
who were not in group $A$ are necessarily in group $B.$ These 56
people open the email *because* they received subject line $B$. In
total, 127 people from group $B$ open the email, compared to 110 in
group $A$, a $15\%$ increase in email opens relative to group
$A$. This is called the *sample* or *observed treatment effect*. The
observed treatment effect is:

$$T_{\mathcal{O}} = \frac{\frac{Q_2 - q_2^A + Q_4 - q_4^A}{N - n} - \frac{q_2^A + q_3^A}{n}}{\frac{q_2^A + q_3^A}{n}}.$$
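
Plugging the sample counts from the example into this formula reproduces the roughly $15\%$ figure:

```python
N, n = 1000, 500                 # audience size and group A size
Q = {2: 150, 4: 100}             # population counts for categories 2 and 4
qA = {2: 79, 3: 31, 4: 44}       # sample counts in group A

rate_a = (qA[2] + qA[3]) / n                       # open rate in group A: 110/500
rate_b = (Q[2] - qA[2] + Q[4] - qA[4]) / (N - n)   # open rate in group B: 127/500
T_O = (rate_b - rate_a) / rate_a
print(round(T_O, 4))  # 0.1545
```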

The observed treatment effect ($15\%$) is quite a bit lower than the
population treatment effect ($25\%$). That’s because, by chance, more
of the people from categories 2 and 3 wound up in group $A$ than in
group $B$, so the performance of subject line $A$ was inflated. The
observed treatment effect is partially influenced by the numbers of
people in the different categories, but it is also influenced by the
random segmentation. The observed treatment effect is sometimes lower
than the population treatment effect, and sometimes higher. But the
observed treatment effect is always a *coincidence*. It is only useful
to the extent that it allows us to estimate the population treatment
effect, which is unobservable in practice.

Specifying the numbers of people in each category, $\{Q_i\}$, is a
hypothesis against which a p-value may be calculated for any observed
outcome extremity. Concretely, suppose we treat $Q_1=700$, $Q_2=150$,
$Q_3=50$, and $Q_4=100$ as a null hypothesis. The observed outcome is
100 email opens in group $A$ and 130 in group $B$, just like before;
however, for this null hypothesis a different definition of outcome
extremity is warranted. We *expect* a difference in open rates of
about $5\%$, since in this null hypothesis, the open rates are $20\%$
and $25\%$ for subject lines $A$ and $B$, respectively. So the most
natural definition of outcome extremity is $|r_B - r_A - 5\%|$, where
$r_A$ is the open rate associated with subject line $A$ and similarly
for $r_B$. Thus the observed outcome extremity is $|26\% - 20\% - 5\%|
= 1\%$. Moreover, because the null hypothesis is different, the way we
compute the p-value is a little different. By repeatedly simulating
the sampling procedure in the context of this more exotic null
hypothesis, we can keep track of the fraction of simulations leading
to a result at least as extreme as what was actually observed. This is
an estimate of the p-value.

Using the Python code below (or see the accompanying script on GitHub), we compute a p-value of 0.68. Since this is above our threshold of $0.05$, the data are consistent with this null hypothesis. We could repeat this process for any null hypothesis of the form considered, computing p-values for each against the same observed outcome, and deciding whether the data are consistent or inconsistent with each null hypothesis using a particular p-value threshold.

```python
import numpy as np

def multivariate_hypergeometric(Q, n, B=1):
    # Draw B samples of size n from a population whose category counts are Q,
    # filling in one category at a time with univariate hypergeometric draws.
    # Strategy from https://stackoverflow.com/questions/35734026/numpy-drawing-from-urn
    q = np.zeros((len(Q), B))
    for j in range(B):
        for i in range(len(Q) - 1):
            if sum(q[:, j]) == n:
                break
            q[i, j] = np.random.hypergeometric(Q[i], sum(Q[(i + 1):]), int(n - sum(q[:, j])))
        else:
            # The sample was not filled early; the remainder goes to the last category.
            q[len(Q) - 1, j] = n - sum(q[:, j])
    return q

def stat_sig_multihypergeometric(na, nb, ea, eb, null_hypothesis, B=1000000):
    N = sum(null_hypothesis)
    # Expected difference in open rates under the null hypothesis:
    # (Q2 + Q4) / N for subject line B, minus (Q2 + Q3) / N for subject line A.
    expected = (null_hypothesis[1] + null_hypothesis[3]) / N
    expected -= (null_hypothesis[1] + null_hypothesis[2]) / N
    # Observed outcome extremity.
    oe_obs = np.abs(eb / nb - ea / na - expected)
    # Simulate B random segmentations of the audience into a group A of size na.
    q = multivariate_hypergeometric(null_hypothesis, na, B=B)
    sa = q[1, :] + q[2, :]  # opens in group A: categories 2 and 3
    sb = (null_hypothesis[1] - q[1, :]) + (null_hypothesis[3] - q[3, :])  # opens in group B
    # Fraction of simulations at least as extreme as the observed outcome.
    z = np.abs(sb / nb - sa / na - expected)
    pval = np.mean(z >= oe_obs)
    return pval

null_hypothesis = [700, 150, 50, 100]
na = 500
nb = 500
ea = 100
eb = 130
pval = stat_sig_multihypergeometric(na, nb, ea, eb, null_hypothesis)
print(pval)
```

Different null hypotheses of the form considered above can have the same corresponding population treatment effect. Typically we are more concerned with the treatment effect than with the exact details of the null hypothesis that gives rise to it. For example, we would typically specify the null hypothesis in terms of the population treatment effect, even knowing that there is more than one configuration, $\{Q_i\}$, represented by this null hypothesis. The p-value of any data set against this “treatment effect null hypothesis” is the maximum over all p-values computed for “configuration null hypotheses” having the same treatment effect. In our example, the population treatment effect is $25\%$. Imagine computing p-values for each null hypothesis $\{Q_i\}$ having $\frac{Q_4 - Q_3}{Q_2 + Q_3} = 25\%$. The largest p-value is the one we use for deciding whether the hypothesized treatment effect is consistent with the data.
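
To make this concrete, here is a small simulation sketch (the particular configurations, seed, and simulation count are illustrative choices of mine) that computes p-values for three configurations sharing the same $25\%$ population treatment effect and takes the largest. It uses NumPy's built-in multivariate hypergeometric sampler for brevity:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulated_pval(Q, ea=100, eb=130, na=500, nb=500, sims=100_000):
    # Simulated p-value for one configuration null hypothesis {Q_i}.
    N = sum(Q)
    expected = (Q[1] + Q[3]) / N - (Q[1] + Q[2]) / N  # expected rate difference
    obs = abs(eb / nb - ea / na - expected)           # observed outcome extremity
    q = rng.multivariate_hypergeometric(Q, na, size=sims)  # group A category counts
    ra = (q[:, 1] + q[:, 2]) / na                          # simulated open rates, group A
    rb = ((Q[1] - q[:, 1]) + (Q[3] - q[:, 3])) / nb        # simulated open rates, group B
    return np.mean(np.abs(rb - ra - expected) >= obs)

# Three configurations, all satisfying (Q4 - Q3) / (Q2 + Q3) = 25%.
configs = [[750, 200, 0, 50], [700, 150, 50, 100], [650, 100, 100, 150]]
p_max = max(simulated_pval(Q) for Q in configs)
print(p_max)
```

The largest of these p-values is the one we would compare against our threshold; a complete analysis would enumerate every configuration with the hypothesized treatment effect, not just a handful.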

If *all* p-values corresponding to a particular treatment effect are
below the threshold used for significance, we can rule out that
treatment effect as being inconsistent with the observed data. If
*any* p-values are above that threshold, we deem the corresponding
treatment effect to be consistent with the data. The set of all
treatment effects consistent with the data is called a confidence
interval on the true (unknown) treatment effect. When using a p-value
threshold of $\alpha$, the resulting confidence interval is said to
have coverage $100(1-\alpha)\%$. For example, when using
$\alpha=0.05$, we say the confidence interval has coverage $95\%$. We
describe this as a $95\%$ confidence interval. The smaller the p-value
threshold we use for deciding statistical significance, the harder it
is to rule out any null hypothesis, so more of them remain consistent
with the data, leading to wider confidence intervals.

In general, computing confidence intervals requires computing p-values for every set of $\{Q_i\}$ corresponding to a range of treatment effects. I don’t think anyone actually does this; even with modern computers that would take too long! Instead, we use approximations based on assumptions of independence or normal distributions. These give a fast way of calculating intervals that are typically valid provided sample sizes are large enough. These approximations often conceal what we are really trying to calculate: a set of treatment effects consistent with the data! When analyzing any A/B test or other experiment, we need to understand what conclusions are not ruled out by the data, and confidence intervals are one way of doing that.
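
As one example of such an approximation, a standard normal-approximation (Wald) interval for the *difference* in open rates (not the relative lift discussed above) can be computed in a few lines:

```python
import numpy as np

na, nb = 500, 500
ra, rb = 100 / na, 130 / nb        # observed open rates: 20% and 26%

diff = rb - ra                     # observed difference: 6 percentage points
se = np.sqrt(ra * (1 - ra) / na + rb * (1 - rb) / nb)  # standard error of the difference
z = 1.96                           # normal quantile for 95% coverage

lower, upper = diff - z * se, diff + z * se
print(f"95% CI for the difference in open rates: ({lower:.3f}, {upper:.3f})")
# (0.008, 0.112)
```

This interval excludes zero, which agrees with the statistically significant p-value of $0.029$ computed earlier. An exact approach would instead invert the simulation-based hypothesis tests described above, but for sample sizes like these the normal approximation is typically close.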