Statistical Power
A/B Testing Series
- Random Sampling
- Statistical Significance
- Fisher's Exact Test
- Counterfactuals and Causal Reasoning
- Statistical Power
- Confidence Intervals
Introduction
In previous posts, we have discussed how random segmentation allows us to investigate the relationship between a purported cause and an observed effect. The concept of statistical significance allows us to quantify the evidential strength of such a relationship. Even when there really is a causal relationship, a particular experiment may not be able to detect it. We need enough experimental units (i.e. test subjects) to achieve a statistically significant result. The concept of statistical power is one way of quantifying the sensitivity of the experiment, and allows us to compare the sensitivities of experiments having different sample sizes or using different methodologies.
A simple example will show why random segmentation can fail to uncover a true causal relationship when sample sizes are small. We will continue our Counterfactual example involving email subject lines.
| Recipient | $A$ | $B$ |
| --- | --- | --- |
| Alice | x | x |
| Brian | x | |
| Charlotte | | x |
| David | x | x |
| Emily | | x |
| Frank | | x |
| George | | |
| 993 Others | | |
| Totals | 3 | 5 |
The above table shows how different email recipients will respond to two candidate subject lines $A$ and $B$. Recall that, in reality, we could never know how an individual would respond to a subject line they did not receive; we can only observe how they respond to the subject line they do receive. But if we had a crystal ball that let us peer into an alternate reality, we could construct a table like the above. Such a conceptual device allows us to speculate about what it means for an event or agent to cause an outcome. In this case, the presence of an “x” in a particular column indicates that the corresponding recipient will open the email if they receive the corresponding subject line. For example, Brian will open the email if he receives subject line $A,$ but not if he receives $B$. Because of this, we conclude that he opened the email because he received subject line $A$. In contrast, since Alice will open the email regardless of the subject line she receives, the subject line itself does not explain her behavior. Examining the table, we see that the subject line is part of the causal explanation for the behavior of Brian, Charlotte, Emily, and Frank, but for no one else.
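The crystal-ball table is easy to encode directly. Here is a minimal sketch (the dictionary structure is mine, purely illustrative) that recomputes the column totals and picks out the recipients whose behavior the subject line actually explains:

```python
# Counterfactual table: for each recipient, whether they would open
# the email under subject line A and under subject line B.
would_open = {
    "Alice":     (True,  True),
    "Brian":     (True,  False),
    "Charlotte": (False, True),
    "David":     (True,  True),
    "Emily":     (False, True),
    "Frank":     (False, True),
    "George":    (False, False),
}
# The 993 unnamed recipients behave like George: (False, False).

total_A = sum(a for a, b in would_open.values())
total_B = sum(b for a, b in would_open.values())
print(total_A, total_B)  # 3 5

# The subject line is part of the causal explanation exactly for those
# recipients whose behavior differs between the two alternatives.
causally_affected = [name for name, (a, b) in would_open.items() if a != b]
print(causally_affected)  # ['Brian', 'Charlotte', 'Emily', 'Frank']
```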
Without a crystal ball, we can only speculate about an individual’s response to a subject line they did not receive; however, we can use random segmentation to investigate the relative effectiveness of the two subject lines. Splitting the audience randomly in half, we send subject line $A$ to one group, and subject line $B$ to the other. Comparing the email open rates for both groups provides insight into which subject line is better. The following Python code simulates the group assignments. (Statistics is not a spectator sport! Actually open up Python and run the code yourself!)
import numpy as np

N = 1000
Q_Alice = 2      # Opens email regardless
Q_Brian = 1      # Only opens email if receives A
Q_Charlotte = 3  # Only opens email if receives B
Q_George = N - Q_Charlotte - Q_Brian - Q_Alice
n = int(N / 2)

# Strategy from https://stackoverflow.com/questions/35734026/numpy-drawing-from-urn
# Basically, we sample n people from a population of N recipients, Q_George of whom
# are like George, the remainder of which are "not like George". Our sample then
# contains qA_George people like George and (n - qA_George) people not like George.
# Whichever "Georges" are not in the sample are by definition left for the second
# group.
qA_George = np.random.hypergeometric(Q_George, Q_Charlotte + Q_Brian + Q_Alice, n)
qB_George = Q_George - qA_George

# Next we sample (n - qA_George) users from a population of N - Q_George recipients,
# Q_Charlotte of whom are like Charlotte. This gives the number of Charlotte-like
# recipients in our A group.
# If qA_George = n, we'll get a ValueError; that means our A group consists entirely
# of Georges.
try:
    qA_Charlotte = np.random.hypergeometric(Q_Charlotte, Q_Brian + Q_Alice, n - qA_George)
except ValueError:
    qA_Charlotte = 0
qB_Charlotte = Q_Charlotte - qA_Charlotte

# Ditto Brian
try:
    qA_Brian = np.random.hypergeometric(Q_Brian, Q_Alice, n - qA_George - qA_Charlotte)
except ValueError:
    qA_Brian = 0
qB_Brian = Q_Brian - qA_Brian

qA_Alice = n - qA_George - qA_Charlotte - qA_Brian
qB_Alice = Q_Alice - qA_Alice

print(qA_George, qA_Charlotte, qA_Brian, qA_Alice)
print(qB_George, qB_Charlotte, qB_Brian, qB_Alice)

# Number of email opens in group A.
# Any Alices in the group will open the email because they open the
# email no matter what. Additionally, any Brians in the group will
# open the email *because* they receive subject line A.
eo_A = qA_Alice + qA_Brian
# Number of email opens in group B.
eo_B = qB_Alice + qB_Charlotte
print(eo_A, eo_B)
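As an aside, the chained hypergeometric draws above are equivalent to a single draw from a multivariate hypergeometric distribution. If your NumPy is recent enough (1.18+), the `Generator.multivariate_hypergeometric` method handles all four recipient types in one call; a sketch:

```python
import numpy as np

N = 1000
n = N // 2
# Population counts: Georges, Charlottes, Brians, Alices
counts = [994, 3, 1, 2]

rng = np.random.default_rng()
# Counts of each recipient type that land in group A; group B gets the rest.
qA = rng.multivariate_hypergeometric(counts, n)
qB = np.array(counts) - qA
print(qA, qB)
```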
Let’s say that after dividing the groups, Alice, Brian, David, Emily and Frank wind up in group $A,$ while Charlotte and George wind up in group $B$. The unnamed recipients are divided randomly as well, but do not concern us since they, like George, do not open the email regardless of the subject line they receive. Alice, Brian, and David all open the email. (Thanks to our crystal ball, we know that Brian opens the email because he received subject line $A,$ whereas Alice and David open it for some unknown reason. We would never know why any individual opened the email in reality.) In group $B,$ only Charlotte opens the email. The observed open rate in group $A$ is 3/500, while for group $B$ it is 1/500.
The observed result, that subject line $A$ is better than subject line $B,$ is incorrect. The original table shows that, if we sent subject line $B$ to everyone, 5 people would open the email; whereas, if we sent subject line $A$ to everyone, 3 people would open. But because we are dividing the groups randomly, by chance, the only 3 recipients who respond favorably to subject line $A$ all wound up in that group. Of the 5 recipients who respond favorably to subject line $B,$ only 1 wound up in the corresponding group. That is the downside of random segmentation!
Let’s repeat the random segmentation a few more times, to see what sorts of outcomes are likely. It becomes cumbersome to list which individuals are assigned to which groups; instead, we will simply list how many users in each group opened the email. In each simulation, each group consists of exactly 500 recipients. (We can imagine we wrote names on pieces of paper and drew them out of a hat to assign recipients to groups. It is important to note this is not the only way of assigning recipients to groups, and the method does indeed matter.)
| Simulation | $A$ | $B$ | Winner | Stat. Sig. |
| --- | --- | --- | --- | --- |
| 1 | 3 | 1 | $A$ | No |
| 2 | 3 | 3 | tie | No |
| 3 | 2 | 1 | $A$ | No |
| 4 | 2 | 3 | $B$ | No |
| 5 | 1 | 3 | $B$ | No |
| 6 | 3 | 1 | $A$ | No |
| 7 | 1 | 4 | $B$ | No |
| 8 | 1 | 5 | $B$ | No |
| 9 | 1 | 5 | $B$ | No |
| 10 | 0 | 4 | $B$ | No |
For example, the first row summarizes the results of the simulation we have already discussed, in which 3 recipients in group $A$ opened the email, and 1 recipient in group $B$ opened. In 3 out of 10 simulations, the observed open rate of group $A$ exceeded that of group $B$. If we accepted the results at face value, we would be wrong $30\%$ of the time. In none of the simulations were the results statistically significant (using a $0.05$ p-value threshold). Statistical significance helps us avoid drawing the wrong conclusion, but also prevents us from rightfully declaring $B$ the winner, even when it handsomely outperformed $A$. In fact, in a thousand simulations, we did not get a statistically significant result in any of them. Clearly our experiment is not sensitive enough to detect the difference. We say that the statistical power is zero.
Statistical Power
Statistical power is defined as the probability of rejecting the Null Hypothesis under the conditions of a specific Alternative Hypothesis. An Alternative Hypothesis is a particular scenario in which the treatment does indeed have an effect on the response. When such a causal relationship exists, an experiment with high power will most likely be able to establish that connection.
The power of an experiment depends on:
- The effect strength (i.e. how big an impact the treatment has on the response)
- The number of experimental units or test subjects
- The way test subjects are assigned to groups
- The way we test for statistical significance
Of these, the first factor is completely beyond our control. The effect strength is what it is; we are merely trying to ascertain its nature. The larger the effect, the more easily we can establish it. The number of experimental units, also known as the sample size, may or may not be flexible. Sometimes we can simply pay more or wait longer for more test subjects; other times this too is beyond our control. The more test subjects in our experiment, the more powerful it will be, all else being equal. The final two elements are often overlooked, but we will see they play an important role in statistical power as well.
Let’s see a few examples of statistical power at work. First, let’s say our audience consists of one hundred thousand recipients instead of a thousand, but the demographics of this expanded audience are consistent with our original example. So there are a hundred recipients like Brian, who will only open the email if they receive subject line $A,$ and two hundred recipients like Alice and David, who will open the email regardless of the subject line they receive.
We randomly segment the audience into two groups of fifty thousand. We send subject line $A$ to the first group, and subject line $B$ to the second group. In the first group, 146 recipients opened the email for an open rate of $0.292\%$. In the second group, 257 recipients opened the email for an open rate of $0.514\%$. The first thing we note is that the observed open rates are pretty close to the population open rates of $0.3\%$ and $0.5\%,$ respectively. (Recall in this particular Alternative Hypothesis, 3 out of 1000 recipients will open the email if we send subject line $A$ to everyone; whereas, 5 out of 1000 will open after receiving subject line $B$.) The close approximation of the sample and population open rates is a consequence of the law of large numbers, which fundamentally explains why random segmentation works, but also necessitates large sample sizes.
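A quick simulation illustrates the law of large numbers at work here. The sketch below approximates each group’s opens with a binomial draw (an approximation for brevity; the actual experiment samples without replacement) and shows the spread of observed open rates shrinking as the group grows:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.003  # population open rate under subject line A

for n in [500, 5000, 50000]:
    # Draw 10,000 hypothetical groups of size n and look at how tightly
    # the observed open rates cluster around the population rate.
    rates = rng.binomial(n, p, size=10_000) / n
    print(n, rates.mean(), rates.std())
```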
We can use the same Python code from our previous post to determine whether the result is statistically significant:
N = 100000  # Audience size
n = 50000   # Experiment group size
ea = 146    # Email opens in group A
eb = 257    # Email opens in group B
good = ea + eb  # Total email opens
bad = N - good  # Number of non-email-openers
# Observed outcome extremity
oe_obs = np.abs(eb / (N - n) - ea / n)
B = 1000000  # Number of simulations
p = 0
for i in range(B):
    # Simulate how many email opens we might randomly sample
    sa = np.random.hypergeometric(good, bad, n)
    # Number of email openers who are necessarily in the second group
    sb = good - sa
    # Simulated outcome extremity
    oe_sim = np.abs(sb / (N - n) - sa / n)
    if oe_sim >= oe_obs:
        p += 1
pval = p / B
if pval < 0.05:
    print('Statistically Significant')
else:
    print('Not Statistically Significant')
We find the result is indeed statistically significant. Huzzah! We can repeat this process again and again, checking whether the results are statistically significant each time. Here are the results of 10 simulations:
| Simulation | $A$ | $B$ | Winner | Stat. Sig. |
| --- | --- | --- | --- | --- |
| 1 | 146 | 257 | $B$ | Yes |
| 2 | 164 | 247 | $B$ | Yes |
| 3 | 169 | 234 | $B$ | Yes |
| 4 | 152 | 257 | $B$ | Yes |
| 5 | 149 | 257 | $B$ | Yes |
| 6 | 149 | 239 | $B$ | Yes |
| 7 | 155 | 238 | $B$ | Yes |
| 8 | 162 | 255 | $B$ | Yes |
| 9 | 145 | 244 | $B$ | Yes |
| 10 | 151 | 254 | $B$ | Yes |
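As a sanity check, the first simulation can also be fed to Fisher’s exact test, discussed earlier in this series. This cross-check (via SciPy, rather than the randomization test above) reaches the same conclusion:

```python
from scipy.stats import fisher_exact

n = 50000
ea, eb = 146, 257  # opens in groups A and B from simulation 1
# 2x2 contingency table: rows are groups, columns are (opened, did not open)
table = [[ea, n - ea], [eb, n - eb]]
_, pval = fisher_exact(table)
print(pval)
print('Statistically Significant' if pval < 0.05 else 'Not Statistically Significant')
```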
Whereas an audience of a thousand led to zero power, it would appear an audience of a hundred thousand has effectively $100\%$ power. It isn’t allornothing, however; for sample sizes between a thousand and a hundred thousand, the power will be nontrivial.
| Sample Size | Power | Sample Size | Power |
| --- | --- | --- | --- |
| 1,000 | $0.0\%$ | 15,000 | $43\%$ |
| 2,000 | $3.2\%$ | 20,000 | $58\%$ |
| 3,000 | $5.4\%$ | 30,000 | $79\%$ |
| 4,000 | $8.8\%$ | 40,000 | $90.8\%$ |
| 5,000 | $10.8\%$ | 50,000 | $96.1\%$ |
| 10,000 | $27\%$ | 100,000 | $99.968\%$ |
Around twenty thousand is a clear turning point. By the time we have fifty thousand recipients, it is highly likely we would be able to detect an effect similar to the one considered here. Of course, we do not know the actual effect candidate subject lines have on email opens. That’s why we’re doing the experiment! But by simulating a few plausible alternative hypotheses, we can determine what sample sizes would be needed to reliably detect interesting effects.
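The power estimates in the table can be approximated with a short script. The sketch below is a simplification: it swaps in SciPy’s Fisher’s exact test for the simulation-based significance test used above, which is much faster, so the resulting numbers should be comparable to the table but not identical.

```python
import numpy as np
from scipy.stats import fisher_exact

def estimate_power(N, alpha=0.05, sims=200, seed=0):
    """Estimate power for an audience of N recipients, scaling the running
    example: 0.2% open regardless, 0.1% open only under subject line A,
    and 0.3% open only under subject line B."""
    rng = np.random.default_rng(seed)
    n = N // 2
    q_alice = round(0.002 * N)      # open regardless
    q_brian = round(0.001 * N)      # open only if they receive A
    q_charlotte = round(0.003 * N)  # open only if they receive B
    q_george = N - q_alice - q_brian - q_charlotte
    counts = [q_george, q_charlotte, q_brian, q_alice]
    hits = 0
    for _ in range(sims):
        # Randomly split every recipient type between the two groups.
        qA = rng.multivariate_hypergeometric(counts, n)
        qB = np.array(counts) - qA
        eo_A = qA[3] + qA[2]  # Alices and Brians in group A open
        eo_B = qB[3] + qB[1]  # Alices and Charlottes in group B open
        table = [[eo_A, n - eo_A], [eo_B, (N - n) - eo_B]]
        _, pv = fisher_exact(table)
        if pv < alpha:
            hits += 1
    return hits / sims

print(estimate_power(5000))   # compare with the table above
print(estimate_power(50000))  # compare with the table above
```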
Oftentimes, the sample size needed to detect a hoped-for effect is impractically large. Even at large tech companies that can run experiments on millions of users, executives might be unwilling to wait even a week for results. In that case, an extremely underutilized strategy is simply to use a higher p-value threshold for significance testing. While $0.05$ is a standard threshold for statistical significance, it is nonetheless arbitrary. In the spirit of “move fast and break things”, I will often advocate using a threshold as high as 0.2.^{1} The risk, of course, is declaring a winner prematurely. We might get stuck with a subject line that in fact has a lower open rate, hurting the business. But in situations where we want to learn fast, this might be an acceptable risk.^{2} The below table shows the power for different sample sizes using a p-value threshold of 0.2.
| Sample Size | Power | Sample Size | Power |
| --- | --- | --- | --- |
| 1,000 | $6.2\%$ | 15,000 | $74\%$ |
| 2,000 | $15\%$ | 20,000 | $84\%$ |
| 3,000 | $21\%$ | 30,000 | $94.5\%$ |
| 4,000 | $26\%$ | 40,000 | $98.2\%$ |
| 5,000 | $33\%$ | 50,000 | $99.5\%$ |
| 10,000 | $57\%$ | 100,000 | $99.999\%$ |
Conclusion
Statistical power is an important concept for anyone running A/B tests. It is a holistic measure of the sensitivity of the experiment, depending not only on the effect size, but also on details of how the experiment is administered and analyzed. Simulating experiments is an intuitive, albeit computationally intensive approach to estimating power. Taking power into consideration when planning an experiment dramatically improves the value of the test.^{3}
Update 2018/07/04
As I was working on my next post, I realized I had not run nearly enough simulations to get a precise estimate of the power. Ironically, my post on statistical power was itself underpowered. The two tables showing power (one using a p-value threshold of $0.05$, the other, $0.20$) initially used $400$ simulations. I have updated the tables using $100,000$ simulations (it took over an hour to generate each individual table entry). This did not materially affect the narrative, but does help me sleep better at night.
For improved transparency, I have posted the script I used as well as the raw results of my simulations to GitHub. You can run it from the command line by downloading it to your local machine and running:
$ python power.py --alpha 0.05 --simulations 400 > results.txt
The code is a bit different than what is printed above. I have made some changes to leverage numpy’s vector processing, which speeds things up a bit, but (in my opinion) makes it harder to understand what the code is doing. The code also makes mysterious use of a function called `wilson_ci`, which we will discuss in our next post.
1. It is important to note that human health and safety is never at risk in the sorts of experiments I run. ↩︎
2. Only a fool takes big risks and then acts surprised when it blows up in his face. Ye be warned. ↩︎
3. Cover image courtesy Brett Sayles. ↩︎