Statistical Power


A/B Testing Series

  1. Random Sampling
  2. Statistical Significance
  3. Fisher's Exact Test
  4. Counterfactuals and Causal Reasoning
  5. Statistical Power
  6. Confidence Intervals

Introduction

In previous posts, we have discussed how random segmentation allows us to investigate the relationship between a purported cause and an observed effect. The concept of statistical significance allows us to quantify the evidential strength of such a relationship. Even when there really is a causal relationship, a particular experiment may not be able to detect it. We need enough experimental units (i.e. test subjects) to achieve a statistically significant result. The concept of statistical power is one way of quantifying the sensitivity of the experiment, and allows us to compare the sensitivities of experiments having different sample sizes or using different methodologies.

A simple example will show why random segmentation can fail to uncover a true causal relationship when sample sizes are small. We will continue our Counterfactual example involving email subject lines.

Recipient  | $A$ | $B$
-----------|-----|-----
Alice      |  x  |  x
Brian      |  x  |
Charlotte  |     |  x
David      |  x  |  x
Emily      |     |  x
Frank      |     |  x
George     |     |
993 Others |     |
Totals     |  3  |  5

The above table shows how different email recipients will respond to two candidate subject lines $A$ and $B$. Recall that, in reality, we could never know how an individual would respond to a subject line they did not receive; we can only observe how they respond to the subject line they do receive. But if we had a crystal ball that let us peer into an alternate reality, we could construct a table like the above. Such a conceptual device allows us to speculate about what it means for an event or agent to cause an outcome. In this case, the presence of an “x” in a particular column indicates that the corresponding recipient will open the email if they receive the corresponding subject line. For example, Brian will open the email if he receives subject line $A,$ but not if he receives $B$. Because of this, we conclude that he opened the email because he received subject line $A$. In contrast, since Alice will open the email regardless of the subject line she receives, the subject line itself does not explain her behavior. Examining the table, we see that the subject line is part of the causal explanation for the behavior of Brian, Charlotte, Emily, and Frank, but for no one else.
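To make the bookkeeping concrete, here is one way the crystal-ball table might be encoded in python. The structure is purely illustrative (the 993 others are omitted since they behave like George), but it shows how the column totals, and the list of recipients for whom the subject line is causal, fall straight out of the table.

# Hypothetical encoding of the crystal-ball table: for each recipient,
# whether they would open the email under subject line A and under B.
would_open = {
    "Alice":     {"A": True,  "B": True},
    "Brian":     {"A": True,  "B": False},
    "Charlotte": {"A": False, "B": True},
    "David":     {"A": True,  "B": True},
    "Emily":     {"A": False, "B": True},
    "Frank":     {"A": False, "B": True},
    "George":    {"A": False, "B": False},
}

# Column totals: 3 opens if everyone received A, 5 if everyone received B
total_A = sum(r["A"] for r in would_open.values())
total_B = sum(r["B"] for r in would_open.values())

# The subject line is causal only for recipients whose two columns differ
causal = [name for name, r in would_open.items() if r["A"] != r["B"]]

print(total_A, total_B)  # 3 5
print(causal)            # ['Brian', 'Charlotte', 'Emily', 'Frank']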

Without a crystal ball, we can only speculate about an individual’s response to a subject line they did not receive; however, we can use random segmentation to investigate the relative effectiveness of the two subject lines. Splitting the audience randomly in half, we send subject line $A$ to one group, and subject line $B$ to the other. Comparing the email open rates for both groups provides insight into which subject line is better. The following python code simulates the group assignments. (Statistics is not a spectator sport! Actually open up python and run the code yourself!)

import numpy as np

N = 1000
Q_Alice = 2      # Opens email regardless
Q_Brian = 1      # Only opens email if receives A
Q_Charlotte = 3  # Only opens email if receives B
Q_George = N - Q_Charlotte - Q_Brian - Q_Alice

n = int(N / 2)

# Strategy from https://stackoverflow.com/questions/35734026/numpy-drawing-from-urn
# Basically, we sample n people from a population of N recipients, Q_George of whom
# are like George, the remainder of which are "not like George". Our sample then
# contains qA_George people like George and (n - qA_George) people not like George.
# Whichever "Georges" are not in the sample are by definition left for the second
# group.

qA_George = np.random.hypergeometric(Q_George, Q_Charlotte + Q_Brian + Q_Alice, n)
qB_George = Q_George - qA_George

# Next we sample (n - qA_George) users from a population of N - Q_George recipients,
# Q_Charlotte of whom are like Charlotte. This gives the number of Charlotte-like
# recipients in our A group.
# If qA_George = n, we'll get a ValueError; that means our A group consists entirely
# of Georges.
try:
    qA_Charlotte = np.random.hypergeometric(Q_Charlotte, Q_Brian + Q_Alice, n - qA_George)
except ValueError:
    qA_Charlotte = 0

qB_Charlotte = Q_Charlotte - qA_Charlotte

# Ditto Brian
try:
    qA_Brian = np.random.hypergeometric(Q_Brian, Q_Alice, n - qA_George - qA_Charlotte)
except ValueError:
    qA_Brian = 0

qB_Brian = Q_Brian - qA_Brian

qA_Alice = n - qA_George - qA_Charlotte - qA_Brian
qB_Alice = Q_Alice - qA_Alice

print(qA_George, qA_Charlotte, qA_Brian, qA_Alice)
print(qB_George, qB_Charlotte, qB_Brian, qB_Alice)

# Number of email opens in group A

# Any Alices in the group will open the email because they open the
# email no matter what. Additionally, any Brians in the group will
# open the email *because* they receive subject line A.

eo_A = qA_Alice + qA_Brian

# Number of email opens in group B.
eo_B = qB_Alice + qB_Charlotte
print(eo_A, eo_B)

Let’s say that after dividing the groups, Alice, Brian, David, Emily, and Frank wind up in group $A,$ while Charlotte and George wind up in group $B$. The unnamed recipients are divided randomly as well, but do not concern us since they, like George, do not open the email regardless of the subject line they receive. Alice, Brian, and David all open the email. (Thanks to our crystal ball, we know that Brian opens the email because he received subject line $A,$ whereas Alice and David open it for some unknown reason. In reality, we would never know why any individual opened the email.) In group $B,$ only Charlotte opens the email. The observed open rate in group $A$ is 3/500, while for group $B$ it is 1/500.

The observed result, that subject line $A$ is better than subject line $B,$ is incorrect. The original table shows that, if we sent subject line $B$ to everyone, 5 people would open the email; whereas, if we sent subject line $A$ to everyone, 3 people would open. But because we are dividing the groups randomly, by chance, the only 3 recipients who respond favorably to subject line $A$ all wound up in that group. Of the 5 recipients who respond favorably to subject line $B,$ only 1 wound up in the corresponding group. That is the downside of random segmentation!

Let’s repeat the random segmentation a few more times, to see what sorts of outcomes are likely. It becomes cumbersome to list which individuals are assigned to which groups; instead, we will simply list how many users in each group opened the email. In each simulation, each group consists of exactly 500 recipients. (We can imagine we wrote names on pieces of paper and drew them out of a hat to assign recipients to groups. It is important to note this is not the only way of assigning recipients to groups, and the method does indeed matter.)

Simulation | $A$ | $B$ | Winner | Stat. Sig.
-----------|-----|-----|--------|-----------
1          |  3  |  1  | $A$    | No
2          |  3  |  3  | tie    | No
3          |  2  |  1  | $A$    | No
4          |  2  |  3  | $B$    | No
5          |  1  |  3  | $B$    | No
6          |  3  |  1  | $A$    | No
7          |  1  |  4  | $B$    | No
8          |  1  |  5  | $B$    | No
9          |  1  |  5  | $B$    | No
10         |  0  |  4  | $B$    | No

For example, the first row summarizes the results of the simulation we have already discussed, in which 3 recipients in group $A$ opened the email, and 1 recipient in group $B$ opened. In 3 out of 10 simulations, the observed open rate of group $A$ exceeded that of group $B$. If we accepted the results at face value, we would be wrong $30\%$ of the time. In none of the simulations were the results statistically significant (using a $0.05$ p-value threshold). Statistical significance helps us avoid drawing the wrong conclusion, but also prevents us from rightfully declaring $B$ the winner, even when it handsomely outperformed $A$. In fact, in a thousand simulations, we did not get a statistically significant result in any of them. Clearly our experiment is not sensitive enough to detect the difference. We say that the statistical power is zero.
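To see why none of these outcomes clears the bar, we can run a significance check on the most lopsided rows of the table. The snippet below is only a sketch: it uses scipy's Fisher's exact test as a quick stand-in for the permutation test from the previous post, so the exact p-values differ slightly, but the conclusion is the same.

from scipy.stats import fisher_exact

n = 500  # recipients per group

# (opens in A, opens in B) for the most lopsided simulations above
for eo_A, eo_B in [(3, 1), (1, 5), (0, 4)]:
    _, pval = fisher_exact([[eo_A, n - eo_A], [eo_B, n - eo_B]])
    print(eo_A, eo_B, round(pval, 3))  # every p-value is comfortably above 0.05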

Statistical Power

Statistical power is defined as the probability of rejecting the Null Hypothesis under the conditions of a specific Alternative Hypothesis. An Alternative Hypothesis is a particular scenario in which the treatment does indeed have an effect on the response. When such a causal relationship exists, an experiment with high power will most likely be able to establish that connection.
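Writing $\beta$ for the probability of failing to reject the Null Hypothesis when the Alternative Hypothesis is in fact true (a false negative, or Type II error), this definition is commonly summarized as

$$ \text{power} = P(\text{reject } H_0 \mid H_A \text{ is true}) = 1 - \beta. $$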

The power of an experiment depends on:

  • The effect strength (i.e. how big an impact the treatment has on the response)
  • The number of experimental units or test subjects
  • The way test subjects are assigned to groups
  • The way we test for statistical significance

Of these, the first factor is completely beyond our control. The effect strength is what it is; we are merely trying to ascertain its nature. The larger the effect, the more easily we can establish it. The number of experimental units, also known as the sample size, may or may not be flexible. Sometimes we can simply pay more or wait longer for more test subjects; other times this too is beyond our control. The more test subjects in our experiment, the more powerful it will be, all else being equal. The final two elements are often overlooked, but we will see they play an important role in statistical power as well.

Let’s see a few examples of statistical power at work. First, let’s say our audience consists of one hundred thousand recipients instead of a thousand, but the demographics of this expanded audience are consistent with our original example. So there are a hundred recipients like Brian, who will only open the email if they receive subject line $A$; two hundred recipients like Alice and David, who will open the email regardless of the subject line they receive; and three hundred recipients like Charlotte, Emily, and Frank, who will only open the email if they receive subject line $B$.

We randomly segment the audience into two groups of fifty thousand. We send subject line $A$ to the first group, and subject line $B$ to the second group. In the first group, 146 recipients opened the email for an open rate of $0.292\%$. In the second group, 257 recipients opened the email for an open rate of $0.514\%$. The first thing we note is that the observed open rates are pretty close to the population open rates of $0.3\%$ and $0.5\%,$ respectively. (Recall in this particular Alternative Hypothesis, 3 out of 1000 recipients will open the email if we send subject line $A$ to everyone; whereas, 5 out of 1000 will open after receiving subject line $B$.) The close approximation of the sample and population open rates is a consequence of the law of large numbers, which fundamentally explains why random segmentation works, but also necessitates large sample sizes.
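As a sanity check, one run of this larger experiment can be simulated in the same spirit as the earlier code. The sketch below assumes the scaled-up demographics just described (200 recipients like Alice and David, 100 like Brian, 300 like Charlotte, Emily, and Frank), and uses a single multivariate hypergeometric draw in place of the sequential draws used before.

import numpy as np

rng = np.random.default_rng()

N = 100000
n = N // 2
Q_Alice, Q_Brian, Q_Charlotte = 200, 100, 300   # always, A-only, B-only openers
Q_George = N - Q_Alice - Q_Brian - Q_Charlotte  # never open

# Randomly place n recipients in group A; everyone else forms group B
qA_Alice, qA_Brian, qA_Charlotte, qA_George = rng.multivariate_hypergeometric(
    [Q_Alice, Q_Brian, Q_Charlotte, Q_George], n)

eo_A = qA_Alice + qA_Brian                                  # opens in group A
eo_B = (Q_Alice - qA_Alice) + (Q_Charlotte - qA_Charlotte)  # opens in group B
print(eo_A, eo_B)  # counts near 150 and 250, varying from run to run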

We can use the same python code from our previous post to determine whether the result is statistically significant:

import numpy as np

N  = 100000    # Audience size
n  =  50000    # Experiment group size
ea =  146      # Email opens in group A
eb =  257      # Email opens in group B
good = ea + eb # Total email opens
bad = N - good # Number of non-email openers

# Observed outcome extremity
oe_obs = np.abs(eb / (N - n) - ea / n)

B = 1000000 # Number of simulations
p = 0
for i in range(B):
    # Simulate how many email opens we might randomly sample
    sa = np.random.hypergeometric(good, bad, n)

    # Number of email openers who are necessarily in the second group
    sb = good - sa

    # Simulated outcome extremity
    oe_sim = np.abs(sb / (N - n) - sa / n)

    if oe_sim >= oe_obs:
        p += 1

pval = p / B
if pval < 0.05:
    print('Statistically Significant')
else:
    print('Not Statistically Significant')

We find the result is indeed statistically significant. Huzzah! We can repeat this process again and again, checking whether the results are statistically significant each time. Here are the results of 10 simulations:

Simulation | $A$ | $B$ | Winner | Stat. Sig.
-----------|-----|-----|--------|-----------
1          | 146 | 257 | $B$    | Yes
2          | 164 | 247 | $B$    | Yes
3          | 169 | 234 | $B$    | Yes
4          | 152 | 257 | $B$    | Yes
5          | 149 | 257 | $B$    | Yes
6          | 149 | 239 | $B$    | Yes
7          | 155 | 238 | $B$    | Yes
8          | 162 | 255 | $B$    | Yes
9          | 145 | 244 | $B$    | Yes
10         | 151 | 254 | $B$    | Yes

Whereas an audience of a thousand led to zero power, it would appear an audience of a hundred thousand has effectively $100\%$ power. It isn’t all-or-nothing, however; for sample sizes between a thousand and a hundred thousand, the power will be nontrivial.

Sample Size | Power    | Sample Size | Power
------------|----------|-------------|-----------
1,000       | $0.0\%$  | 15,000      | $43\%$
2,000       | $3.2\%$  | 20,000      | $58\%$
3,000       | $5.4\%$  | 30,000      | $79\%$
4,000       | $8.8\%$  | 40,000      | $90.8\%$
5,000       | $10.8\%$ | 50,000      | $96.1\%$
10,000      | $27\%$   | 100,000     | $99.968\%$

Somewhere around twenty thousand recipients, the power crosses the fifty percent mark. By the time we have fifty thousand recipients, it is highly likely we would be able to detect an effect similar to the one considered here. Of course, we do not know the actual effect candidate subject lines have on email opens. That’s why we’re doing the experiment! But by simulating a few plausible alternative hypotheses, we can determine what sample sizes would be needed to reliably detect interesting effects.

Oftentimes, the sample size needed to detect a hoped-for effect is impractically large. Even at large tech companies that can run experiments on millions of users, executives might be unwilling to wait even a week for results. In that case, an extremely underutilized strategy is simply to use a higher p-value threshold for significance testing. While $0.05$ is a standard threshold for statistical significance, it is nonetheless arbitrary. In the spirit of “move fast and break things”, I will often advocate using a threshold as high as 0.2.1 The risk, of course, is declaring a winner prematurely. We might get stuck with a subject line that in fact has a lower open rate, hurting the business. But in situations where we want to learn fast, this might be an acceptable risk.2 The table below shows the power for different sample sizes using a p-value threshold of 0.2.

Sample Size | Power   | Sample Size | Power
------------|---------|-------------|-----------
1,000       | $6.2\%$ | 15,000      | $74\%$
2,000       | $15\%$  | 20,000      | $84\%$
3,000       | $21\%$  | 30,000      | $94.5\%$
4,000       | $26\%$  | 40,000      | $98.2\%$
5,000       | $33\%$  | 50,000      | $99.5\%$
10,000      | $57\%$  | 100,000     | $99.999\%$
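To give a sense of how tables like the two above could be produced, here is a rough sketch: for a given audience size and significance threshold, simulate many experiments under this Alternative Hypothesis and record the fraction that come out statistically significant. As before, scipy's Fisher's exact test stands in for the permutation test, and the hypothetical estimate_power function below is not the script that generated the published tables, so the numbers will not match exactly.

import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng()

def estimate_power(N, alpha=0.05, sims=1000):
    # Fraction of simulated experiments reaching significance at level alpha
    n = N // 2
    # Recipient types in the same proportions as the original audience:
    # always-openers, A-only openers, B-only openers, never-openers
    Q = [2 * N // 1000, 1 * N // 1000, 3 * N // 1000, 0]
    Q[3] = N - sum(Q[:3])
    significant = 0
    for _ in range(sims):
        a, b, c, g = rng.multivariate_hypergeometric(Q, n)
        eo_A, eo_B = a + b, (Q[0] - a) + (Q[2] - c)
        _, pval = fisher_exact([[eo_A, n - eo_A], [eo_B, (N - n) - eo_B]])
        significant += pval < alpha
    return significant / sims

for N in [1000, 5000, 20000, 50000]:
    print(N, estimate_power(N), estimate_power(N, alpha=0.2))

Increasing sims tightens each estimate, at the cost of run time.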

Conclusion

Statistical power is an important concept for anyone running A/B tests. It is a holistic measure of the sensitivity of the experiment, depending not only on the effect size, but also on the details of how the experiment is administered and analyzed. Simulating experiments is an intuitive, albeit computationally intensive, approach to estimating power. Taking power into consideration when planning an experiment dramatically improves the value of the test.3

Update 2018/07/04

As I was working on my next post, I realized I had not run nearly enough simulations to get a precise estimate of the power. Ironically, my post on statistical power was itself under-powered. The two tables showing power (one using a p-value threshold of $0.05$, the other, $0.20$), initially used $400$ simulations. I have updated the tables using $100,000$ simulations (it took over an hour to generate each individual table entry). This did not materially affect the narrative, but does help me sleep better at night.
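As a rough illustration of why $400$ simulations was too few: the power estimate is just a proportion of “significant” outcomes out of $S$ simulations, so its standard error is approximately

$$ \mathrm{SE}(\hat{p}) \approx \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{S}}. $$

For a power near $50\%,$ that works out to roughly $\sqrt{0.25/400} \approx 2.5$ percentage points with $400$ simulations, but only about $0.16$ percentage points with $100{,}000$.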

For improved transparency, I have posted the script I used as well as the raw results of my simulations to Github. You can run it from the command line by downloading it to your local machine and running:

$ python power.py --alpha 0.05 --simulations 400 > results.txt

The code is a bit different than what is printed above. I have made some changes to leverage numpy’s vector processing, which speeds things up a bit, but (in my opinion) makes it harder to understand what the code is doing. The code also makes mysterious use of a function called wilson_ci, which we will discuss in our next post.


  1. It is important to note that human health and safety is never at risk in the sorts of experiments I run. ↩︎

  2. Only a fool takes big risks and then acts surprised when it blows up in his face. Ye be warned. ↩︎

  3. Cover image courtesy Brett Sayles. ↩︎

Bob Wilson
Marketing Data Scientist

The views expressed on this blog are Bob’s alone and do not necessarily reflect the positions of current or previous employers.
