Counterfactuals and Causal Reasoning
Introduction
So far in this series we have only considered the possibility that our actions have no effect on an observed outcome. This disheartening possibility is called the Null Hypothesis. Whenever we are using random segmentation to investigate the causal relationship between an experience we are providing and the response of an audience, the observed treatment effect must be large relative to what is plausibly attributable to random chance. To use the standard terminology, the effect must be statistically significant.
When either the audience size or the causal effect is small, we are unlikely to achieve statistical significance. But a result that is not statistically significant does not imply there is no treatment effect. To understand what conclusions we may rightfully draw from such “null” results, we need a better understanding of what sort of outcomes are possible when our actions do indeed affect the behavior of an audience.
Counterfactuals
A popular approach to causal inference is based on counterfactuals. The Stanford Encyclopedia of Philosophy provides an excellent discussion of the history and development of this approach.^{1} The basic idea is to consider what would have happened if a specific event had not occurred, or a specific agent had not been present. We compare this counterfactual reality with what was actually observed following said event. As discussed in the Random Sampling article, this is easier said than done.
We can make use of the crystal ball we introduced in that post to provide an example of precisely what we mean by this. Recall from our subject line example that we have an audience of 1000 email recipients, and we are investigating the impact of two candidate subject lines, $A$ and $B$, on open rates. We send the first subject line to everyone, and three people open the email. We peer into our crystal ball, which conveniently provides a window into a counterfactual reality in which we had sent the second subject line to everyone. In that reality, five people opened the email. With a crystal ball, not only can we determine which subject line leads to more email opens, we can actually list which recipients opened it in which reality.
| Recipient | $A$ | $B$ |
|---|---|---|
| Alice | x | x |
| Brian | x | |
| Charlotte | | x |
| David | x | x |
| Emily | | x |
| Frank | | x |
| George | | |
| Totals | 3 | 5 |
In the table above, an “x” denotes that the person opened the email when receiving that subject line. Alice and David opened the email regardless of the subject line they received. George, along with the remaining audience members not listed, did not open the email under either subject line. None of these recipients’ behaviors were affected by the subject line.
In contrast, the behaviors of Brian, Charlotte, Emily, and Frank were indeed influenced by the subject line they received. Brian only opened the email in the reality where he received subject line $A$, whereas Charlotte, Emily, and Frank only opened the email in the reality where they received subject line $B$.
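The crystal-ball table can be written down as a small data structure, which makes the four kinds of recipient explicit. This is only an illustrative sketch: the dictionary layout and the `classify` helper are assumptions made for this example, not part of the original analysis.

```python
# Potential outcomes from the crystal-ball table: (opens with A, opens with B).
# Everyone not listed behaves like George and never opens.
potential_outcomes = {
    "Alice": (True, True),
    "Brian": (True, False),
    "Charlotte": (False, True),
    "David": (True, True),
    "Emily": (False, True),
    "Frank": (False, True),
    "George": (False, False),
}


def classify(opens_a, opens_b):
    """Label a recipient by how the subject line affects their behavior."""
    if opens_a and opens_b:
        return "opens regardless"
    if opens_a:
        return "opens only with A"
    if opens_b:
        return "opens only with B"
    return "never opens"


# Column totals: how many opens we would see if everyone received A (or B).
total_a = sum(a for a, _ in potential_outcomes.values())
total_b = sum(b for _, b in potential_outcomes.values())
labels = {name: classify(a, b) for name, (a, b) in potential_outcomes.items()}

print("Totals:", total_a, total_b)
for name, label in labels.items():
    print(f"{name}: {label}")
```

Only the recipients labeled "opens only with A" or "opens only with B" are causally affected by the choice of subject line.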
Looking at the table, it is clear sending subject line $B$ to everyone is preferable to sending subject line $A$ to everyone. (Actually, sending subject line $B$ to everyone except Brian, and sending subject line $A$ to him, is the best option of all. For now, we assume we only care about determining which subject line is the best overall. Providing tailored experiences to each individual is much more challenging, and is outside the scope of the present discussion.) Unfortunately, the only way to generate the table is with a crystal ball, which doesn’t actually exist. We can only observe the response to the treatment we actually provide, and we can only speculate about the response to treatments we do not provide. That is what makes causal inference so difficult.
With random segmentation, we randomly assign subject lines to audience members and observe the results. This allows us to fill in part of the table. For example, suppose we randomly select Alice, Charlotte, David, and Frank (and 496 others) to receive subject line $A$, and the remaining people to receive subject line $B$. This leads to the following table.
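| Receives $A$ | $A$ | $B$ | Receives $B$ | $A$ | $B$ |
|---|---|---|---|---|---|
| Alice | x | ? | Brian | ? | |
| Charlotte | | ? | Emily | ? | x |
| David | x | ? | George | ? | |
| Frank | | ? | (497 others) | ? | |
| (496 others) | | ? | | | |
| Totals | 2 | ? | | ? | 1 |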
Consistent with the previous table, Alice and David opened the email. It is important to note, however, that receiving subject line $A$ is not the reason they opened it: they would have opened it even if they had received subject line $B$. But we only know this because of our crystal ball. In reality, we have no idea what any of the first group would have done had they received subject line $B$. That’s why there are question marks in that column. Indeed, we can only speculate about why any of these individuals opened or did not open the email.
Similarly, for the group that receives subject line $B$, only Emily opened the email. In light of the original table, Emily did indeed open the email because she received subject line $B$. We know this because in a parallel universe where she received subject line $A$, she did not open the email. But again, we can never know this in any realistic scenario.
What we do know is how people reacted to the subject lines they received. Out of the 500 people randomly selected to receive subject line $A$, 2 opened the email, for an open rate of $0.4\%$, whereas the open rate in the second group was only 1 out of 500, or $0.2\%$. Taking the results at face value, we would conclude that subject line $A$ is twice as good as subject line $B$, when in fact it is worse.
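The arithmetic for this particular assignment can be reproduced with a short sketch. The placeholder names for the unlisted recipients (`other_a_0`, `other_b_0`, and so on) and the `observed_opens` helper are hypothetical, introduced only for illustration. The key point is that each group reveals only one of its two columns.

```python
# Potential outcomes: (opens with A, opens with B). Unlisted people never open.
potential_outcomes = {
    "Alice": (True, True),
    "Brian": (True, False),
    "Charlotte": (False, True),
    "David": (True, True),
    "Emily": (False, True),
    "Frank": (False, True),
    "George": (False, False),
}
NEVER_OPENS = (False, False)

# The random assignment described in the text (placeholder names for the rest).
group_a = ["Alice", "Charlotte", "David", "Frank"] + [f"other_a_{i}" for i in range(496)]
group_b = ["Brian", "Emily", "George"] + [f"other_b_{i}" for i in range(497)]


def observed_opens(group, column):
    """Count opens using only the outcome for the treatment actually received."""
    return sum(potential_outcomes.get(name, NEVER_OPENS)[column] for name in group)


opens_a = observed_opens(group_a, 0)  # column 0: outcome under subject line A
opens_b = observed_opens(group_b, 1)  # column 1: outcome under subject line B
rate_a = opens_a / len(group_a)
rate_b = opens_b / len(group_b)

print(f"A: {opens_a}/{len(group_a)} = {rate_a:.1%}")
print(f"B: {opens_b}/{len(group_b)} = {rate_b:.1%}")
```

The observed rates favor $A$ even though the full table favors $B$, because Charlotte and Frank happened to land in the group whose subject line they do not respond to.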
Random segmentation does not always enable us to determine which treatment gives the best result, as this example shows, but neither does any other method. What random segmentation does provide is:
 A method that gives more reliable answers the larger the audience.
 Extremely precise measures of how reliable the method itself is, for audiences large or small.
I am unaware of any other method that does the same, which is why random segmentation is considered the gold standard of causal inference.
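To get a feel for the first point, one can simulate many random splits and count how often the observed results favor the truly better subject line. The sketch below makes a strong simplifying assumption that is not in the original example: larger audiences are built by scaling up the crystal-ball table in the same proportions (per 1000 recipients, 2 open regardless, 1 opens only with $A$, and 3 open only with $B$). The function names, trial count, and seed are likewise illustrative choices.

```python
import random


def simulate_split(scale, rng):
    """Randomly split a scaled audience in half; return observed opens (A, B)."""
    n = 1000 * scale
    # Responder types, scaled from the crystal-ball table (an assumption).
    responders = (
        [(True, True)] * (2 * scale)     # open regardless
        + [(True, False)] * (1 * scale)  # open only with A
        + [(False, True)] * (3 * scale)  # open only with B
    )
    slots_a = n // 2  # seats still free in the group receiving A
    opens_a = opens_b = 0
    for i, (opens_with_a, opens_with_b) in enumerate(responders):
        # Draw an exact half split one person at a time: person i joins group A
        # with probability (free A seats) / (people not yet assigned).
        # Non-responders never open, so their assignment can be skipped.
        if rng.random() < slots_a / (n - i):
            slots_a -= 1
            opens_a += opens_with_a
        else:
            opens_b += opens_with_b
    return opens_a, opens_b


def fraction_correct(scale, trials=1000, seed=0):
    """Fraction of random splits in which B is observed to beat A."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        opens_a, opens_b = simulate_split(scale, rng)
        # The groups are the same size, so comparing counts compares rates.
        wins += opens_b > opens_a
    return wins / trials


for scale in (1, 10, 100):
    print(f"audience {1000 * scale}: B looks better in {fraction_correct(scale):.1%} of splits")
```

Under these assumptions, the split of 1000 recipients identifies $B$ only somewhat more often than a coin flip would, while the larger audiences do so far more reliably.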
What does “Why?” mean anyway?
In the counterfactual approach, we are interpreting “why” in a specific way. We are fundamentally asking whether the occurrence of a specific event, or the presence of a specific agent, was necessary and sufficient for a particular outcome. If not necessary, the outcome would have happened even without the event or agent; if not sufficient, the event or agent is an incomplete explanation.
When we randomly selected Alice to receive subject line $A$, the latter was not necessary for Alice to open the email; she would still have opened the email had she received subject line $B$. On the other hand, it would appear that receiving subject line $B$ was both necessary and sufficient for Emily to open the email. In simpler language, we say that Emily opened the email because she received subject line $B$.
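One simplified way to read these definitions, restricted to a context with exactly two subject lines, is sketched below. The `causal_role` function is a hypothetical helper written for this illustration; it takes crystal-ball knowledge of both outcomes as given, which is never available in practice.

```python
def causal_role(opened_with_received, opened_with_other):
    """Return (necessary, sufficient) for the subject line a recipient received.

    Simplified two-option reading (an assumption of this sketch):
    - sufficient: the recipient opened the email after receiving this subject line
    - necessary:  they opened it, and would not have opened it with the other one
    """
    sufficient = opened_with_received
    necessary = opened_with_received and not opened_with_other
    return necessary, sufficient


# Alice received A and opened; she would also have opened with B.
alice = causal_role(opened_with_received=True, opened_with_other=True)
# Emily received B and opened; she would not have opened with A.
emily = causal_role(opened_with_received=True, opened_with_other=False)

print("Alice (necessary, sufficient):", alice)
print("Emily (necessary, sufficient):", emily)
```

Only when both flags are true does this reading license the statement that a recipient opened the email because of the subject line they received.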
This logic is only applicable in a particular context. If we speculate about a third subject line, $C$, and if we believe Emily would have opened the email had she received $C$, then in that context, $B$ was not necessary. If we additionally know that subject lines $B$ and $C$ use informal language in contrast to $A$, and that Emily does not appreciate formality, we might say that the subject line alone is not a complete causal description. Rather, the tone of the subject line and Emily’s preferences, together, form a more complete explanation for the outcome. Then in that case, $B$ is not sufficient. Conclusions about the causal relationship between an event or agent and an outcome depend on context and indeed on the goals of the inquiry.
While counterfactuals and random segmentation form a powerful and practically useful framework for causal inference, the approach has limitations. When we ask, “Why did Emily open the email?”, the answer according to this approach is, “Because she received subject line $B$.” The approach offers no insight into what it was about subject line $B$ that appealed to Emily. Neither does it offer any insight into what sort of subject line would appeal to George. Because of this, we cannot extrapolate what the response to other, untested subject lines might be.
A suitably rich collection of subject lines may enable us to investigate these issues. The theory of Experiment Design (and, presumably, theories of marketing and human psychology) has more to say on these issues, which are outside the scope of the present discussion. Nonetheless, in many situations, we are merely attempting to determine the best option from a particular set of alternatives. The counterfactual framework provides a logically compelling approach for considering what it actually means for one option to be best. Random segmentation provides a method not only for determining what that best option is, but also for quantifying the reliability of our conclusions. While there are many important questions that cannot be addressed within this framework, random segmentation is both practical and valuable.^{2}

1. Menzies, Peter, “Counterfactual Theories of Causation”, The Stanford Encyclopedia of Philosophy (Winter 2017 Edition), Edward N. Zalta (ed.).

2. Cover photo courtesy of Burak Kebapci.