The Reasoned Basis for Inference in Experiments
In his 1935 book, “The Design of Experiments”, Ronald Fisher described randomization as the “reasoned basis for inference” in an experiment. Why do we need a “basis” at all, let alone a reasoned one?
Causal effects are defined relative to counterfactual outcomes. To say that a person saw a movie because they saw an advertisement is to say that, had they not seen the advertisement, they would not have seen the movie. Yet this person did see the advertisement. We can only speculate about what they would have done otherwise. Anyone can speculate. Is speculation ever logically justified?
Randomization provides such justification. Randomization resolves an ambiguity in the definition of the counterfactual state (what it means for the person counterfactually not to have seen the ad). Randomization quantifies the “strength of evidence” by generalizing a powerful mathematical technique for establishing proof. In articulating this framework, we see that it is impossible to prove causation as we might prove a theorem; randomization provides perhaps the next best thing.
This note captures discussions regarding Paul Rosenbaum’s new book, “Causal Inference”. All references to Rosenbaum regard this book. My coworkers and I have been reading it, and the first few chapters have sparked interesting conversations about the “philosophical foundations” of causal inference. I want to thank David and Laurence both for reading the book with me, and for stimulating discussions.
Randomization Resolves the Ambiguity of Counterfactual States
Outside of an experiment, the notion of causation may be ambiguous. Rosenbaum gives the example of George Washington, who died on December 14, 1799. On December 13, doctors bled him in a misguided attempt to “restore a healthy balance in his humors”. Did Washington die because he was bled? Considering this question requires that we speculate about a counterfactual world where Washington was not bled. If Washington survived in that world, then bleeding did indeed cause Washington’s death; if Washington would have died in that world as well, then bleeding did not cause his death.
This counterfactual world does not exist. What does it mean to say “if Washington had not been bled”? Would the doctors have instead applied the next-best course of treatment? Would there have been no doctors in attendance at all? Would Martha Washington have made her husband some chicken soup? To make such speculation meaningful, we require a well-defined concept of this counterfactual world, and that would seem to require a greater degree of specification than is humanly possible.
Randomization offers one solution. Suppose a doctor uses the flip of a coin to decide the course of treatment for a patient. In this (unethical) scenario, the doctor must stand ready to apply either course of treatment, depending on how the coin flip turns out. If the coin comes up tails and the doctor says, “actually, I don’t know how to apply the corresponding treatment; it’s ambiguously defined”, then clearly the doctor was ill-equipped to allow randomization to decide the treatment assignment.
Randomization resolves the ambiguity in counterfactual states, not because randomization itself is necessary, but because randomization forces the experimenter to stand ready to apply either treatment. To say the counterfactual state is ambiguous is to claim the experimenter was not actually prepared to conduct the experiment.
Since randomization renders the counterfactual state unambiguous, it also renders the potential outcomes unambiguous. Causal effects may then be defined in terms of these potential outcomes.
The Possibility of Randomization Resolves the Ambiguity of Counterfactual States
Randomization is sufficient, but perhaps not necessary, to render the potential outcomes well-defined. In observational studies, we must find some other mechanism for defining the counterfactual state. Drawing a close analogy to an experiment we would conduct may suffice.
For example, the ability to conduct user-level holdout studies in many digital marketing channels means it is straightforward to define the counterfactual state, even without a randomized experiment. These channels typically feature an auction. When considering the counterfactual movie-going behavior of a person who saw a particular ad, we might define the counterfactual state as the one replacing our ad with the auction runner-up. In this example, understanding the mechanism leading to the (non-randomized) treatment assignment clarifies plausible counterfactual states.
This thought process may convince us the counterfactual state is well-defined, but does not solve the other challenges of causal inference. People who saw the runner-up ad form a natural control in our example, but this insight is not actionable. Marketers typically would not know which ad was second place in the auction. They would not know which people were exposed to it. It’s one thing to have well-defined counterfactual states, and another thing to have a causal identification strategy.
Still, having well-defined counterfactual states is a prerequisite for inference. It is up to the researcher to justify their approach and align their identification strategy with their definition. Much of causal inference assumes the researcher has done that. This assumption is part of the Stable Unit Treatment Value Assumption (SUTVA) and is sometimes described as requiring “unambiguous treatments”. SUTVA is really a statement of what is required of the practitioner, not something the practitioner uses to meet that requirement!
Causation Cannot Be Established to the Level of Logical Certainty
Fisher discussed the “sharp” null hypothesis of no effect for any unit. This sharp null hypothesis allows us to impute the counterfactual outcome for each unit: it is the same as the observed outcome. More generally, any hypothesis that allows us to impute the counterfactual outcome for each unit is referred to as “sharp”. Can the hypothesis of no effect ever be ruled out entirely?
Any sharp hypothesis fills in the gaps in our knowledge, implying a full set of potential outcomes. Randomization then allows us to assign a probability distribution to any statistic based on the observed outcomes.
For example, randomization implies a distribution on the mean treatment-minus-control difference. In a completely randomized experiment assigning $n$ of $N$ units to the treatment condition, all ${N \choose n}$ such assignments occur with equal probability. Because the sharp null hypothesis supplies a full set of potential outcomes, we can calculate the mean treatment-minus-control statistic associated with each assignment. If the observed statistic is $T$, and the statistic is at least $T$ under $q$ of the possible assignments, then the probability of such an outcome is $q / {N \choose n}.$ This probability is the p-value associated with the sharp null hypothesis of no effect.
There is always at least one possible assignment where this is true: the assignment actually observed. Thus, the p-value is always $\geq 1 / {N \choose n}.$ No matter the sample size, it is always possible the null hypothesis of no effect is true, and the coin flips turned out just so as to create the illusion of an effect. Even a perfectly executed experiment with large sample size cannot prove causation to the level of logical certainty.
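This enumeration is small enough to carry out directly for toy examples. The sketch below (the function name and the data are made up for illustration) computes the exact one-sided p-value by enumerating all ${N \choose n}$ assignments:

```python
from itertools import combinations
from math import comb

def randomization_p_value(outcomes, treated):
    """One-sided exact p-value under the sharp null of no effect.

    Under the sharp null, each unit's outcome is unchanged by treatment,
    so the full set of potential outcomes is known and we can recompute
    the mean treatment-minus-control difference for every one of the
    (N choose n) equally likely assignments.
    """
    N, n = len(outcomes), len(treated)

    def mean_diff(assignment):
        t = set(assignment)
        treated_mean = sum(outcomes[i] for i in t) / n
        control_mean = sum(outcomes[i] for i in range(N) if i not in t) / (N - n)
        return treated_mean - control_mean

    observed = mean_diff(treated)
    # q = number of assignments whose statistic is at least the observed T
    q = sum(1 for a in combinations(range(N), n) if mean_diff(a) >= observed)
    return q / comb(N, n)

# Toy experiment: units 0 and 2 were treated. The observed difference is
# the most extreme possible, so the p-value attains its floor 1/(4 choose 2).
p = randomization_p_value([3, 1, 2, 0], treated=[0, 2])
print(p)  # 1/6 ≈ 0.1667
```

Note that the observed assignment always counts itself in $q$, which is exactly why the p-value can never fall below $1 / {N \choose n}$.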
Rosenbaum shrugs this difficulty aside, writing, “Obviously the difference in survival could be due to chance providing we accept anything that is logically possible as realistically possible. No one does that, of course; you could not cross the street if you thought that way. Many things that are logically possible are ridiculously improbable.”
If we give up on the possibility of establishing causation to the level of logical certainty, what do we replace it with? Donald Rubin provided a much more compelling argument than Rosenbaum did here, drawing a connection to the mathematical technique of proof by contradiction.
Stochastic Proof by Contradiction
A common proof technique is to assume the opposite of what we are trying to prove, and look for contradictions. To prove the square root of 2 is irrational, we adopt the premise that it is rational. We do not adopt this premise because we believe it to be true, but rather because it allows us to create a world we can explore. The real world cannot contain a contradiction, so if the premise we adopted implies a contradiction, the premise must be false.
Because it is my favorite proof, I will share a perhaps familiar example. To prove the square root of 2 is irrational, we will adopt the premise that it is rational and can be represented as the ratio of two integers $p$ and $q$. Since this representation is not unique (if $\sqrt{2} = p/q,$ then $\sqrt{2} = (k \cdot p)/(k \cdot q)$ for all integers $k$), we require that $p$ and $q$ have no common factors.
Square both sides and re-arrange, giving $p^2 = 2 \cdot q^2$, so that $p^2$ must be an even number. Since an odd number squared is odd, $p$ must then be even, and can be written as $p = 2 \cdot r,$ so that $2 \cdot q^2 = 4 \cdot r^2,$ or $q^2 = 2 \cdot r^2.$ Then $q^2$ is even and so is $q$. But if both $p$ and $q$ are even, that contradicts the premise that $p$ and $q$ had no common factors. We thus conclude the square root of 2 cannot be represented as the ratio of two integers, and is thus irrational.
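As an aside, this classical argument has also been formalized in proof assistants. In Lean 4 with the Mathlib library (assuming Mathlib is available; `irrational_sqrt_two` is Mathlib's name for this lemma), the result can be invoked directly:

```lean
import Mathlib

-- Mathlib proves this via essentially the parity argument above.
example : Irrational (Real.sqrt 2) := irrational_sqrt_two
```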
Note how the premise we adopted allowed us to create a world we can explore, even though we don’t actually believe the premise. Exploring that world, looking for a contradiction, is what allows us to prove what we set out to prove. Of course, if we weren’t clever enough to find that contradiction, that wouldn’t prove the premise to be true. We specifically adopt premises we expect will be easily dismantled. Proof by contradiction is a powerful technique in mathematics.
A sharp null hypothesis creates a world in the form of a complete set of potential outcomes. Randomization puts a probabilistic layer on top of that world, allowing us to say how likely certain outcomes are. In this narrative, randomization is not a premise but a fact: it is how the experiment was actually conducted. The sharp null hypothesis is the premise. We don’t adopt that premise because we believe it to be true; we adopt it because it creates a world we can explore. In that world, perhaps the observed outcome is incredibly unlikely. It isn’t impossible; the probability associated with that outcome is always at least $1 / {N \choose n}$. But it may be incredibly unlikely nonetheless.
In this world, we can never find a logical contradiction, but we can find an incredibly unlikely outcome, and we might consider that almost as convincing. Finding a contradiction disproves the premise, and leads us to consider the alternative proven. This is impossible in an experiment. Instead, finding an incredibly unlikely outcome leads us to reject the premise, and to accept the alternative, not to the level of logical certainty, but nevertheless to a high degree of confidence. Confidence, or “strength of evidence,” stands in for logical certainty, and is the best we can do when it comes to causation.
In Defense of Null Hypothesis Significance Testing
Note that in adopting a sharp null hypothesis, either of no effect for any individual, or for any specific pattern of effects, we are not claiming we believe that hypothesis to be true. When we adopted the premise the square root of 2 was rational, we weren’t claiming that to be true either! We adopt these premises to see if we can find a contradiction. We may or may not find one, but either way, we start from a place of trying to disprove the premise.
People sometimes argue against null hypothesis testing since they don’t believe null hypotheses are ever exactly true. This perspective misses the whole point! It would be like someone reading only the first sentence of the $\sqrt{2}$ proof (where we assume it is rational) and refusing to go any further because they don’t believe the premise. These premises are not to be believed; they are to be explored.
In testing a null hypothesis, we might find enough evidence to reject the hypothesis as false (not to the level of logical certainty but to a certain “strength of evidence”). Or we might not find such evidence, in the same way we might not have been clever enough to find a contradiction from the rationality of $\sqrt{2}.$ Failing to find evidence against the null hypothesis in no sense provides evidence in favor of the null hypothesis.
Summary
When we define causation in relation to counterfactual states, we start from a position where causation isn’t even a well-defined concept. Randomization is one mechanism for rendering counterfactual states and potential outcomes well-defined.
Though they are well-defined, counterfactual outcomes are unobservable, and so causal effects are unobservable. Randomization does not change this. Even with arbitrarily large sample sizes, randomization cannot establish causation to the level of logical certainty. But an experiment can be thought of as a “stochastic proof by contradiction”, replacing logical certainty with a high degree of confidence, and this is the best we can do in causal inference. It is in this sense that randomization provides the reasoned basis for inference in experiments.
References
This entire note discusses the first couple of chapters of the 2023 book, “Causal Inference”, by Paul Rosenbaum.
I came across the idea of “stochastic proof by contradiction” in Peng Ding’s 2016 paper, “A Paradox From Randomization-Based Causal Inference”. Ding gives credit for this idea to Donald Rubin’s 2004 paper, “Teaching Statistical Inference for Causal Effects in Experiments and Observational Studies”. Rubin uses this phrasing also in his 2015 book with Guido Imbens, “Causal Inference for Statistics, Social, and Biomedical Sciences”, chapter 5.
My defense of Null Hypothesis Significance Testing is influenced by Rosenbaum’s many other writings. For example, see Chapter 2 of his 2017 book, “Observation and Experiment”.