Modes of Inference in Randomized Experiments
1 Introduction
Here’s a Statistics 101 question. Suppose we have random samples from two populations: \( F \) and \( G \). We want to know whether \( F \) and \( G \) have the same distribution. We might persuade ourselves \( F \) and \( G \) differ, if at all, only in their means. If the means of our two samples differ radically, that’s evidence \( F \) and \( G \) differ.
Statistics 101 teaches about t-tests for this kind of situation. If the difference in sample means is statistically significant, we conclude \( F \) and \( G \) really are different. I am glossing over many nuances, many alternative approaches, but I have no need to complicate things here. We use samples to compare populations.
What about experiments? In an experiment, we randomly assign people to either a treatment or control group. Now we are seemingly back in the Stats 101 scenario: we have observations from two groups. Many data scientists reach for the t-test to decide whether the observed difference in means is statistically significant. If so, they conclude the treatment has some nonzero effect. The t-test compares populations. In an experiment, what are these populations?
In any kind of causal inference, experimental or observational, the key metaphysical quantities are the potential outcomes. Each person, \( i, \) in the experiment has a potential outcome corresponding to each condition: treatment, \( Y_i(1), \) and control, \( Y_i(0). \) We define causal effects in terms of comparisons of potential outcomes. The effect of treatment on individual \( i \) is \( Y_i(1) - Y_i(0). \) The overall effect is some kind of aggregation of the individual effects, perhaps a simple average.
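To make the notation concrete, here is a tiny sketch with made-up numbers. The variable names (`y0`, `y1`, `z`, `y_obs`) and the four hypothetical people are my own illustration, not data from any study. In a real experiment we would never see both columns at once.

```python
# Hypothetical potential outcomes for four people (made-up numbers).
y0 = [10.0, 12.0, 9.0, 11.0]   # Y_i(0): outcome under control
y1 = [13.0, 12.0, 10.0, 15.0]  # Y_i(1): outcome under treatment

# Individual effects Y_i(1) - Y_i(0). These are never observable in
# practice, because each person reveals only one potential outcome.
effects = [a - b for a, b in zip(y1, y0)]

# One possible aggregation: the simple average of the individual effects.
ate = sum(effects) / len(effects)

# A particular treatment assignment reveals one outcome per person.
z = [1, 0, 0, 1]
y_obs = [y1[i] if z[i] == 1 else y0[i] for i in range(len(z))]

print(effects, ate, y_obs)
```

The observed data are only `z` and `y_obs`; everything else in the sketch is hidden from the analyst.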
When we randomly assign people to treatment or control, we get random samples from the two potential outcome populations, \( Y(1) \) and \( Y(0). \) When we do a t-test, we’re comparing these populations. If we conclude they differ, we conclude the treatment had an effect.
Still, there are a couple of differences from the Stats 101 scenario. The potential outcome populations are finite. The t-test is based on samples from infinite populations. The potential outcome populations derive from pairs of quantities for each person: they’re clearly entwined. The t-test is based on independent populations. Do these differences matter?
2 The Problem and the Opportunity
Freedman (2008) discusses the peril of ignoring the special structure of the potential outcomes when analyzing experiments. Analysts sometimes start with a regression model for the outcome: \[ Y = \alpha + \tau \cdot Z + \beta^T X + \epsilon, \] where \( Z \) is a binary treatment indicator and \( X \) is some collection of observed covariates. We fit the model, and since the treatment is randomly assigned, the coefficient, \( \tau, \) has a causal interpretation. In this story, the model comes first, and the random treatment assignment comes only at the end, as a finishing salt, when we interpret the results.
Freedman wrote, “Since randomization does not justify the models, almost anything can happen.” What he meant, and what his simulations support, is that treatment effects we estimate from regression models might be correct, or they might not be, and it’s not always obvious when the regression modeling strategy works. This isn’t the sort of thing you’ll hear in Stats 101.
To cut off any panic that might be growing in my audience, rest assured that your estimates are probably fine. The scenario Freedman considered is only a problem with small sample sizes (Lin 2013). The point is that regression models do not automatically give correct answers.
Randomized experiments have a special structure, similar to but not entirely the same as sampling problems. Ignoring that special structure can lead to wrong answers. Putting that special structure front and center not only avoids these and other problems but also provides opportunities for variance reduction and even reliable inferences in observational studies (Rosenbaum 2002a, 2002b).
3 Modes of Inference in Randomized Experiments
Rubin (1990) discussed three “formal modes of inference” in addition to the regression modeling approach discussed above. Rubin calls that approach superpopulation frequency inference and finds it the least appealing option, in no small part because: “users of such models often appear to have absolutely no idea of the structure of an underlying framework for causal effects, since the identical models are more usually applied to purely descriptive data” (page 290).
Instead, Rubin promotes randomization-based tests of sharp null hypotheses, randomization-based inference for sampling distributions of estimands, and Bayesian inference. All three of these approaches might be called randomization inferences, since all three put the random assignment of treatments at the core of the analysis. Rubin opined in this 1990 paper that randomization inference was rarely used, not because it doesn’t work well, but because no one seems to have heard of it. From my vantage point in 2024, little has changed.
In the remainder of this note, I will comment briefly on these three randomization-based approaches, and why anyone analyzing experiments should reach for them first.
4 Tests of Sharp Null Hypotheses
A sharp null hypothesis specifies the treatment effect for each individual in the study. Equivalently, it specifies the counterfactual outcome for each person. Examples include:
- the hypothesis of no effect for any individual,
- a constant additive effect for everyone,
- a constant effect within pre-defined subpopulations, or
- a long list of individual treatment effects.
Because sharp null hypotheses provide the full set of potential outcomes, we may explore the consequences of any pattern of treatment assignments. The known probability of each pattern allows us to assign a probability to each such consequence, and we may therefore assess how likely the observed outcome would be if the hypothesis were true. The t-test may be motivated as an approximation to this method. This approach generalizes the mathematical notion of proof by contradiction to a stochastic setting (Rubin 2004).
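The logic is easy to sketch in code for the simplest sharp null, no effect for any individual. What follows is a minimal illustration, assuming complete randomization (every treated group of the given size equally likely), a difference-in-means test statistic, and an experiment small enough to enumerate every assignment. The function names and the eight outcomes are made up.

```python
from itertools import combinations


def diff_in_means(y, treated):
    """Mean outcome of the treated units minus mean outcome of the rest."""
    t = [y[i] for i in range(len(y)) if i in treated]
    c = [y[i] for i in range(len(y)) if i not in treated]
    return sum(t) / len(t) - sum(c) / len(c)


def randomization_p_value(y_obs, treated_obs):
    """Exact two-sided p-value under the sharp null of no effect for anyone.

    Assumes complete randomization: every subset of size len(treated_obs)
    was equally likely to be the treated group.
    """
    n, n_t = len(y_obs), len(treated_obs)
    observed = abs(diff_in_means(y_obs, set(treated_obs)))
    # Under this sharp null, each person's outcome would have been the same
    # under any assignment, so we can recompute the statistic for all of them.
    stats = [
        abs(diff_in_means(y_obs, set(subset)))
        for subset in combinations(range(n), n_t)
    ]
    # Small tolerance guards against floating-point ties.
    return sum(s >= observed - 1e-12 for s in stats) / len(stats)


# Made-up data: eight people, four treated (indices 0, 3, 4, 6).
y_obs = [13.0, 12.0, 9.0, 15.0, 16.0, 8.0, 14.0, 10.0]
treated_obs = [0, 3, 4, 6]
p = randomization_p_value(y_obs, treated_obs)
print(p)
```

With eight people there are only 70 possible assignments. For larger experiments, a common workaround is to sample assignments at random rather than enumerate them, at the cost of a Monte Carlo approximation to the p-value.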
We need not believe any particular hypothesis to check what it implies. In fact, such inferences implicitly check all possible hypotheses. We reject some as inconsistent with the data. We retain others, not as plausible necessarily, but more accurately as not inconsistent with the data.
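One way to picture "checking all possible hypotheses" is to restrict attention to a family of sharp nulls, say constant additive effects, and test each member in turn. The sketch below does this over a coarse grid, assuming complete randomization and a difference-in-means statistic. The function name, the grid, the 0.05 threshold, and the data are all my own illustrative choices.

```python
from itertools import combinations


def p_value_constant_effect(y_obs, treated_obs, tau):
    """Exact two-sided randomization p-value for the sharp null
    Y_i(1) - Y_i(0) = tau for every person i (complete randomization)."""
    n, n_t = len(y_obs), len(treated_obs)
    treated_obs = set(treated_obs)
    # Under this null, subtracting tau from each treated person's outcome
    # recovers Y_i(0) for everyone, and Y_i(0) does not depend on assignment.
    y0 = [y - tau if i in treated_obs else y for i, y in enumerate(y_obs)]
    total = sum(y0)

    def stat(treated):
        s = sum(y0[i] for i in treated)
        return abs(s / n_t - (total - s) / (n - n_t))

    observed = stat(treated_obs)
    stats = [stat(subset) for subset in combinations(range(n), n_t)]
    return sum(s >= observed - 1e-12 for s in stats) / len(stats)


# Made-up data: eight people, four treated (indices 0, 3, 4, 6).
y_obs = [13.0, 12.0, 9.0, 15.0, 16.0, 8.0, 14.0, 10.0]
treated_obs = [0, 3, 4, 6]

# Candidate constant effects from -2.0 to 10.0 in steps of 0.5.
grid = [t / 2 for t in range(-4, 21)]
retained = [
    tau for tau in grid
    if p_value_constant_effect(y_obs, treated_obs, tau) > 0.05
]
print(retained)
```

The retained values are the hypotheses in the grid that are not inconsistent with the data at the 5% level. Within the constant additive effect family, they form a confidence set for the effect; nothing here says the effect really is constant.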
The approach is thought-provoking, offering perhaps the purest form of causal inference. Rosenbaum (2020, page 37; 2019, § 3) argues its virtues, promising assumption-free inferences. I have found it rewarding to reflect on what assumption-free means here. This flavor of randomization inference is feasible in many cases of practical significance (Rosenbaum 2001; Baiocchi et al. 2010), and completely intractable in others.
5 Inference for Sampling Distributions of Estimands
This approach starts with the formula for the causal estimand, expressed in terms of the potential outcomes and treatment assignment indicators. For example, we may write the difference in sample means as:
\begin{align} \widehat{\mathrm{ATE}} &= \frac{1}{N_T} \sum_{i \in T} Y_i^\textrm{obs} - \frac{1}{N_C} \sum_{i \in C} Y_i^\textrm{obs} \nonumber \\ &= \sum_{i} \left[ \frac{1}{N_T} \cdot Z_i \cdot Y_i^\textrm{obs} - \frac{1}{N_C} (1 - Z_i) \cdot Y_i^\textrm{obs} \right] \nonumber \\ &= \sum_{i} \left[ \frac{1}{N_T} \cdot Z_i \cdot Y_i(1) - \frac{1}{N_C} (1 - Z_i) \cdot Y_i(0) \right]. \label{eqn:ate} \end{align}
This estimator is unbiased for the average treatment effect:
\begin{align*} \mathbf{E}\left[ \widehat{\mathrm{ATE}} \right] &= \sum_{i} \left[ \frac{1}{N_T} \cdot \mathbf{E}[Z_i] \cdot Y_i(1) - \frac{1}{N_C} \cdot \mathbf{E}[1 - Z_i] \cdot Y_i(0) \right] \\ &= \sum_{i} \left[ \frac{1}{N_T} \cdot \frac{N_T}{N} \cdot Y_i(1) - \frac{1}{N_C} \cdot \frac{N_C}{N} \cdot Y_i(0) \right] \\ &= \frac{1}{N} \cdot \sum_{i} \left[ Y_i(1) - Y_i(0) \right] \\ &= \mathrm{ATE}. \end{align*}
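The expectation above is over treatment assignments, with the potential outcomes held fixed. For a small enough experiment we can check the claim by brute force: enumerate every assignment, compute the estimate for each, and average. The potential outcomes below are made up, and I assume complete randomization with three of six people treated.

```python
from itertools import combinations

# Made-up potential outcomes for six people. Both columns are known here
# only because this is a thought experiment.
y0 = [10.0, 12.0, 9.0, 11.0, 14.0, 8.0]
y1 = [13.0, 12.0, 10.0, 15.0, 13.0, 11.0]

n, n_t = len(y0), 3
n_c = n - n_t
true_ate = sum(a - b for a, b in zip(y1, y0)) / n

estimates = []
for treated in combinations(range(n), n_t):
    z = [1 if i in treated else 0 for i in range(n)]
    # The estimator written in terms of Z_i, Y_i(1), and Y_i(0).
    est = sum(
        z[i] * y1[i] / n_t - (1 - z[i]) * y0[i] / n_c for i in range(n)
    )
    estimates.append(est)

# Every assignment is equally likely, so the expectation is a plain average.
mean_estimate = sum(estimates) / len(estimates)
print(true_ate, mean_estimate)
```

Individual estimates vary from one assignment to the next, but their average over all 20 assignments matches the true average treatment effect up to floating-point error.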
The key difference between randomization inference and the regression modeling approach is that we treat the outcomes as fixed, not random quantities. Instead, it is the treatment assignment that is random, for we, the experimenters, made it so.
Based on the variance of the estimator in Equation (1), we calculate p-values and confidence intervals, using a procedure yet again resembling the t-test. For example, an approximate 95% confidence interval is the point estimate plus or minus 2 standard errors. Estimating that variance turns out to be challenging when treatment effects vary from person to person, because it depends on how the two potential outcomes vary together, and we never observe both for the same person. Conservative methods are nevertheless readily available (Imbens and Rubin 2015, § 6).
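A minimal sketch of one such conservative method follows. It uses the familiar variance estimator that adds the sample variance of each group divided by its size, which is commonly attributed to Neyman; I am assuming that is the kind of estimator meant here. The function name and data are made up, and eight observations are far too few for the normal approximation to be trusted. The numbers only show the arithmetic.

```python
from math import sqrt
from statistics import mean, variance


def neyman_interval(y_treated, y_control, z_crit=1.96):
    """Difference in means with a conservative standard error and a
    normal-approximation confidence interval (95% for z_crit=1.96)."""
    est = mean(y_treated) - mean(y_control)
    # Sum of within-group sample variances, each divided by group size.
    # This tends to overstate the true randomization variance when
    # treatment effects vary from person to person.
    se = sqrt(
        variance(y_treated) / len(y_treated)
        + variance(y_control) / len(y_control)
    )
    return est, se, (est - z_crit * se, est + z_crit * se)


# Made-up observed outcomes.
y_treated = [13.0, 15.0, 16.0, 14.0]
y_control = [12.0, 9.0, 8.0, 10.0]
est, se, (lo, hi) = neyman_interval(y_treated, y_control)
print(est, se, lo, hi)
```

The calculation is the same one behind the unequal-variance t-test; what changes is the justification, which rests on the random assignment rather than on sampling from infinite populations.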
This approach relies on sample sizes large enough to satisfy the conditions of the central limit theorem. In contrast, sharp hypothesis-based approaches work with any sample size. The sampling distribution approach bypasses the computational difficulties that sometimes doom sharp hypothesis approaches.
6 Bayesian Methods
The Bayesian approach computes the posterior distribution of the counterfactual outcomes given observed outcomes, the pattern of treatment assignments (uninformative in randomized experiments but informative in observational studies), and other observed covariates. Armed with estimates of the counterfactual outcomes, we may compute posterior distributions for any causal estimands we wish, not just average treatment effects. The price of admission is a model for the outcomes given the covariates, and a prior distribution for any parameters in that model.
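To give a flavor of the mechanics, here is a deliberately crude sketch of imputing counterfactual outcomes. It leans on strong simplifying assumptions of my own choosing: no covariates, normal outcomes within each arm, variances fixed at their sample values rather than given a prior, flat priors on the two means, and potential outcomes treated as independent within a person. A serious analysis would relax and check every one of these. The function name and data are made up.

```python
import random
from math import sqrt
from statistics import mean, variance


def posterior_ate_draws(y_obs, z, n_draws=2000, seed=0):
    """Draws of the average treatment effect among the people in the study,
    obtained by imputing each person's unobserved potential outcome."""
    rng = random.Random(seed)
    yt = [y for y, zi in zip(y_obs, z) if zi == 1]
    yc = [y for y, zi in zip(y_obs, z) if zi == 0]
    mt, mc = mean(yt), mean(yc)
    # Assumption: standard deviations are plugged in, not estimated
    # with their own uncertainty.
    st, sc = sqrt(variance(yt)), sqrt(variance(yc))
    draws = []
    for _ in range(n_draws):
        # Draw the arm means from their (flat-prior) posteriors.
        mu_t = rng.gauss(mt, st / sqrt(len(yt)))
        mu_c = rng.gauss(mc, sc / sqrt(len(yc)))
        effects = []
        for y, zi in zip(y_obs, z):
            if zi == 1:
                # Treated: Y_i(1) observed, impute Y_i(0).
                effects.append(y - rng.gauss(mu_c, sc))
            else:
                # Control: Y_i(0) observed, impute Y_i(1).
                effects.append(rng.gauss(mu_t, st) - y)
        draws.append(mean(effects))
    return draws


# Made-up data: eight people, four treated.
y_obs = [13.0, 12.0, 9.0, 15.0, 16.0, 8.0, 14.0, 10.0]
z = [1, 0, 0, 1, 1, 0, 1, 0]
draws = posterior_ate_draws(y_obs, z)
print(mean(draws), sum(d > 0 for d in draws) / len(draws))
```

Because each draw contains a complete set of imputed potential outcomes, swapping the final `mean(effects)` for a median, a quantile, or the share of people with positive effects gives a posterior for that estimand instead.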
Rubin (1990) and Imbens and Rubin (2015, § 8) praise the Bayesian approach as the most flexible. Rubin also warns the method may be easily misused if efforts are not taken to validate the models. Rubin suggests using the other randomization approaches as part of such validation.
7 Summary and Recommendations
Discussing the evidence regarding smoking and cancer, Hill (1965) wrote, “Like fire, the \( \chi^2 \) test is an excellent servant and a bad master.” We may take this more broadly as a warning to understand the tools we use.
Methods rooted in regression models, or otherwise relegating randomization to a finishing salt imbuing insights with a causal interpretation, may (and often do) give reliable answers. Or they may not. Methods rooted in the potential outcomes put randomization at the core of the analysis. They range from assumption-free methods based on sharp null hypotheses, to asymptotic approaches appropriate for large sample sizes, to Bayesian methods offering the most flexibility.
Rubin lamented, above all, how unfamiliar practitioners were with the central role of randomization in the analysis of experiments. Approaches exploiting randomization deliver more reliable inferences than methods neglecting it. Randomization inference should be the first method we reach for when analyzing experiments.