# Sensitivity Analysis for Matched Sets with One Treated Unit

- 1 Introduction
- 2 Intuition
- 3 Example: Survey-Based Measurement
- 4 Units, Potential Outcomes, and Treatment Effects
- 5 Test Statistics for Matched Sets
- 6 Strong Ignorability and The Distribution of Treatment Assignments
- 7 The p-value is Increasing in the Unobserved Covariates
- 8 Extreme Configurations
- 9 The Sensitivity of an Observational Study
- 10 The Capacity of an Observational Study
- 11 Sensitivity Analysis with Large Sample Sizes
- 12 When are Observational Studies Persuasive?
- 13 Summary
- 14 References

## 1 Introduction

Observational studies involve uncertainty from the possible influence of unobserved confounding factors, correlated with both treatment status and outcome. This uncertainty adds to the uncertainty arising from the randomness of treatment selection. Controlled, randomized experiments involve only the latter source of uncertainty.

In an experiment, we use p-values and confidence intervals to quantify the uncertainty associated with a conclusion. In a sense, they quantify the strength of evidence of the study. The extra uncertainty present in an observational study invalidates p-values and confidence intervals as appropriate measures of strength-of-evidence.

In a series of papers, Paul Rosenbaum developed an approach for quantifying the strength-of-evidence in an observational study. I am working my way through these papers, starting with matched pairs (Rosenbaum 1987a) in my last post. In (Rosenbaum 1988), Rosenbaum extended the methods of the earlier paper to encompass matching with multiple controls.

## 2 Intuition

The paper discusses studies involving units divided into *strata,*
each containing one treated unit. In an experiment, we select at
random a unit from each stratum to be treated. The known selection
process results in a known distribution for the test statistic. This
known distribution allows us to calculate p-values, point estimates,
and confidence intervals.

In an observational study, units select treatment via some imperfectly understood mechanism. As a result, the distribution of the test statistic is unknown, possibly depending on unobserved factors, and we cannot perform statistical inference. Instead, the method of sensitivity analysis allows us to calculate plausible bounds on the p-value and point estimate, and to expand the confidence interval to account for the extra uncertainty.

The bounds on the p-value lead to a strength-of-evidence metric,
\( \Gamma^\star, \) appropriate for observational studies. This metric
tells us how unobserved factors would have to skew the study to
invalidate our conclusions. The metric does *not* tell us whether such
skew is present.

A sensitivity analysis makes the implicit, explicit, replacing statements like, “the study conclusions may be invalid because of unobserved factors,” with statements like, “to invalidate study conclusions, unobserved factors would need to be highly predictive of both the treatment selection and the outcome.” Such concrete statements let domain experts postulate and explore possible sources of bias.

## 3 Example: Survey-Based Measurement

The method described here is most useful when we have access to many more untreated units than treated units. For example, imagine fielding a survey with questions about exposure to TV advertising and perception of the brand. Supplemental questions would ask about likely confounding factors that might explain why a person would or would not have seen the TV ad.

Depending on the amount of overlap between the TV ad audience and the survey response pool, we might observe a small number of responses from people who saw the ad (treated) and a larger number who didn’t (untreated). Rather than match treated and untreated respondents one-to-one, throwing away the extra untreated responses, we can match multiple untreated respondents to each treated respondent. This makes more efficient use of an expensive resource (survey responses), and reduces the variance of the study conclusions.

Imagine we have used the supplemental survey questions to match one or more untreated respondents to each treated respondent, creating, say, 1,000 strata. There may be a variable number of respondents in each stratum. For simplicity of this example, we’ll assume there are 3 respondents in each stratum: 1 who saw the ad and 2 who didn’t.

Next we must choose an appropriate test statistic. The Mantel-Haenszel statistic is the number of strata where the person who saw the ad had positive brand perception. If TV advertising improves brand perception, many or most of the strata should have this pattern, and the test statistic will be high. Suppose there were 675 strata (out of 1,000) where the respondent who was exposed to the TV ad had positive brand perception, so the observed test statistic is 675.

If this were an experiment, we would identify the strata first. Then we would expose, at random, one unit in each stratum to TV advertising. This selection process would induce a known distribution on the test statistic. We would use that distribution to calculate p-values, point estimates, and confidence intervals.

Since this *isn’t* an experiment, a p-value is not an appropriate
measure of the evidence in the study. Instead we calculate the
*sensitivity,* using a large-sample approximation.

Whether we calculate a p-value or the sensitivity, we need to know the number of respondents in each stratum with positive brand perception. In our simple example with three respondents in each stratum, this is a number between 0 and 3, inclusive. The table below shows the number of strata having each pattern. The overall number of people with positive brand perception is the dot product of the two columns, 1,500, or 50% of respondents.

Number with Positive Brand Perception | Number of Strata |
---|---|
0 | 300 |
1 | 250 |
2 | 100 |
3 | 350 |
Total | 1000 |

Note that 675 out of 1,000, or 67.5%, of treated respondents had
positive brand perception, while 1,500 out of 3,000 respondents
overall did. It follows that 825 out of 2,000, or 41.25%, of untreated
respondents had positive brand perception. Taking this at face value,
it suggests TV advertising improved brand perception by about 26
percentage points, a *huge* effect size. In an experiment, this would
be an unbiased estimate of the effect.
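The arithmetic above is easy to check directly; a quick sketch, using the counts from the table:

```
# Counts from the table: positives per stratum -> number of strata.
counts = {0: 300, 1: 250, 2: 100, 3: 350}
t_obs = 675  # strata where the treated respondent had positive perception

total_positive = sum(k * n for k, n in counts.items())
treated_rate = t_obs / 1000
untreated_rate = (total_positive - t_obs) / 2000

print(total_positive)                           # 1500, or 50% of all 3,000 respondents
print(treated_rate)                             # 0.675
print(untreated_rate)                           # 0.4125
print(round(treated_rate - untreated_rate, 4))  # 0.2625, about 26 percentage points
```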

The code below explores a range of deviations from the ideal of a randomized experiment. It calculates the upper bound on the p-value corresponding to various hidden biases. The \( \Gamma = 1 \) case corresponds to no unobserved confounders; with such a small p-value, random chance cannot plausibly explain the observed result. If this were a randomized experiment, we would interpret this as strong evidence TV advertising improves brand perception.

```
>>> import math
>>> import scipy.stats
>>> def pvalue_upper(Gamma):
...     """Calculate upper bound on p-value."""
...     pi = [300, 250, 100, 350]
...     t_obs = 675
...     mu_u = (
...         pi[1] * Gamma / (Gamma + 2)
...         + 2 * pi[2] * Gamma / (2 * Gamma + 1)
...         + pi[3]
...     )
...     sigma_u_2 = 2 * Gamma * (
...         pi[1] / ((Gamma + 2) ** 2)
...         + pi[2] / ((2 * Gamma + 1) ** 2)
...     )
...     z = (t_obs - mu_u) / math.sqrt(sigma_u_2)
...     return scipy.stats.norm(0, 1).sf(z)
>>> for Gamma in range(1, 29, 3):
...     print(
...         "Gamma: {0:2d} -> p-value: {1:.03g}"
...         .format(Gamma, pvalue_upper(Gamma))
...     )
```

```
Gamma: 1 -> p-value: 6.32e-88
Gamma: 4 -> p-value: 4.54e-18
Gamma: 7 -> p-value: 5.98e-08
Gamma: 10 -> p-value: 0.000313
Gamma: 13 -> p-value: 0.0173
Gamma: 16 -> p-value: 0.135
Gamma: 19 -> p-value: 0.39
Gamma: 22 -> p-value: 0.663
Gamma: 25 -> p-value: 0.85
Gamma: 28 -> p-value: 0.943
```

Since this is *not* a randomized experiment, a p-value is not an
appropriate measure of the evidence in the study. Instead we note that
somewhere between \( \Gamma = 13 \) and \( \Gamma = 16, \) the upper
bound on the p-value crosses the standard significance threshold,
\( \alpha = 0.05. \) It is straightforward to calculate the exact value
where this occurs using bisection. The code below shows that when
\( \Gamma \approx 14.3, \) the upper bound on the p-value equals
\( 0.05. \) This value is the *sensitivity* of the study, denoted
\( \Gamma^\star. \)

```
>>> import scipy.optimize
>>> alpha = 0.05
>>> Gamma_star = scipy.optimize.bisect(
...     lambda Gamma: pvalue_upper(Gamma) - alpha,
...     13, 16
... )
>>> print("{0:.01f}".format(Gamma_star))
```

```
14.3
```

A sensitivity analysis tells us the magnitude of confounding needed to invalidate the statistical significance of a finding. In this case, the confounding would have to make some people more than 14 times as likely as others in the same stratum to have seen the TV ad. Since we constructed strata to group similar people together, we would have to have neglected some pretty important factors for such skew to occur.

If this were an experiment, a 26 percentage point lift in perception would be an unbiased estimate of the effect size. Sensitivity analysis allows us to explore how hidden biases might skew these effect estimates, widening confidence intervals to reflect the extra uncertainty. Rosenbaum describes how to do so for binary outcomes in (Rosenbaum 2002), using the method of “asymptotic separability” developed in (Gastwirth, Krieger, and Rosenbaum 2000). I hope to understand this material soon!

## 4 Units, Potential Outcomes, and Treatment Effects

Suppose we have \( n \) units or individuals, with \( S \) units having selected treatment and the others acting as controls. We assume we matched each treated unit with \( n_s - 1 \geq 1 \) controls, creating \( S \) strata. There are \( n_s \) units in the \( s\textrm{th} \) stratum, exactly one of which selected treatment.

To adjust for observed covariates, we should choose the \( n_s - 1 \) controls in a stratum based on similarity to the treated unit. We won’t discuss here how we might create such strata. Instead we treat the strata as given.

We could label the treated unit as the first in each stratum, but it will prove useful to order units arbitrarily within strata, letting a variable indicate treatment status. Let \( Z_{si} = 1 \) if the \( i\textrm{th} \) unit in the \( s\textrm{th} \) stratum selects treatment and 0 otherwise. Because of the matching logic, \( \sum_{i=1}^{n_s} Z_{si} = 1 \) for each stratum.

We observe one or more covariates for each unit, which we represent with the vector, \( x_{si}. \) We assume we constructed the strata to achieve approximate, but perhaps not exact, homogeneity in each stratum. Let \( \chi_s = \frac{1}{n_s} \sum_{i=1}^{n_s} x_{si} \) be the vector of average covariate values in the \( s\textrm{th} \) stratum.

We also assume a single unobserved covariate, \( u_{si}, \) for each unit. Though there may be many unobserved factors relevant to the treatment selection and the outcome, we will see multiple factors are equivalent to a single unobserved factor.

Each unit has a pair of potential outcomes, \( (y_{si}(0), y_{si}(1)), \) where \( y_{si}(0) \) is the outcome the unit would exhibit if assigned to control and \( y_{si}(1) \) if assigned to test. The observed response is \( Y_{si}^\textrm{obs} = y_{si}(Z_{si}). \)

A sharp null hypothesis is any hypothesis allowing us to impute the counterfactual response, \( y_{si}(1 - Z_{si}). \) The sharp null hypothesis of no effect indicates \( y_{si}(1) = y_{si}(0) \) for all units. The model of a constant, additive effect proposes \( y_{si}(1) = y_{si}(0) + \tau \) for all units, while a constant, multiplicative effect posits \( y_{si}(1) = \tau \times y_{si}(0).\) More generally, a hypothesis may specify the treatment effect for each individual, \( \tau_{si} = y_{si}(1) - y_{si}(0). \)

People often balk at the idea of a null hypothesis, let alone a sharp one. What are the odds we would stumble upon the true hypothesis? The null hypothesis creates a world we can explore. We can explore this world without believing it. Rather, we use the conceit to identify the extent to which the data contradict the hypothesis. We may choose to reject some hypotheses, while retaining others as plausible. In this sense, a null hypothesis plays the same role as “assuming the negation” in a mathematical proof by contradiction. Null hypothesis testing offers a kind of “stochastic proof by contradiction” (Rubin 2004). I’ve written about this perspective previously.

We may adjust the observed response to remove the hypothesized
treatment effect from each treated unit, recovering the control
outcome, \( r_{si} := Y_{si}^\textrm{obs} - \tau_{si} \cdot Z_{si}. \)
Under the null hypothesis, the adjusted response is a fixed quantity,
\( r_{si} = y_{si}(0), \) not depending on the treatment selection.
This relationship is *only* true under the null hypothesis, not
generally, justifying the distinct notation. The adjusted response,
\( r_{si}, \) is something we can calculate based on observed
information (and the null hypothesis we consider). The potential
outcome, \( y_{si}(0), \) is something we observe only for the
untreated units, and can otherwise only speculate about.
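A small illustration with hypothetical numbers: under a constant, additive effect, adjusting the observed responses recovers the control outcome for every unit.

```
import numpy as np

tau = 2.0                          # hypothesized constant additive effect
Z = np.array([1, 0, 0])            # one treated unit in a stratum of three
y0 = np.array([3.1, 2.9, 3.4])     # control potential outcomes (unknown in practice)

Y_obs = y0 + tau * Z               # what we would observe if the hypothesis held
r = Y_obs - tau * Z                # adjusted responses

print(r)                           # equal to y0: fixed, not random, under the null
```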

We denote by \( Z_s \) the vector with components \( Z_{si}, \) and \( Z \) the vector concatenating each \( Z_s. \) Define \( u_s, \) \( u, \) \( r_s, \) and \( r \) similarly. Let \( x_s \) be the matrix whose \( i\textrm{th} \) row is \( x_{si} \) and let \( x \) be the matrix achieved by stacking the \( x_s \) matrices. Let \( \mathcal{Z} \) be the set of treatment selection patterns consistent with the strata structure: \( \mathcal{Z} \subset \{0, 1\}^n, \) and \( z \in \mathcal{Z} \Leftrightarrow \sum_{i=1}^{n_s} z_{si} = 1 \) for all \( s. \)

We have used lowercase symbols to denote fixed, nonrandom quantities, and capital symbols to denote random quantities. For example, we consider the potential outcomes \( (y_{si}(0), y_{si}(1)) \) fixed, while the observed response, \( Y_{si}^\textrm{obs}, \) is random because it depends on the treatment status. We will need to represent conditional probabilities, and so we use notation like \( \mathrm{Pr}\{ \cdot \, | \, X_{si} = x_{si} \} \) to denote a probability for a unit having a particular set of covariate values. This notation makes more sense if we think of the units in the study as a sample from a population, but our inferences come from the selection mechanism, not from a sampling model.

## 5 Test Statistics for Matched Sets

(Rosenbaum 1988)
considers test statistics in the *sign-score* family. We assign each
unit a binary *sign,* \( c_{si} \in \{0, 1\}, \) allowed to depend on
the adjusted outcomes, \( r, \) but not directly on the treatment
selection. As a result, the signs are fixed (not random) quantities
under the null hypothesis.

We assign each stratum a non-negative *score,* \( q_s \geq 0, \)
allowed to depend on the adjusted outcomes but not directly on the
treatment selection. The scores are also fixed quantities under the
null hypothesis.

A sign-score test statistic has the form

\begin{equation}\label{eqn:sign_score} t(r, Z) = \sum_{s=1}^S q_s( r) \cdot \sum_{i=1}^{n_s} c_{si}( r) \cdot Z_{si}. \end{equation}

The Mantel-Haenszel statistic is appropriate for binary outcomes, setting \( q_s = 1 \) and \( c_{si} = r_{si}. \) This statistic is not limited to strata with a single treated unit, but in the case considered here, the test statistic is the number of strata where the treated unit exhibited a positive response.

For continuous outcomes, we might set \( q_s = 1 \) for all \( s, \) and \( c_{si} = 1 \) if the outcome for unit \( i \) exceeded the mean, or median, or 75th percentile of the outcomes in stratum \( s. \) (Requiring the outcome to exceed a higher threshold increases power against large effect sizes, where the treated unit may in fact be best in most strata, but may decrease power against small effect sizes. Choose based on domain knowledge and simulation!) The test statistic is the number of strata where the treated unit stood out as exceptional.

Anticipating the treatment may have a large effect on the upper tail by pushing a typical unit to exhibit an exceptional outcome, we might set \( q_s \) by ranking the strata according to the inter-quartile range. If the effect of treatment is to move a unit up above the 75th percentile, strata with larger interquartile ranges provide greater evidence of treatment effect, provided the treated unit is exceptional in those strata.
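A sketch of computing one such sign-score statistic on toy data (the data, the median-based signs, and the IQR-based scores are illustrative choices, not prescriptions from the paper):

```
import numpy as np

rng = np.random.default_rng(0)

S, n_s = 5, 3
r = rng.normal(size=(S, n_s))        # adjusted outcomes, one row per stratum
Z = np.zeros((S, n_s), dtype=int)
Z[:, 0] = 1                          # say the first unit in each stratum is treated

# Signs: 1 if a unit's adjusted outcome exceeds its stratum median.
c = (r > np.median(r, axis=1, keepdims=True)).astype(int)

# Scores: rank strata by interquartile range, 1 = smallest spread.
iqr = np.subtract(*np.percentile(r, [75, 25], axis=1))
q = iqr.argsort().argsort() + 1

# t(r, Z) = sum_s q_s * sum_i c_si * Z_si
t = int(np.sum(q * np.sum(c * Z, axis=1)))
print(t)
```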

## 6 Strong Ignorability and The Distribution of Treatment Assignments

We say the treatment assignment is *strongly ignorable given*
\( X \) if (i) the potential outcomes are independent of the treatment
selection given \( X: \)
\[
(Y(0), Y(1)) \perp\!\!\!\perp Z \, | \, X,
\]
and (ii) all units have nonzero probability of experiencing either
treatment condition: \( 0 < \mathrm{Pr}\{ Z_{si} = 1 \, | \, X = x \} < 1 \)
for all \( x \) and all \( s \) and \( i. \)

Rather than assuming the study is indeed strongly ignorable given
\( X, \) we instead assume the study is strongly ignorable given
\( X \) **and** \( U. \) Moreover, we assume the selection probabilities
satisfy a logistic model:

\begin{align} \kappa(\chi_s) + u_{si} &= \log \frac{\mathrm{Pr}\{ Z_{si} = 1 \, | \, \Xi_s = \chi_s, U_{si} = u_{si} \}}{\mathrm{Pr}\{ Z_{si} = 0 \, | \, \Xi_s = \chi_s, U_{si} = u_{si} \}}. \nonumber \\ &= \log \frac{\mathrm{Pr}\{ Z_{si} = 1 \, | \, Y(0) = y(0), Y(1) = y(1), X = x, U = u \}}{\mathrm{Pr}\{ Z_{si} = 0 \, | \, Y(0) = y(0), Y(1) = y(1), X = x, U = u \}}. \label{eqn:logistic} \end{align}

This is a weak assumption. If there is a collection of unobserved factors, \( u^1, u^2, \ldots, \) rendering the treatment selection strongly ignorable, they are equivalent to a single unobserved factor. We have

\begin{align*} & \mathrm{Pr}\{ Z_{si} = 1 \, | \, Y(0) = y(0), Y(1) = y(1), X = x, U^1 = u^1, U^2 = u^2, \ldots \} \\ &= \mathrm{Pr}\{ Z_{si} = 1 \, | \, X = x, U^1 = u^1, U^2 = u^2, \ldots \} \\ &= \mathrm{Pr}\{ Z_{si} = 1 \, | \, X_{si} = x_{si}, U_{si}^1 = u_{si}^1, U_{si}^2 = u_{si}^2, \ldots \} \\ &=: \pi(x_{si}, u_{si}^1, u_{si}^2, \ldots), \end{align*}

where the second line follows from strong ignorability, and the third line assumes units select treatment solely on the basis of their own covariates. By strong ignorability, \( \pi(x_{si}, u_{si}^1, u_{si}^2, \ldots) \) is strictly between 0 and 1 for all argument values.

Define \[ \eta(x_{si}, u_{si}^1, u_{si}^2, \ldots) := \log \frac{ \pi(x_{si}, u_{si}^1, u_{si}^2, \ldots)}{1 - \pi(x_{si}, u_{si}^1, u_{si}^2, \ldots)}, \] and define \( \kappa(x_{si}) \) to be any approximation to \( \eta(x_{si}, u_{si}^1, u_{si}^2, \ldots) \) depending only on \( x_{si}, \) such as described here. Define

\begin{align*} u_{si} &= \eta(x_{si}, u_{si}^1, u_{si}^2, \ldots) - \kappa(\chi_s) \\ &= [\eta(x_{si}, u_{si}^1, u_{si}^2, \ldots) - \kappa(x_{si})] + [\kappa(x_{si}) - \kappa(\chi_s)]. \end{align*}

Then the selection probabilities automatically satisfy Equation (\ref{eqn:logistic}).

We expect \( u_{si} \) to be small whenever \( \kappa(\chi_s) \approx \kappa(x_{si}), \) which happens when the covariates are approximately homogeneous within strata, and when \( \kappa(x_{si}) \) is a good approximation to \( \eta; \) that is, when the observed covariates explain most of the variance in the treatment selection.

Conditioning on the strata structure eliminates \( \kappa(\chi_s) \) from the model:

\begin{align*} & \mathrm{Pr}\left\{ Z_{si} = 1 \, \middle| \, \sum_{j=1}^{n_s} Z_{sj} = 1, Y(0) = y(0), Y(1) = y(1), X = x, U = u \right\} \\ &= \mathrm{Pr}\left\{ Z_{si} = 1 \textrm{ and } Z_{sj} = 0 \, \forall \, j \neq i \, \middle| \, X = x, U = u \right\} \\ &\phantom{=} \hspace{20pt} \left/ \mathrm{Pr}\left\{ \sum_{j=1}^{n_s} Z_{sj} = 1 \, \middle| \, X = x, U = u \right\} \right. \\ & = \mathrm{Pr}\left\{ Z_{si} = 1 \, \middle| \, X = x, U = u \right\} \times \Pi_{j \neq i} \mathrm{Pr}\left\{ Z_{sj} = 0 \, \middle| \, X = x, U = u \right\} \\ &\phantom{=} \hspace{20pt} \left/ \sum_{k=1}^{n_s} \left[ \mathrm{Pr}\left\{ Z_{sk} = 1 \, \middle| \, X = x, U = u \right\} \right. \right. \\ &\phantom{=} \hspace{60pt} \left. \times \Pi_{j \neq k} \mathrm{Pr}\left\{ Z_{sj} = 0 \, \middle| \, X = x, U = u \right\} \right]\\ &= \frac{\exp(\kappa(\chi_s) + u_{si})}{1 + \exp(\kappa(\chi_s) + u_{si})} \times \Pi_{j \neq i} \frac{1}{1 + \exp(\kappa(\chi_s) + u_{sj})} \\ &\phantom{=} \hspace{20pt} \left/ \sum_{k=1}^{n_s} \frac{\exp(\kappa(\chi_s) + u_{sk})}{1 + \exp(\kappa(\chi_s) + u_{sk})} \times \Pi_{j \neq k} \frac{1}{1 + \exp(\kappa(\chi_s) + u_{sj})} \right. \\ &= \frac{\exp(\kappa(\chi_s) + u_{si}) / \Pi_{j=1}^{n_s} (1 + \exp(\kappa(\chi_s) + u_{sj}))}{\sum_{k=1}^{n_s} \exp(\kappa(\chi_s) + u_{sk}) / \Pi_{j=1}^{n_s} (1 + \exp(\kappa(\chi_s) + u_{sj}))} \\ &= \frac{\exp(\kappa(\chi_s)) \cdot \exp(u_{si})}{\exp(\kappa(\chi_s)) \cdot \sum_{k=1}^{n_s} \exp(u_{sk})} \\ &= \frac{\exp(u_{si})}{\sum_{k=1}^{n_s} \exp(u_{sk})} = \mathrm{Pr}\left\{ Z_{si} = 1 \, \middle| \, \sum_{j=1}^{n_s} Z_{sj} = 1, U_s = u_s \right\}. \end{align*}

Henceforth we omit the explicit conditioning on the strata structure, writing \[ \mathrm{Pr}\{ Z_{si} = 1 \, | \, U_s = u_s \} = \frac{\exp(u_{si})}{\sum_{k=1}^{n_s} \exp(u_{sk})} \]

Note if \( z \in \mathcal{Z} \) and \( z_{si} = 1, \) then

\begin{align*} \mathrm{Pr}\{Z_s = z_s \, | \, U_s = u_s\} &= \frac{\exp(u_{si})}{\sum_{k=1}^{n_s} \exp(u_{sk})} \\ &= \frac{\exp(\sum_{k=1}^{n_s} z_{sk} \cdot u_{sk})}{\sum_{k=1}^{n_s} \exp(u_{sk})} \end{align*}

and \[ \mathrm{Pr}\{Z = z \, | \, U = u \} = \Pi_{s=1}^S \frac{\exp(\sum_{k=1}^{n_s} z_{sk} \cdot u_{sk})}{\sum_{k=1}^{n_s} \exp(u_{sk})}. \]
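Conditional on one treated unit per stratum, the selection probabilities are a softmax of the unobserved covariates within the stratum. A minimal numeric sketch:

```
import numpy as np

def assignment_probs(u_s):
    """Pr{Z_si = 1 | U_s = u_s} = exp(u_si) / sum_k exp(u_sk)."""
    w = np.exp(u_s - np.max(u_s))    # shift by the max for numerical stability
    return w / w.sum()

# Homogeneous u: selection is uniform, as in a randomized experiment.
print(assignment_probs(np.array([0.3, 0.3, 0.3])))   # [1/3, 1/3, 1/3]

# Skewed u: the first unit is exp(0.7 - (-0.7)), about 4 times,
# as likely to select treatment as either of the others.
print(assignment_probs(np.array([0.7, -0.7, -0.7])))
```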

## 7 The p-value is Increasing in the Unobserved Covariates

The test statistic has the form given in Equation (\ref{eqn:sign_score}). Let \( B_s(r, Z) = \sum_{i=1}^{n_s} c_{si}( r) \cdot Z_{si}, \) so \( t(r, Z) = \sum_{s=1}^S q_s( r) \cdot B_s(r, Z). \)

\( B_s \) is binary, so

\begin{align} & \mathrm{Pr}\{B_s = 1 \, | \, Y(0) = y(0), Y(1) = y(1), X = x, U = u \} \nonumber \\ &= \mathbf{E}[B_s \, | \, Y(0) = y(0), Y(1) = y(1), X = x, U = u] \nonumber \\ &= \sum_{i=1}^{n_s} c_{si} \cdot \mathbf{E}[Z_{si} \, | \, Y(0) = y(0), Y(1) = y(1), X = x, U = u] \nonumber \\ &= \sum_{i=1}^{n_s} c_{si} \cdot \mathrm{Pr}\{Z_{si} = 1 \, | \, Y(0) = y(0), Y(1) = y(1), X = x, U = u\} \nonumber \\ &= \frac{\sum_{i=1}^{n_s} c_{si} \cdot \exp(u_{si})}{\sum_{i=1}^{n_s} \exp(u_{si})} = \mathrm{Pr}\{B_s = 1 \, | \, U_s = u_s \}. \label{eqn:b_s_expectation} \end{align}

This is a weighted average over the signs in the stratum, using \( \exp(u_{si}) \) as the weights. We next show \( B \) is stochastically increasing in \( U \) with respect to the pre-order, \( \preceq_U, \) defined as: \( u \preceq_U u^\prime \) if, for all elements having \( c_{si} = 1, \) \( u_{si} \leq u_{si}^\prime, \) and for all elements having \( c_{si} = 0, \) \( u_{si}^\prime \leq u_{si}. \)

Recall \( B \) is stochastically increasing in \( U \) with respect to \( \preceq_U \) if \( \mathbf{E}[f(B) \, | \, U=u] \) is increasing in \( u \) (with respect to \( \preceq_U \)) for every \( f \) increasing. (Ahmed, Leon, and Proschan 1981, Lemma 3.3) provides an approach for proving this: \( B_1, \ldots, B_S \) must be conditionally independent given \( U \) (which follows from independent treatment selection across strata) and each \( B_s \) must be stochastically increasing in \( U. \)

Since \( B_s \) is binary, we need only consider binary functions: \( f(B_s) = a_0 \) if \( B_s = 0 \) and \( a_1 \) otherwise. Since \( f \) is increasing, \( a_0 \leq a_1. \) Then

\begin{align*} \mathbf{E}[f(B_s) \, | \, U = u] &= \mathbf{E}[f(B_s) \, | \, U_s = u_s] \\ &= a_0 \cdot \mathrm{Pr}\{ B_s = 0 \, | \, U_s = u_s \} + a_1 \cdot \mathrm{Pr}\{ B_s = 1 \, | \, U_s = u_s \} \\ &= a_0 + (a_1 - a_0) \cdot \mathrm{Pr}\{ B_s = 1 \, | \, U_s = u_s \}. \end{align*}

Now we need only show if \( u \preceq_U u^\prime, \) then \( \mathbf{E}[f(B_s) \, | \, U = u] \leq \mathbf{E}[f(B_s) \, | \, U = u^\prime], \) or

\begin{align*} & a_0 + (a_1 - a_0) \cdot \mathrm{Pr}\{ B_s = 1 \, | \, U_s = u_s \} \\ &\leq a_0 + (a_1 - a_0) \cdot \mathrm{Pr}\{ B_s = 1 \, | \, U_s = u_s^\prime \}. \end{align*}

Recall \( a_0 \leq a_1 \) since \( f \) is increasing. The relation is trivial when \( a_0 = a_1, \) so assume \( a_0 < a_1. \) The relation is equivalent to \( \mathrm{Pr}\{ B_s = 1 \, | \, U_s = u_s \} \leq \mathrm{Pr}\{ B_s = 1 \, | \, U_s = u_s^\prime \}. \)

Recall from Equation (\ref{eqn:b_s_expectation}) that \( \mathrm{Pr}\{B_s = 1\, | \, U_s = u_s \} \) is a weighted average of the signs in stratum \( s, \) with weights determined by \( u. \) When \( u \preceq_U u^\prime, \) \( u^\prime \) puts more weight on the entries with \( c_{si} = 1 \) and less weight on the entries with \( c_{si} = 0, \) which establishes the relationship.

We have shown \( B \) is stochastically increasing in \( U \) with respect to \( \preceq_U. \) As a result, for any increasing \( f, \) if \( u \preceq_U u^\prime, \) then \( \mathbf{E}[ f(B) \, | \, U=u] \leq \mathbf{E}[ f(B) \, | \, U=u^\prime]. \) Consider \[ f_k(B) = I\left\{ \sum_{s=1}^S q_s \cdot B_s \geq k \right\}. \] Then \( f_k \) is increasing in \( B \) since \( q_s \geq 0. \) As a result, \[ \mathbf{E}[ f_k(B) \, | \, U=u] = \mathrm{Pr}\{ t(r, Z) \geq k \, | \, U=u \} \leq \mathrm{Pr}\{ t(r, Z) \geq k \, | \, U=u^\prime \} \] for all \( k. \) When \( k \) equals the observed value of the test statistic, this is the one-sided p-value against the null hypothesis. Thus, the p-value is increasing in the unobserved covariate.
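The key step, that \( \mathrm{Pr}\{B_s = 1 \, | \, U_s = u_s\} \) increases under \( \preceq_U, \) is easy to check numerically; toy values:

```
import numpy as np

def prob_b_one(c_s, u_s):
    """Weighted average of the signs, with weights exp(u_si)."""
    w = np.exp(u_s)
    return float((c_s * w).sum() / w.sum())

c_s = np.array([1, 0, 0])
u = np.array([0.0, 0.0, 0.0])
u_prime = np.array([0.5, -0.5, 0.0])   # u <=_U u': larger where c=1, no larger where c=0

print(prob_b_one(c_s, u))        # 1/3 -- uniform weights
print(prob_b_one(c_s, u_prime))  # larger: more weight on the c=1 unit
```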

## 8 Extreme Configurations

If we knew the unobserved covariates were bounded in some way, we
could calculate bounds on the p-value. The unobserved
covariates may *not* be bounded. Still, let’s explore bounds of the
form: \( -\frac{\gamma}{2} \leq u_{si} \leq \frac{\gamma}{2}. \) Letting
\( \gamma \to \infty \) imposes no limitation on \( U. \)

Such bounds have an interpretation in terms of relative treatment probabilities. For two units in the same stratum, \[ -\gamma \leq u_{si} - u_{sj} = \log \frac{\mathrm{Pr}\{ Z_{si} = 1 \, | \, U_s = u_s \}}{\mathrm{Pr}\{ Z_{sj} = 1 \, | \, U_s = u_s \}} \leq +\gamma. \] Equivalently, the ratio of the probabilities that one unit rather than another selects treatment, within a stratum, is bounded between \( 1/\Gamma \) and \( \Gamma, \) where \( \Gamma = \exp(\gamma). \)

Let \( u^\ell \) be defined as having entries \( u_{si}^\ell = (1 - 2 \cdot c_{si}) \cdot \gamma / 2 \) and let \( u^u \) be defined as having entries \( u_{si}^u = ( 2 \cdot c_{si} - 1) \cdot \gamma / 2 = -u_{si}^\ell. \) One can show \( u^\ell \preceq_U u \preceq_U u^u \) for any \( u \) satisfying the specified bounds.

Then

\begin{align*} \mathrm{Pr}\{ t(r, Z) \geq k \, | \, U=u^\ell \} &\leq \mathrm{Pr}\{ t(r, Z) \geq k \, | \, U=u \} \\ &\leq \mathrm{Pr}\{ t(r, Z) \geq k \, | \, U=u^u \} \end{align*}

and the p-value is bounded by the extremes corresponding to \( u^\ell \) and \( u^u. \)

Let

\begin{align*} p^u(\gamma) &:= \mathrm{Pr}\{ t(r, Z) \geq t_\textrm{obs} \, | \, U=u^u \} \\ &= \sum_{z \in \mathcal{Z}} I\{ t(r, z) \geq t_\textrm{obs} \} \cdot \mathrm{Pr}\{ Z = z \, | \, U=u^u \} \\ &= \sum_{z \in \mathcal{Z}} I\{ t(r, z) \geq t_\textrm{obs} \} \cdot \Pi_{s=1}^S \frac{\exp(\sum_{k=1}^{n_s} z_{sk} \cdot (2 \cdot c_{sk} - 1) \cdot \gamma / 2)}{\sum_{k=1}^{n_s} \exp((2 \cdot c_{sk} - 1) \cdot \gamma / 2)} \\ &= \sum_{z \in \mathcal{Z}} I\{ t(r, z) \geq t_\textrm{obs} \} \cdot \Pi_{s=1}^S \frac{\exp(-\gamma/2) \cdot \exp(\gamma \cdot \sum_{k=1}^{n_s} z_{sk} \cdot c_{sk} )}{\sum_{k=1}^{n_s} \exp(-\gamma/2) \cdot \exp(\gamma \cdot c_{sk})} \\ &= \sum_{z \in \mathcal{Z}} I\{ t(r, z) \geq t_\textrm{obs} \} \cdot \Pi_{s=1}^S \frac{\exp(\gamma \cdot \sum_{k=1}^{n_s} z_{sk} \cdot c_{sk} )}{\sum_{k=1}^{n_s} \exp(\gamma \cdot c_{sk})}. \end{align*}

Define \( p^\ell(\gamma) \) similarly. One can show \( p^u(\gamma) \to 1 \) and \( p^\ell(\gamma) \to 0 \) as \( \gamma \to \infty: \) without bounds on the unobserved factors, no observational study provides evidence of a treatment effect.
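For small examples we can evaluate \( p^u(\gamma) \) exactly by enumerating every assignment in \( \mathcal{Z}, \) using the product formula above. A sketch on toy data (in a real study we would use the large-sample approximation instead):

```
import itertools
import math

# Toy study: 4 strata of 3 units; c_s gives the signs in stratum s.
c = [(1, 0, 0), (1, 1, 0), (0, 1, 0), (1, 0, 1)]
q = [1, 1, 1, 1]
t_obs = 3   # say the treated unit had c = 1 in three of the four strata

def p_upper(gamma):
    """Exact p^u(gamma), summing over every z in the assignment set."""
    total = 0.0
    for picks in itertools.product(range(3), repeat=len(c)):   # treated index per stratum
        t = sum(q_s * c_s[i] for q_s, c_s, i in zip(q, c, picks))
        if t >= t_obs:
            prob = 1.0
            for c_s, i in zip(c, picks):
                w = [math.exp(gamma * cs) for cs in c_s]
                prob *= w[i] / sum(w)
            total += prob
    return total

print(p_upper(0.0))    # 24/81, about 0.296: the randomization p-value
print(p_upper(20.0))   # close to 1: unbounded confounding explains anything
```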

## 9 The Sensitivity of an Observational Study

Observational studies often report a p-value, implicitly claiming the adjustments for observed covariates remove all bias. This p-value equals \( p^u(0). \) Assuming this p-value is below the threshold for significance, \( \alpha, \) there is some maximum value of \( \gamma \) having \( p^u(\gamma) \leq \alpha. \) Call this value, \( \gamma^\star. \) As long as the unobserved factors lie inside the bounds implied by \( \gamma^\star, \) we would still reject the null hypothesis, and the study conclusions would remain essentially the same. If the unobserved factors deviate from these bounds, the conclusions would be in question.

Even if \( \gamma^\star \) is large, remaining biases may be larger. There is no meaningful threshold for \( \gamma^\star \) guaranteeing a reliable conclusion. Then what purpose does \( \gamma^\star \) serve?

*Quantifying* sensitivity to hidden biases is the first step to
*managing* sensitivity to hidden biases. Just as many factors affect
the statistical power of a randomized experiment, such as sample size,
effect size, study design, and test statistic, so too do these and
other factors influence sensitivity to hidden biases
(Rosenbaum 2011). Knowing how to quantify sensitivity,
we can **design** an observational study to reduce sensitivity to hidden
biases (Rosenbaum 2020).

Domain expertise may reduce the plausibility of large unobserved biases. Rosenbaum calls these “quasi-experimental devices” (Rosenbaum 1984, 1987b, 1989). Combined with a study having a large \( \gamma^\star, \) these devices build confidence in the study conclusions.

What about a study that is not statistically significant, even ignoring the unobserved covariates? This is equivalent to \( p^u(0) > \alpha. \) It may be wishful thinking, but the unobserved covariates may conceal a true effect just as easily as they may exaggerate a trivial one. There is some minimum amount of skew the unobserved covariates would need to introduce for this to happen.

We can incorporate this idea into the definition of sensitivity as well, defining \[ \gamma^\star = \begin{cases} \sup\{ \gamma : p^u(\gamma) \leq \alpha \} & \textrm{if } \mathrm{Pr}\{ t(r, Z) \geq t_\textrm{obs} \, | \, U=0 \} \leq \alpha \\ -\inf\{ \gamma : p^\ell(\gamma) \leq \alpha \} & \textrm{otherwise.} \end{cases} \]

Rosenbaum prefers to express sensitivity in terms of \( \Gamma^\star = \exp(\gamma^\star). \)
Larger values of \( \Gamma^\star \) represent stronger evidence. When
\( \Gamma^\star > 1, \) the unadjusted p-value, which assumes no
hidden biases, is below the threshold for significance. Small
remaining biases cannot invalidate the study conclusions.
When \( \Gamma^\star < 1, \) hidden biases would have to *conceal* a
treatment effect to be plausible, weak evidence indeed!

## 10 The Capacity of an Observational Study

The smallest \( p^u(\gamma) \) can be, for any value \( \gamma, \) is
when there is just one unit in each stratum with \( c_{si} = 1, \)
and when that unit is the treated unit. In that case,
\( B_s = \sum_{i=1}^{n_s} c_{si} \cdot z_{si} = 1 \) in each stratum,
and that is the *only* such configuration of treatment assignments.
Since the test statistic is increasing in \( B, \) the observed test
statistic achieves its largest possible value, so

\begin{align*} p_\textrm{min}^u(\gamma) &= \Pi_{s=1}^S \frac{\exp(\gamma)}{\exp(\gamma) + (n_s - 1)} \\ &= \Pi_{s=1}^S \frac{1}{1 + (n_s - 1) \cdot \exp(-\gamma)} \\ &= \left( \Pi_{s=1}^S \left( 1 + (n_s - 1) \cdot \Gamma^{-1} \right) \right)^{-1}. \end{align*}

This formula is increasing in \( \Gamma. \) When \( \Gamma = 1 \) (corresponding to \( \gamma = 0 \) and no unobserved confounders), \( p_\textrm{min}^u(\gamma) = 1 / \Pi_s n_s, \) which is presumably less than the significance threshold, \( \alpha, \) for any reasonable sample size. As \( \gamma \to \infty, \) \( p_\textrm{min}^u(\gamma) \to 1, \) just as we saw before. There is some intermediate value where \( p_\textrm{min}^u(\gamma) = \alpha. \)

Above this value, it would be impossible to guarantee a statistically
significant result, considering all possible confounder values. No
matter how strong the observed result, it could always be explained by
a suitably chosen arrangement of unobserved covariates. Call this
threshold the *capacity* of the study.

With arbitrary stratum sizes, it would be straightforward to calculate
the capacity of a study using bisection. When the number of units in
each stratum is constant, there is a simple formula:
\[
\mathrm{Capacity} = \frac{n_s - 1}{\alpha^{-1/S} - 1}.
\]
Compare with Equation (9) of our post on matched pairs, which
corresponds to \( n_s = 2. \) This provides one simple and direct insight into the
value of matching with multiple controls: it increases the capacity of
the study. Even going from 1 control to 2 (\( n_s = 2 \to 3 \))
*doubles* the capacity of the study.
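
As a quick numerical check of the doubling claim, here is a sketch of the constant-stratum-size formula (the function name `capacity` is ours):

```python
def capacity(n_s, S, alpha=0.05):
    """Capacity of a study with S strata of constant size n_s: the
    Gamma at which even the most favorable data configuration only
    just reaches significance level alpha.

        Capacity = (n_s - 1) / (alpha**(-1/S) - 1)
    """
    return (n_s - 1) / (alpha ** (-1.0 / S) - 1.0)
```

Since the formula is proportional to \( n_s - 1, \) `capacity(3, S)` is exactly twice `capacity(2, S)` for any \( S; \) capacity also grows with the number of strata \( S. \)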

## 11 Sensitivity Analysis with Large Sample Sizes

When the number of strata, \( S, \) is large enough to justify application of the central limit theorem, we may use a normal approximation to the test statistic in evaluating bounds on significance levels. (Convergence to normality will be slower when some values of \( q_s \) are much larger than others.)

Then

\begin{align*} \mu_u(\gamma) &= \mathbf{E}\left[ t(r, Z) \, | \, U = u^u \right] \\ &= \sum_{s=1}^S q_s(r) \cdot \mathbf{E}\left[ B_s(r, Z) \, | \, U = u^u \right] \\ &= \sum_{s=1}^S q_s \cdot \frac{\sum_{i=1}^{n_s} c_{si} \cdot \exp(c_{si} \cdot \gamma)}{\sum_{i=1}^{n_s} \exp(c_{si} \cdot \gamma)} \\ &= \sum_{s=1}^S q_s \cdot \rho_s^u(\gamma), \end{align*}

where \( \rho_s^u(\gamma) := \sum_{i=1}^{n_s} c_{si} \cdot \exp(c_{si} \cdot \gamma) / \sum_{i=1}^{n_s} \exp(c_{si} \cdot \gamma). \)

Since \( B_s \) is binary, \[ \sigma_u^2(\gamma) = \sum_{s=1}^S q_s^2 \cdot \rho_s^u(\gamma) \cdot ( 1 - \rho_s^u(\gamma)). \] Then \( (t - \mu_u(\gamma)) / \sigma_u(\gamma) \) has an asymptotic distribution stochastically smaller than an \( \mathcal{N}(0, 1) \) distribution. Thus, we may use \[ 1 - \Phi\left( \frac{t_\textrm{obs} - \mu_u(\gamma)}{\sigma_u(\gamma)} \right) \] to approximate \( p^u(\gamma). \)
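
A sketch of this approximation, assuming the caller supplies the stratum scores \( q_s \) and the per-stratum values \( c_{si} \) at the extreme configuration \( u^u \) (the function name is ours):

```python
import math

def approx_p_upper(gamma, q, c, t_obs):
    """Normal approximation to the upper-bound p-value p^u(gamma).

    q[s] is the stratum score q_s; c[s][i] is c_si, the extreme
    unobserved covariate for unit i of stratum s; t_obs is the
    observed value of the test statistic.
    """
    mu, var = 0.0, 0.0
    for q_s, c_s in zip(q, c):
        weights = [math.exp(gamma * c_si) for c_si in c_s]
        # rho_s^u(gamma) = sum_i c_si exp(c_si gamma) / sum_i exp(c_si gamma)
        rho = sum(c_si * w for c_si, w in zip(c_s, weights)) / sum(weights)
        mu += q_s * rho                       # E[t | U = u^u]
        var += q_s ** 2 * rho * (1.0 - rho)   # variance, since B_s is binary
    z = (t_obs - mu) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # 1 - Phi(z)
```

With \( \gamma = 0, \) each \( \rho_s^u \) reduces to the fraction of units in the stratum with \( c_{si} = 1, \) and an observed statistic equal to \( \mu_u(0) \) gives an approximate p-value of 0.5, as expected.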

## 12 When are Observational Studies Persuasive?

Rosenbaum often cites a meta-analysis of the effects of smoking
(Hammond 1964), influential in the path towards medical
consensus that smoking causes lung cancer. Sensitivity analysis for
observational studies was in its infancy in 1964, and the
\( \Gamma^\star \) notation was not used in Hammond’s paper.
Rosenbaum reports the study was equivalent to \( \Gamma^\star \approx 6 \)
(Rosenbaum 1987a, sec. 3.2).
Rosenbaum uses this example to point out that observational studies *can*
be persuasive. There is no doubt today that smoking causes lung cancer, yet
there has never been a randomized experiment on humans.

A large value of \( \Gamma^\star \) was not the only reason to believe the observational studies. Randomized experiments on mice had demonstrated a causal association between smoking and cancer (in mice). Human autopsies revealed lung damage among smokers, regardless of whether they died of lung cancer (Rosenbaum 2019, pg. 125). The medical community felt confident they understood the demographic and behavioral factors predictive of smoking, and researchers adjusted for these factors in their studies. Studies involving different groups of people and conducted by different teams consistently found an association between smoking and cancer.

Yet some, including Sir Ronald Fisher, dismissed this extensive
evidence as inadequate (Stolley 1991). Critics argued an
observational study can *never* be persuasive, because we can never be
sure we have included all relevant factors in the analysis.
Cornfield et al. (1959) put limits on this kind of
criticism, pointing out that it wasn't enough for unobserved factors merely to
*exist.* To invalidate a causal finding, they had to be sufficiently
predictive of both treatment and outcome. The larger this threshold,
and the better understood the treatment selection mechanism, the less
plausible the criticism. Rosenbaum cites Cornfield’s study as the
first sensitivity analysis.

Claiming adjustments for observed covariates remove all bias from an observational study is naive. Automatically dismissing as irrelevant the evidence in an observational study is vacuous. A sensitivity analysis provides a middle ground. A large \( \Gamma^\star \) warrants serious consideration but is not persuasive by itself. A study must also convince its audience the remaining biases are small. For that, we need to understand the treatment selection process, and include quasi-experimental devices like multiple control groups (Rosenbaum 1987b) and known effects (Rosenbaum 1989).

## 13 Summary

A well-executed randomized experiment provides an unbiased estimate of the effect of a treatment on an outcome. A p-value quantifies the “strength-of-evidence” in such a study.

Observational studies feature greater uncertainty from the possible influence of unobserved factors. While we can reduce bias from observed covariates through matching, weighting, or regression adjustment, we cannot adjust for unobserved covariates.

Adjusting for observed factors does *not* elevate an observational
study to the reliability of an experiment. P-values are *not*
appropriate measures of the strength of evidence in an observational
study. Instead, sensitivity analysis allows us to identify the
magnitude of hidden biases that would be necessary to invalidate study
conclusions. This leads to a strength-of-evidence metric appropriate
for an observational study.

## 14 References

*The Annals of Statistics* 9 (1). Institute of Mathematical Statistics: pg. 168–76. http://www.jstor.org/stable/2240880.

Cornfield, Jerome, William Haenszel, E. Cuyler Hammond, Abraham M. Lilienfeld, Michael B. Shimkin, and Ernst L. Wynder. 1959. "Smoking and Lung Cancer: Recent Evidence and a Discussion of Some Questions." *Journal of the National Cancer Institute* 22 (1). Oxford University Press: pg. 173–203.

Gastwirth, Joseph L., Abba M. Krieger, and Paul R. Rosenbaum. 2000. "Asymptotic Separability in Sensitivity Analysis." *Journal of the Royal Statistical Society. Series B (Statistical Methodology)* 62 (3). [Royal Statistical Society, Wiley]: pg. 545–55. http://www.jstor.org/stable/2680696.

Hammond, E. Cuyler. 1964. "Smoking in Relation to Mortality and Morbidity: Findings in the First Thirty-Four Months of Follow-up in a Prospective Study Started in 1959." *Journal of the National Cancer Institute* 32 (5). Oxford University Press: pg. 1161–88.

*Journal of the American Statistical Association*. JSTOR, pg. 41–48.

*Looking Back: Proceedings of a Conference in Honor of Paul W. Holland*, pg. 87–114. Springer.

Rosenbaum, Paul R. 1987a. "Sensitivity Analysis to Certain Permutation Inferences in Matched Observational Studies." *Biometrika* 74 (1). [Oxford University Press, Biometrika Trust]: pg. 13–26. http://www.jstor.org/stable/2336017.

Rosenbaum, Paul R. 1987b. "The Role of a Second Control Group in an Observational Study." *Statistical Science* 2 (3). Institute of Mathematical Statistics: pg. 292–306. doi:10.1214/ss/1177013232.

Rosenbaum, Paul R. 1988. "Sensitivity Analysis for Matching with Multiple Controls." *Biometrika* 75 (3). [Oxford University Press, Biometrika Trust]: pg. 577–81. http://www.jstor.org/stable/2336609.

Rosenbaum, Paul R. 2002. "Attributing Effects to Treatment in Matched Observational Studies." *Journal of the American Statistical Association* 97 (457). [American Statistical Association, Taylor & Francis, Ltd.]: pg. 183–92. http://www.jstor.org/stable/3085773.

*Journal of Educational and Behavioral Statistics* 29 (3). [American Educational Research Association, Sage Publications, Inc., American Statistical Association]: pg. 343–67. http://www.jstor.org/stable/3701358.