# Attributable Effects

In a previous post, we discussed why randomization provides a “reasoned basis for inference” in an experiment. Randomization creates well-defined potential outcomes and quantifies the “strength of evidence” in an experiment, generalizing the mathematical technique of proof-by-contradiction. While randomization cannot prove causation to the level of logical certainty, it is a powerful tool for investigating causal relationships.

Randomization not only quantifies the plausibility of a causal effect but also allows us to infer something about the size of that effect. In this post, I will focus on a particular setting where effect size estimates are justified solely by randomization, requiring no models or assumptions. This post largely draws from (Rosenbaum 2001).

## Binary Outcomes and the Sharp Null Hypothesis of No Effect

Throughout this post, we will consider only binary outcomes. Binary outcomes are common in industry. A person either purchases or not; they retain or not. We will first consider the plausibility of no effect whatsoever before considering other possibilities.

Suppose you have just finished an A/B test involving \( N \) people, \( n_T \) of whom were selected to receive a treatment. Among those who received treatment, \( s_T \) exhibited a positive response (that is, they bought something, or they retained, whatever). Among the \( n_C := N - n_T \) who did not receive the treatment, \( s_C \) exhibited a positive response. Write \( S := s_C + s_T \) for the total number of positive responses.

The results of this experiment can be summarized in a *contingency
table*.

Group | Successes | Trials |
---|---|---|
Control | \( s_C \) | \( n_C \) |
Treatment | \( s_T \) | \( n_T \) |
Total | \( S \) | \( N \) |

I like working with real numbers, so I will use the following example throughout. Two thousand people are identified as a candidate audience for a marketing campaign. One thousand are selected at random to be held back from marketing (the control group); the other thousand will be exposed to the campaign (treatment). Among the control group, one hundred people purchased the product being marketed. In the treatment group, one hundred and thirty purchased.

Group | Successes | Trials |
---|---|---|
Control | 100 | 1,000 |
Treatment | 130 | 1,000 |
Total | 230 | 2,000 |

*In most real A/B tests, people are assigned to treatment independently
and with equal probability, so the number of people assigned to
treatment (or control) is itself random. But sample sizes are typically
so large that this variation in group sizes is negligible. Henceforth I
will treat \( n_C \) and \( n_T \) (and \( N \)) as fixed.*

With the hard numbers in the second table, we see equal sample sizes
in both groups and more successes in the treatment group than in the
control group, so it looks like the treatment may have a positive
effect. *(Nothing in this approach requires equal sample sizes or
positive effects; I just like working with hard numbers.)*

Suddenly a mysterious stranger appears. He whispers, “The treatment had no effect. Not one single person had their purchase behavior influenced by marketing. This is all just an illusion.”

You don’t like mysterious strangers in general, and this particular mysterious stranger seems especially arrogant, thinking he knows why anyone behaves the way they do, so you decide to take him down a notch. You say:

*“Okay stranger. If what you’re saying is true, then really there are
only two types of people. Those who purchase and those who don’t.*

*“Those 130 people who bought in the treatment group? You’re saying
they would have bought even had they been assigned to control. In
fact, no matter how we assigned people to treatment, the 230 who
purchased would still have purchased. And the others would not have
purchased.*

*“All we did when we assigned 1,000 people to treatment was select
1,000 at random from a known set of people. This set has 2,000 people,
230 of whom had already made up their minds to purchase. The remainder
had already made up their minds not to purchase, and not one person
could have their minds changed by marketing. It’s just a coincidence
that 130 of the purchasers were selected for the treatment group.*

*“And since this is exactly the characterization of a hypergeometric
distribution, if what you’re saying is true, then the number of
successes observed in the treatment group has a known distribution.”*

You proceed to open up Python and calculate:

```
>>> from scipy.stats import hypergeom
>>> [N, n_T, s_T, s_C] = [2000, 1000, 130, 100]
>>> rv = hypergeom(N, s_C + s_T, n_T)
>>> rv.sf(s_T - 1) # - 1 since sf is > but p-value is >=
0.020952274191867567
```

After contemplating the pros and cons of one-sided vs two-sided tests, you decide to double the result and call it a p-value against the two-sided null hypothesis of no effect.

```
>>> 2 * rv.sf(s_T - 1)
0.041904548383735134
```

*(In general, you can perform a two-sided test by conducting two
one-sided tests, of inferiority and superiority, then doubling the
smaller p-value and truncating at 1
(Cox et al. 1977; Rosenbaum 2019, sec. 3, Footnote 3). When the
observed outcome is larger than expected under H0, the test of
superiority leads to the smaller p-value, so we can just double that
one. When in doubt, calculate:*

```
>>> min(1.0, 2 * min(rv.cdf(s_T), rv.sf(s_T - 1)))
0.041904548383735134
```

*This annoying formula handles the discreteness of the hypergeometric
distribution.)*
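Since this doubled one-sided test recurs below, it can be handy to wrap it in a small function (the name `two_sided_p` is mine, not standard):

```
from scipy.stats import hypergeom

def two_sided_p(N, n_T, s_C, s_T):
    """Two-sided p-value against the sharp null of no effect:
    double the smaller one-sided tail and truncate at 1."""
    rv = hypergeom(N, s_C + s_T, n_T)
    # sf(s_T - 1) = P(X >= s_T); cdf(s_T) = P(X <= s_T)
    return min(1.0, 2 * min(rv.cdf(s_T), rv.sf(s_T - 1)))

print(two_sided_p(2000, 1000, 100, 130))  # 0.0419...
```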

While this does not disprove the stranger’s claim to the level of
logical certainty, you know that **nothing** can do that. The
**possibility** of no effect can never be disproven. And so you merely
reject the stranger’s claim as implausible, and he vanishes in a poof
of smoke. Only then do you realize the stranger was really just a
framing device to summarize the main points of the previous post.

These main points are:

- We can **never** disprove the null hypothesis of no effect to the level of logical certainty.
- A sharp null hypothesis is not something we believe or disbelieve. Rather, it creates a world we can explore. In combination with the random assignment of people to treatment, a sharp null hypothesis allows us to make probabilistic statements about our observations. The more unlikely our observations, the more implausible the null hypothesis. Yet this probabilistic approach is never definitive.
- If our observations are not particularly unlikely, in no sense does this constitute evidence the null hypothesis is true. A null hypothesis can never be proven true; we can only fail to reject it as false.

## Other Sharp Null Hypotheses

The role of any null hypothesis is to create a world we can explore. As Stephen King once wrote, “There are other worlds than these.” There is nothing special about the hypothesis of no effect.

As if to prove it, a dragon appears out of nowhere. The dragon says,
“That other stranger was a charlatan, but I know what’s **really** going
on in this test. I know the counterfactual outcomes for each person in
the test. It would take a long time to list them out one by one, so
instead I’m just going to summarize the control outcomes in a
contingency table.”

Group | Successes under Control | Trials |
---|---|---|

Control | 100 | 1,000 |

“Treatment” | 110 | 1,000 |

Total | 210 | 2,000 |

As you start to explore this world the dragon created for you, you notice there are 20 fewer successes in the treatment group than there were before. The dragon is alleging that, had the treatment group instead been held back from marketing, there would have been 20 fewer purchases. (This second row isn’t really a “treatment group” then, but rather a second control group. It is, however, the same people we’ve been calling the treatment group, so we’ll maintain the label.) That is, the dragon is alleging that marketing caused 20 purchases.

“If marketing caused only 20 purchases, then why are there still more
purchases in the treatment group than in control?” you ask
suspiciously. But we need not *believe* a null hypothesis in order to
*explore* it, so let’s continue.

You next notice the first row of the contingency table is the same as
before. If it were otherwise, you could immediately dismiss the
dragon’s claim. The dragon claims this is the contingency table of
control outcomes, but you *observe* the control outcomes in the
control group. If the dragon said anything else, you could disprove
his claim (to the level of logical certainty). Only claims involving
counterfactual outcomes require speculation!

You realize there are in principle \( 2^{n_T} \) possible counterfactual outcomes for the treatment group, but there is a lot of redundancy. Everyone exhibiting a positive response in the treatment group is identical from the perspective of a contingency table, as are those exhibiting a negative response. A claim about counterfactual outcomes for the treatment group is effectively a claim about how many observed successes would have been failures, had the individuals not received treatment, and how many observed failures would have been successes. There are \( (s_T + 1) \times (n_T - s_T + 1),\) not \( 2^{n_T} \) distinct hypotheses.

But only the *net* effect shows up in the contingency table of alleged
control outcomes. The hypotheses:

- 30 successes in the treatment group would have been failures, and 10 failures would have been successes, had this group not received treatment; and,
- 20 successes in the treatment group would have been failures, but no failures would have been successes,

both involve a net 20 fewer successes under control. Both correspond to an alleged 20 purchases caused by marketing. Both hypotheses lead to the same contingency table of control outcomes. So there is still redundancy among the \( (s_T + 1) \times ( n_T - s_T + 1) \) hypotheses. Two hypotheses corresponding to the same net effect lead to the same contingency table.

*(Rosenbaum 2001) uses the
term “attributable effect” instead of “net effect” since it is the
number of successes in the treatment group that were caused by, or
attributable to, the treatment.*

There are only \( n_T + 1\) distinct hypotheses then. The net effect can involve at most \( s_T \) successes becoming failures (for a contingency table with a 0 entry), or at most \( n_T - s_T \) failures becoming successes, plus a net effect of 0, for a total of \( s_T + (n_T - s_T) + 1= n_T + 1 \) possible net effects.
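To see the scale of this collapse with the example’s numbers, a quick count (variable names are my own):

```
n_T, s_T = 1000, 130

# All individual-level counterfactual patterns for the treatment group:
individual_patterns = 2 ** n_T
# Distinct (successes flipped, failures flipped) pairs:
count_pairs = (s_T + 1) * (n_T - s_T + 1)
# Distinct net effects, from -(n_T - s_T) to +s_T:
net_effects = n_T + 1

print(count_pairs, net_effects)  # 114101 1001
```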

In this table of control outcomes, there really are only two types of people, according to their response under control. So if the purported net effect is correct, then the number of (control) successes in the treatment group follows a hypergeometric distribution. We calculate a two-sided p-value against the hypothesis that the treatment led to 20 additional successes:

```
>>> a0 = 20
>>> rv = hypergeom(N, s_C + s_T - a0, n_T)
>>> min(1.0, 2 * min(rv.cdf(s_T - a0), rv.sf(s_T - a0 - 1)))
0.5115930741739885
```

The claim is plausible, at least from the perspective of the observations. We might have other reasons for considering the claim implausible, such as dragons being generally untrustworthy, but the data alone are insufficient to reject the possibility. We could repeat this exercise for each candidate net effect, from \( -(n_T - s_T) \) to \( +s_T, \) retaining the effects with p-values above, say, 0.05. That gives us a 95% confidence interval on the net effect.

A net effect of 1, corresponding to 1 purchase caused by marketing, has two-sided p-value 0.049, while a net effect of 2 has p-value 0.057. So we reject a net effect of 1 as (barely) implausible, but retain a net effect of 2 as plausible.

A net effect of 55 has p-value 0.057, while a net effect of 56 has p-value 0.047. So we retain a net effect of 55 as plausible, but reject 56 as being inconsistent with the data. A 95% confidence interval has endpoints 2 and 55.
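Nothing stops us from doing this scan directly. A brute-force sketch (the helper name `p_value` is mine), retaining every net effect whose two-sided p-value is at least 0.05:

```
from scipy.stats import hypergeom

N, n_T, s_C, s_T = 2000, 1000, 100, 130

def p_value(a0):
    # Under the null "net effect = a0", the treatment group's control
    # outcomes would show s_T - a0 successes, hypergeometrically distributed.
    rv = hypergeom(N, s_C + s_T - a0, n_T)
    return min(1.0, 2 * min(rv.cdf(s_T - a0), rv.sf(s_T - a0 - 1)))

plausible = [a0 for a0 in range(-(n_T - s_T), s_T + 1) if p_value(a0) >= 0.05]
print(min(plausible), max(plausible))  # 2 55
```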

We don’t even need to evaluate \( n_T + 1 \) hypotheses in order to calculate a confidence interval. In the next section, we will show how calculating a point estimate is just simple algebra. The point estimate by definition is consistent with the data and thus is within the confidence interval.

We know the net effect is at most \( s_T, \) so bisection may be used to find the largest integer between the point estimate and \( s_T \) with a p-value at least equal to 0.05. This is the upper bound on a 95% confidence interval. Similarly, we know the net effect is at least \( -(n_T - s_T), \) so bisection may be used to find the lower bound on the interval. This bisection runs in \( \mathcal{O}(\log n_T ) \) time.
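A sketch of that bisection, assuming (as holds in this example) that the p-value shrinks as the hypothesized net effect moves away from the point estimate:

```
from scipy.stats import hypergeom

N, n_T, s_C, s_T = 2000, 1000, 100, 130
point_estimate = 30  # from the Hodges-Lehmann formula in the next section

def p_value(a0):
    rv = hypergeom(N, s_C + s_T - a0, n_T)
    return min(1.0, 2 * min(rv.cdf(s_T - a0), rv.sf(s_T - a0 - 1)))

def upper_bound(alpha=0.05):
    # Largest a0 in [point_estimate, s_T] with p_value(a0) >= alpha.
    lo, hi = point_estimate, s_T
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if p_value(mid) >= alpha:
            lo = mid
        else:
            hi = mid - 1
    return lo

def lower_bound(alpha=0.05):
    # Smallest a0 in [-(n_T - s_T), point_estimate] with p_value(a0) >= alpha.
    lo, hi = -(n_T - s_T), point_estimate
    while lo < hi:
        mid = (lo + hi) // 2
        if p_value(mid) >= alpha:
            hi = mid
        else:
            lo = mid + 1
    return lo

print(lower_bound(), upper_bound())  # 2 55
```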

We started off with the daunting task of considering \( 2^{n_T} \) hypotheses, but in fact only need to consider \( \mathcal{O}(\log n_T ) \) hypotheses to calculate a point estimate and confidence interval.

## The Hodges-Lehmann Trick

(Hodges and Lehmann 1963) describe a “trick” for calculating a point estimate from a hypothesis test: the point estimate is the hypothesized value under which the observed test statistic equals its expected value under the null hypothesis.

For a null hypothesis that the net effect is \( A_0, \) the number of successes in the treatment group (had they been assigned to control) is \( s_T - A_0. \) This quantity has a \( \mathrm{Hypergeom}(N, s_C + s_T - A_0, n_T) \) distribution, with expected value \( n_T \cdot (s_C + s_T - A_0) / N. \) We set these quantities equal and solve:

\( \begin{align*} s_T - A_0 &= n_T \cdot (s_C + s_T - A_0) / N \\ \Rightarrow A_0 &= \frac{N \cdot s_T - n_T \cdot (s_C + s_T)}{N - n_T}. \end{align*} \)

Plugging in \( N = 2000, \) \( n_T = 1000, \) \( s_T = 130, \) and
\( s_C = 100, \) we get \( A_0 = 30. \) The p-value associated with
this hypothesis is 1.0, aligned with our claim in the last section
that the point estimate is *always* part of the confidence interval.

When the result of this formula is not an integer, we can calculate the p-values corresponding to the floor and ceiling and take as the point estimate whichever has the larger p-value.
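Putting the algebra and this floor/ceiling rule together, a sketch of the estimator (function names are my own):

```
import math
from scipy.stats import hypergeom

def p_value(N, n_T, s_C, s_T, a0):
    # Two-sided p-value against the null "net effect = a0".
    rv = hypergeom(N, s_C + s_T - a0, n_T)
    return min(1.0, 2 * min(rv.cdf(s_T - a0), rv.sf(s_T - a0 - 1)))

def hodges_lehmann(N, n_T, s_C, s_T):
    # Solve s_T - A0 = n_T * (s_C + s_T - A0) / N for A0.
    a0 = (N * s_T - n_T * (s_C + s_T)) / (N - n_T)
    lo, hi = math.floor(a0), math.ceil(a0)
    if lo == hi:
        return lo
    # Non-integer solution: take the neighbor with the larger p-value.
    return max(lo, hi, key=lambda a: p_value(N, n_T, s_C, s_T, a))

print(hodges_lehmann(2000, 1000, 100, 130))  # 30
```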

## Summary and Further Reading

In our last post, we discussed how a sharp null hypothesis, in
combination with the randomization used to conduct the experiment,
creates a world we can explore. Generalizing the technique of proof by
contradiction, sufficiently unlikely outcomes lead us to reject the
hypothesis in question. In this post, we saw this procedure is not
limited to the sharp null hypothesis of no effect, but can be used for
*any* hypothesized pattern of counterfactual outcomes.

In the case of binary outcomes, there are \( 2^N \) possible counterfactual hypotheses (or \( 2^{n_T} \) when considering only the treatment group). Because of simplifications enabled by the use of contingency tables, we do not need to evaluate \( 2^N \) hypotheses, but rather only \( \mathcal{O}(\log n_T) \) to calculate a point estimate and confidence interval on the net (or attributable) effect.

This procedure did not assume the units involved in the experiment were a sample from a population; neither did it assume any model for the outcome. It only relied on the randomization used to conduct the experiment.

Our discussion focused on the treatment group, but the same procedure
could be applied to the control group. The net effect would then be
interpreted as the “opportunity cost” of running the experiment: the
number of additional successes we *would* have observed, had everyone
been exposed to the treatment.
(Rigdon and Hudgens 2015) performed both analyses and
combined the results to estimate the average treatment effect (across
both groups).

(Rosenbaum 2002) and (Rosenbaum 2003) applied similar approaches to estimate effect sizes for continuous outcomes, but I find the results hard to interpret. In contrast, the net or attributable effect introduced in (Rosenbaum 2001) and discussed here is exactly the quantity most relevant to characterizing the impact of a marketing campaign.

## References

Cox, D. R. 1977. “The Role of Significance Tests.” *Scandinavian Journal of Statistics* 4 (2): 49–70. http://www.jstor.org/stable/4615652.

Hodges, J. L., and E. L. Lehmann. 1963. “Estimates of Location Based on Rank Tests.” *The Annals of Mathematical Statistics* 34 (2): 598–611. http://www.jstor.org/stable/2238406.

Rigdon, Joseph, and Michael G. Hudgens. 2015. “Randomization Inference for Treatment Effects on a Binary Outcome.” *Statistics in Medicine* 34 (6): 924–35.

Rosenbaum, Paul R. 2001. “Effects Attributable to Treatment: Inference in Experiments and Observational Studies with a Discrete Pivot.” *Biometrika* 88 (1): 219–31. http://www.jstor.org/stable/2673680.

Rosenbaum, Paul R. 2002. “Attributing Effects to Treatment in Matched Observational Studies.” *Journal of the American Statistical Association* 97 (457): 183–92. http://www.jstor.org/stable/3085773.

Rosenbaum, Paul R. 2003. “Exact Confidence Intervals for Nonconstant Effects by Inverting the Signed Rank Test.” *The American Statistician* 57 (2): 132–38. http://www.jstor.org/stable/30037246.