Attributable Effects

In a previous post, we discussed why randomization provides a “reasoned basis for inference” in an experiment. Randomization creates well-defined potential outcomes and quantifies the “strength of evidence” in an experiment, generalizing the mathematical technique of proof-by-contradiction. While randomization cannot prove causation to the level of logical certainty, it is a powerful tool for investigating causal relationships.

Randomization not only quantifies the plausibility of a causal effect but also allows us to infer something about the size of that effect. In this post, I will focus on a particular setting where effect size estimates are justified solely by randomization, requiring no models or assumptions. This post largely draws from (Rosenbaum 2001).

Binary Outcomes and the Sharp Null Hypothesis of No Effect

Throughout this post, we will consider only binary outcomes. Binary outcomes are common in industry. A person either purchases or not; they retain or not. We will first consider the plausibility of no effect whatsoever before considering other possibilities.

Suppose you have just finished an A/B test involving \( N \) people, \( n_T \) of whom were selected to receive a treatment. Among those who received treatment, \( s_T \) exhibited a positive response (that is, they bought something, or they retained, whatever). Among the \( n_C := N - n_T \) who did not receive the treatment, \( s_C \) exhibited a positive response.

The results of this experiment can be summarized in a contingency table, where \( S := s_C + s_T \) denotes the total number of successes.

Group       Successes    Trials
Control     \( s_C \)    \( n_C \)
Treatment   \( s_T \)    \( n_T \)
Total       \( S \)      \( N \)

I like working with real numbers, so I will use the following example throughout. Two thousand people are identified as a candidate audience for a marketing campaign. One thousand are selected at random to be held back from marketing (the control group); the other thousand will be exposed to the campaign (treatment). Among the control group, one hundred people purchased the product being marketed. In the treatment group, one hundred and thirty purchased.

Group       Successes    Trials
Control     100          1,000
Treatment   130          1,000
Total       230          2,000

In most real A/B tests, people are assigned to treatment independently and with equal probability, so the number of people assigned to treatment (or control) is itself random. But sample sizes are typically so large that the variation in group sizes is negligible. Henceforth I will treat \( n_C \) and \( n_T \) (and \( N \)) as fixed.

With the hard numbers in the second table, we see we have equal sample sizes in both groups, and more successes in the treatment group than in the control group, so it looks like the treatment may have a positive effect. (Nothing in this approach requires equal sample sizes or positive effects; I just like working with hard numbers.)

Suddenly a mysterious stranger appears. He whispers, “The treatment had no effect. Not one single person had their purchase behavior influenced by marketing. This is all just an illusion.”

You don’t like mysterious strangers in general, and this particular mysterious stranger seems especially arrogant, thinking he knows why anyone behaves the way they do, so you decide to take him down a notch. You say:

“Okay, stranger. If what you’re saying is true, then really there are only two types of people: those who purchase and those who don’t.

“Those 130 people who bought in the treatment group? You’re saying they would have bought even had they been assigned to control. In fact, no matter how we assigned people to treatment, the 230 who purchased would still have purchased. And the others would not have purchased.

“All we did when we assigned 1,000 people to treatment was select 1,000 at random from a known set of people. This set has 2,000 people, 230 of whom had already made up their minds to purchase. The remainder had already made up their minds not to purchase, and not one person could have their mind changed by marketing. It’s just a coincidence that 130 of the purchasers were selected for the treatment group.

“And since this is exactly the characterization of a hypergeometric distribution, if what you’re saying is true, then the number of successes observed in the treatment group has a known distribution.”

You proceed to open up Python and calculate:

>>> from scipy.stats import hypergeom
>>> N, n_T, s_T, s_C = 2000, 1000, 130, 100
>>> rv = hypergeom(N, s_C + s_T, n_T)  # population N, containing s_C + s_T purchasers; draw n_T
>>> rv.sf(s_T - 1)  # sf(k) = P(X > k), so sf(s_T - 1) = P(X >= s_T)
0.020952274191867567

After contemplating the pros and cons of one-sided vs. two-sided tests, you decide to double the result and call it a two-sided p-value against the null hypothesis of no effect.

>>> 2 * rv.sf(s_T - 1)
0.041904548383735134

(In general, you can perform a two-sided test by conducting two one-sided tests, of inferiority and superiority, then doubling the smaller p-value and truncating at 1 (Cox et al. 1977; Rosenbaum 2019, sec. 3, Footnote 3). When the observed outcome is larger than expected under \( H_0 \), the test of superiority leads to the smaller p-value, so we can just double that one. When in doubt, calculate:

>>> min(1.0, 2 * min(rv.cdf(s_T), rv.sf(s_T - 1)))
0.041904548383735134

This annoying formula handles the discreteness of the hypergeometric distribution.)

While this does not disprove the stranger’s claim to the level of logical certainty, you know that nothing can do that. The possibility of no effect can never be disproven. And so you merely reject the stranger’s claim as implausible, and he vanishes in a poof of smoke. Only then do you realize the stranger was really just a framing device to summarize the main points of the previous post.

These main points are:

  • We can never disprove the null hypothesis of no effect to the level of logical certainty.
  • A sharp null hypothesis is not something we believe or disbelieve. Rather, it creates a world we can explore. In combination with the random assignment of people to treatment, a sharp null hypothesis allows us to make probabilistic statements about our observations. The more unlikely our observations, the more implausible the null hypothesis. Yet this probabilistic approach is never definitive.
  • If our observations are not particularly unlikely, in no sense does this constitute evidence the null hypothesis is true. A null hypothesis can never be proven true; we can only fail to reject it as false.

Other Sharp Null Hypotheses

The role of any null hypothesis is to create a world we can explore. As Stephen King once wrote, “There are other worlds than these.” There is nothing special about the hypothesis of no effect.

As if to prove it, a dragon appears out of nowhere. The dragon says, “That other stranger was a charlatan, but I know what’s really going on in this test. I know the counterfactual outcomes for each person in the test. It would take a long time to list them out one by one, so instead I’m just going to summarize the control outcomes in a contingency table.”

Group          Successes under Control    Trials
Control        100                        1,000
“Treatment”    110                        1,000
Total          210                        2,000

As you start to explore this world the dragon created for you, you notice there are 20 fewer successes in the treatment group than there were before. The dragon is alleging that, had the treatment group instead been held back from marketing, there would have been 20 fewer purchases. (This second row isn’t really a “treatment group” then, but rather a second control group. But it’s the same people as what we’ve been calling the treatment group, so we’ll maintain the label.) That is, the dragon is alleging that marketing caused 20 purchases.

“If marketing caused only 20 purchases, then why are there still more purchases in the treatment group than in control?” you ask suspiciously. But we need not believe a null hypothesis in order to explore it, so let’s continue.

You next notice the first row of the contingency table is the same as before. If it were otherwise, you could immediately dismiss the dragon’s claim: the dragon claims this is the contingency table of control outcomes, but the control outcomes in the control group are directly observed. Had the dragon said anything else, you could disprove his claim (to the level of logical certainty). Only claims involving counterfactual outcomes require speculation!

You realize there are in principle \( 2^{n_T} \) possible counterfactual outcomes for the treatment group, but there is a lot of redundancy. Everyone exhibiting a positive response in the treatment group is identical from the perspective of a contingency table, as are those exhibiting a negative response. A claim about counterfactual outcomes for the treatment group is effectively a claim about how many observed successes would have been failures, had the individuals not received treatment, and how many observed failures would have been successes. There are \( (s_T + 1) \times (n_T - s_T + 1),\) not \( 2^{n_T} \) distinct hypotheses.

But only the net effect shows up in the contingency table of alleged control outcomes. The hypotheses:

  • 30 successes in the treatment group would have been failures, and 10 failures would have been successes, had this group not received treatment; and,
  • 20 successes in the treatment group would have been failures, but no failures would have been successes,

both involve a net 20 fewer successes under control. Both correspond to an alleged 20 purchases caused by marketing. Both hypotheses lead to the same contingency table of control outcomes. So there is still redundancy among the \( (s_T + 1) \times ( n_T - s_T + 1) \) hypotheses. Two hypotheses corresponding to the same net effect lead to the same contingency table.

(Rosenbaum 2001) uses the term “attributable effect” instead of “net effect” since it is the number of successes in the treatment group that were caused by, or attributable to, the treatment.

There are only \( n_T + 1 \) distinct hypotheses then. The net effect can involve at most \( s_T \) successes becoming failures (for a contingency table with a 0 entry), or at most \( n_T - s_T \) failures becoming successes, plus a net effect of 0, for a total of \( s_T + (n_T - s_T) + 1 = n_T + 1 \) possible net effects.
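In our running example, that is just 1,001 distinct net effects, down from \( 131 \times 871 = 114{,}101 \) hypotheses of the previous form (to say nothing of \( 2^{1000} \)).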

In this table of control outcomes, there really are only two types of people, according to their response under control. So if the purported net effect is correct, then the number of (control) successes in the treatment group follows a hypergeometric distribution. We calculate a two-sided p-value against the hypothesis that the treatment led to 20 additional successes:

>>> a0 = 20
>>> rv = hypergeom(N, s_C + s_T - a0, n_T)
>>> min(1.0, 2 * min(rv.cdf(s_T - a0), rv.sf(s_T - a0 - 1)))
0.5115930741739885

The claim is plausible, at least from the perspective of the observations. We might have other reasons for considering the claim implausible, such as dragons being generally untrustworthy, but the data alone are insufficient to reject the possibility. We could repeat this exercise for each candidate net effect, from \( -(n_T - s_T) \) to \( +s_T, \) retaining the effects with p-values above, say, 0.05. That gives us a 95% confidence interval on the net effect.
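
Here is a minimal sketch of that scan, continuing the Python session from above (the p_value helper is just my shorthand for the two-sided calculation we keep repeating):

>>> def p_value(a0):
...     # two-sided p-value against the hypothesis of a net effect of a0
...     rv = hypergeom(N, s_C + s_T - a0, n_T)
...     return min(1.0, 2 * min(rv.cdf(s_T - a0), rv.sf(s_T - a0 - 1)))
...
>>> plausible = [a0 for a0 in range(-(n_T - s_T), s_T + 1) if p_value(a0) >= 0.05]
>>> min(plausible), max(plausible)
(2, 55)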

A net effect of 1, corresponding to 1 purchase caused by marketing, has two-sided p-value 0.049, while a net effect of 2 has p-value 0.057. So we reject a net effect of 1 as (barely) implausible, but retain a net effect of 2 as plausible.

A net effect of 55 has p-value 0.057, while a net effect of 56 has p-value 0.047. So we retain a net effect of 55 as plausible, but reject 56 as being inconsistent with the data. A 95% confidence interval has endpoints 2 and 55.

We don’t even need to evaluate \( n_T + 1 \) hypotheses in order to calculate a confidence interval. In the next section, we will show how calculating a point estimate is just simple algebra. The point estimate by definition is consistent with the data and thus is within the confidence interval.

We know the net effect is at most \( s_T, \) so bisection may be used to find the largest integer between the point estimate and \( s_T \) with a p-value at least equal to 0.05. This is the upper bound on a 95% confidence interval. Similarly, we know the net effect is at least \( -(n_T - s_T), \) so bisection may be used to find the lower bound on the interval. This bisection runs in \( \mathcal{O}(\log n_T ) \) time.
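
As a sketch, here is the search for the upper endpoint, reusing the p_value helper from the scan above and assuming the p-value decreases as the hypothesized effect moves away from the point estimate:

>>> def upper_endpoint(lo, hi, alpha=0.05):
...     # largest a0 in [lo, hi] with p_value(a0) >= alpha
...     while lo < hi:
...         mid = (lo + hi + 1) // 2
...         if p_value(mid) >= alpha:
...             lo = mid
...         else:
...             hi = mid - 1
...     return lo
...
>>> upper_endpoint(30, s_T)  # 30 is the point estimate derived in the next section
55

The lower endpoint is found the same way, searching between \( -(n_T - s_T) \) and the point estimate.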

We started off with the daunting task of considering \( 2^{n_T} \) hypotheses, but in fact only need to consider \( \mathcal{O}(\log n_T ) \) hypotheses to calculate a point estimate and confidence interval.

The Hodges-Lehmann Trick

(Hodges and Lehmann 1963) describe a “trick” for calculating a point estimate based on a hypothesis test. This point estimate is the value that leads to the test statistic having its expected value under the null hypothesis.

For a null hypothesis that the net effect is \( A_0, \) the number of successes in the treatment group (had they been assigned to control) is \( s_T - A_0. \) This quantity has a \( \mathrm{Hypergeom}(N, s_C + s_T - A_0, n_T) \) distribution, with expected value \( n_T \cdot (s_C + s_T - A_0) / N. \) We set these quantities equal and solve:

\( \begin{align*} s_T - A_0 &= n_T \cdot (s_C + s_T - A_0) / N \\ \Rightarrow A_0 &= \frac{N \cdot s_T - n_T \cdot (s_C + s_T)}{N - n_T}. \end{align*} \)

Plugging in \( N = 2000, \) \( n_T = 1000, \) \( s_T = 130, \) and \( s_C = 100, \) we get \( A_0 = 30. \) The p-value associated with this hypothesis is 1.0, consistent with our claim in the previous section that the point estimate is always part of the confidence interval.
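
As a quick check, continuing the same session:

>>> (N * s_T - n_T * (s_C + s_T)) / (N - n_T)
30.0
>>> p_value(30)  # the point estimate sits comfortably inside the confidence interval
1.0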

In general, the result of this formula may not be an integer. In that case, we can calculate the p-values corresponding to the floor and the ceiling, and take as the point estimate whichever of the two has the larger p-value.
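
A minimal sketch of that rule (point_estimate is my own, hypothetical, name):

>>> import math
>>> def point_estimate():
...     a0 = (N * s_T - n_T * (s_C + s_T)) / (N - n_T)
...     lo, hi = math.floor(a0), math.ceil(a0)
...     # when a0 is an integer, floor and ceiling agree
...     return lo if p_value(lo) >= p_value(hi) else hi
...
>>> point_estimate()
30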

Summary and Further Reading

In our last post, we discussed how a sharp null hypothesis, in combination with the randomization used to conduct the experiment, creates a world we can explore. Generalizing the technique of proof by contradiction, sufficiently unlikely outcomes lead us to reject the hypothesis in question. In this post, we saw this procedure is not limited to the sharp null hypothesis of no effect, but can be used for any hypothesized pattern of counterfactual outcomes.

In the case of binary outcomes, there are \( 2^N \) possible counterfactual hypotheses (or \( 2^{n_T} \) when considering only the treatment group). Because of simplifications enabled by the use of contingency tables, we do not need to evaluate \( 2^N \) hypotheses, but rather only \( \mathcal{O}(\log n_T) \) to calculate a point estimate and confidence interval on the net (or attributable) effect.

This procedure did not assume the units involved in the experiment were a sample from a population; neither did it assume any model for the outcome. It only relied on the randomization used to conduct the experiment.

Our discussion focused on the treatment group, but the same procedure could be applied to the control group. The net effect would then be interpretable as the “opportunity cost” of running the experiment: the number of additional successes we would have observed, had everyone been exposed to the treatment. (Rigdon and Hudgens 2015) performed both analyses and combined the results to estimate the average treatment effect (across both groups).

(Rosenbaum 2002) and (Rosenbaum 2003) applied similar approaches to estimate effect sizes for continuous outcomes, but I find the results hard to interpret. In contrast, the net or attributable effect introduced in (Rosenbaum 2001) and discussed here is exactly the quantity most relevant to characterizing the impact of a marketing campaign.

References

Cox, David R., Emil Spjøtvoll, Søren Johansen, Willem R. van Zwet, J. F. Bithell, Ole Barndorff-Nielsen, and M. Keuls. 1977. “The Role of Significance Tests [with Discussion and Reply].” Scandinavian Journal of Statistics 4 (2). [Board of the Foundation of the Scandinavian Journal of Statistics, Wiley]: pg. 49–70. http://www.jstor.org/stable/4615652.
Hodges, J. L., and E. L. Lehmann. 1963. “Estimates of Location Based on Rank Tests.” The Annals of Mathematical Statistics 34 (2). Institute of Mathematical Statistics: pg. 598–611. http://www.jstor.org/stable/2238406.
Rigdon, Joseph, and Michael G Hudgens. 2015. “Randomization Inference for Treatment Effects on a Binary Outcome.” Stat. Med. 34 (6): pg. 924–35.
Rosenbaum, Paul R. 2001. “Effects Attributable to Treatment: Inference in Experiments and Observational Studies with a Discrete Pivot.” Biometrika 88 (1). [Oxford University Press, Biometrika Trust]: pg. 219–31. http://www.jstor.org/stable/2673680.
———. 2002. “Attributing Effects to Treatment in Matched Observational Studies.” Journal of the American Statistical Association 97 (457). [American Statistical Association, Taylor & Francis, Ltd.]: pg. 183–92. http://www.jstor.org/stable/3085773.
———. 2003. “Exact Confidence Intervals for Nonconstant Effects by Inverting the Signed Rank Test.” The American Statistician 57 (2). [American Statistical Association, Taylor & Francis, Ltd.]: pg. 132–38. http://www.jstor.org/stable/30037246.
———. 2019. Observation & Experiment: An Introduction to Causal Inference. Harvard University Press.
