Naive Comparisons Under Endogeneity

Recently I have been reading Causal Inference: The Mixtape by Scott Cunningham. One thing I think Cunningham explains very well is the role of endogeneity in confounding even simple comparisons. I don’t have a background in economics, so I had never really grokked the concepts of endogenous and exogenous factors, especially as they relate to causal inference. In this post, I’m going to discuss a few examples that highlight why it’s such an important distinction.

Cunningham discusses a “perfect doctor” who always knows the right treatment for her patient. Any decision making is judged relative to potential outcomes: if we make a certain decision, we get a certain result; if we make a different decision we get a potentially different result. The best decision is the one which leads to the best result, but since most of us don’t have a crystal ball, we can only speculate about the likely outcomes of particular decisions. We can only make the decision we believe will lead to the best result.

But suppose this perfect doctor, through her expertise, does in fact know in advance the outcome of the treatments she prescribes. A person with cancer comes in, and the doctor knows that if the patient undergoes chemo, the patient will live, but if the patient undergoes radiation or surgery, the patient will die. So the doctor recommends chemo and the patient is cured. Another person comes in with high cholesterol, and the doctor knows that if the patient starts eating healthier, the patient will get better, but if the patient begins a rigorous exercise regime, they will have a heart attack. So the doctor recommends a diet plan and the patient gets better. To an outside observer, it looks like no matter what the doctor recommends, the patient always gets better, so maybe it just doesn’t matter what the doctor recommends. The perfect doctor is making it look too easy!

Now imagine two doctors, neither of them perfect like the doctor in the first example: one is very knowledgeable and experienced, and the other is inexperienced and bumbling. We’ll call them Dr. House and Dr. Gilligan, respectively. Dr. House and Dr. Gilligan both work in the same hospital. Everyone knows that Dr. House is the best around, so whenever a hopeless case comes in it goes to Dr. House. And Dr. House is so good that even though the cases are hopeless, Dr. House is able to save the patient’s life 50% of the time. Everyone is afraid to send any patients to Dr. Gilligan, but there are too many patients and too few doctors, so the hospital sends him only the mildest of cases. Even so, Dr. Gilligan still manages to mess it up 10% of the time. So Dr. House has a patient survival rate of 50% and Dr. Gilligan has a patient survival rate of 90%. To an outside observer, it seems like Dr. Gilligan is the much better doctor.

Both of these examples involve naive comparisons that don’t hold up under scrutiny. In the first example, we are comparing the outcomes under one course of treatment with another course of treatment. But these situations are not comparable, because the treatment is not independent of the underlying condition. Indeed, the perfect doctor tailored her treatment to the underlying condition, and that’s what makes her the perfect doctor. And in the second example, we are comparing the results from one doctor to another, but the two doctors see very different patients: Dr. House only sees the hopeless cases and Dr. Gilligan only sees the mildest cases.
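To make the perfect doctor concrete, here is a minimal simulation sketch in terms of potential outcomes. Everything in it is an assumption for illustration: there are two hypothetical treatments, "A" and "B", exactly one of which works for any given patient, and the doctor can see both potential outcomes before choosing.

```python
import random

random.seed(0)

# Hypothetical setup: each patient has a potential outcome under each of two
# treatments (1 = recovers, 0 = does not). Exactly one treatment works.
patients = []
for _ in range(1000):
    works = random.choice(["A", "B"])
    patients.append({"A": int(works == "A"), "B": int(works == "B")})


def perfect_doctor(patient):
    # She sees both potential outcomes and picks the better treatment.
    return "A" if patient["A"] >= patient["B"] else "B"


observed = [(perfect_doctor(p), p) for p in patients]

# Recovery rate among the patients who actually received each treatment.
rate = {}
for treatment in ("A", "B"):
    outcomes = [p[treatment] for chosen, p in observed if chosen == treatment]
    rate[treatment] = sum(outcomes) / len(outcomes)

naive_difference = rate["A"] - rate["B"]

# Per-patient effect of A versus B, visible only inside the simulation.
individual_effects = [p["A"] - p["B"] for p in patients]

print("observed recovery rates:", rate)
print("naive difference:", naive_difference)
print("distinct individual effects:", sorted(set(individual_effects)))
```

The observed recovery rate is 100% under both treatments, so the naive difference is zero, even though for every single patient the choice of treatment is the difference between recovering and not.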

We make comparisons every day in order to inform our decisions, but seldom do we stop to think whether these comparisons make sense, or whether the comparisons are relevant to the decisions we’re making. The issue is whether the comparison we are making does indeed involve comparable observations. In both of these examples, the reason the comparisons were invalid is that there were human beings deciding what to do (which treatment to apply, which doctor to assign) based on what they believed was best. The decisions that were made were not independent of the potential outcomes, but rather were consciously and intentionally dependent on them.

This is what “endogenous” means: the decisions that were made were based on internal factors relevant to the outcome. In contrast, “exogenous” means decisions were made based on external factors not related to the outcome, such as random assignment. Any time humans are making decisions, those decisions are almost by definition endogenous. Only a cruel or incompetent person would make decisions without regard to the best outcome.

Here’s another example. Two people have headaches: one severe, the other mild. The person with the severe headache thinks, “I would do anything to be rid of this headache. I’ll have some aspirin.” The other thinks, “well, I could take some aspirin, but then I’d have to get out of my chair and I just can’t be bothered.” An hour later, the person with the severe headache still has a headache, but it is much better than before. The second person’s headache has completely vanished. An outside observer, who cannot see the severity of the headaches, just sees one person with a headache after taking aspirin and another without a headache who didn’t. They might conclude that aspirin causes headaches!

Again, it is not a valid comparison. Each person chose their course of action (to take aspirin, or not), according to the severity of their symptoms and their other desires and priorities. Each person did what they believed was best for themselves. For the comparison to be valid, the courses of action would need to be independent of the potential outcomes, and when we are talking about human decision making, nothing could be further from the truth. We say the decision making is endogenous, being influenced by internal factors like beliefs and desires.

Through this lens, we see that virtually any naive comparison we might make is suspect. Comparing restaurants on Yelp? The people who went to different restaurants likely had different priorities (e.g. cost vs convenience vs service vs deliciousness), and the reviews simply reflect how the restaurant lived up to those incomparable priorities. Comparing hotels on a travel review site? Same issue.

How can we possibly make intelligent decisions then? By creating comparisons that are both valid and relevant. First of all, when we are in an experimental setting and can randomly assign different courses of action (different treatments, different restaurants) to different people, the decision making (who experiences what) is in fact exogenous, based solely on external factors, and thus the comparison is valid. That’s why A/B testing is so powerful and why I talk about it so much on this blog.
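Here is a rough sketch of why random assignment helps, using the aspirin example. The numbers are made up: I assume a pain score from 0 to 10, assume aspirin lowers it by exactly 3 points for everyone, and let the score dip below zero for simplicity. The only thing that changes between the two runs is who decides whether aspirin gets taken.

```python
import random
from statistics import mean

random.seed(1)
TRUE_EFFECT = -3.0  # assumed: aspirin lowers the pain score by 3 points


def pain_after(severity, took_aspirin):
    # Pain an hour later: starting severity, plus the effect, plus noise.
    noise = random.gauss(0, 0.5)
    return severity + (TRUE_EFFECT if took_aspirin else 0.0) + noise


def difference_in_means(assign):
    # Mean pain of aspirin takers minus mean pain of everyone else.
    treated, untreated = [], []
    for _ in range(20000):
        severity = random.uniform(0, 10)
        took = assign(severity)
        (treated if took else untreated).append(pain_after(severity, took))
    return mean(treated) - mean(untreated)


# Endogenous: people with severe headaches choose to take aspirin.
endogenous = difference_in_means(lambda severity: severity > 5)

# Exogenous: a coin flip decides, ignoring severity entirely.
exogenous = difference_in_means(lambda severity: random.random() < 0.5)

print(f"self-selected: {endogenous:+.2f}")
print(f"randomized:    {exogenous:+.2f}")
```

Under self-selection the aspirin takers end up with more pain than the non-takers, so aspirin looks harmful. Under the coin flip the difference in means lands close to the assumed effect of -3.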

But even in an observational setting, it is often possible to improve the validity of the comparison. In the case of Dr. House and Dr. Gilligan, sure, these two doctors tend to see very different types of patients. But is it ever the case that Dr. Gilligan sees a patient that typically would be seen by Dr. House (perhaps Dr. House was too busy)? Is it ever the case that Dr. House sees a patient that would typically be seen by Dr. Gilligan? These are situations where the patient profiles overlap between the two doctors. We can compare how the two doctors perform solely on hopeless cases, and separately compare performance solely on mild cases. This then is a much more valid comparison.
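A small worked example of comparing within case types. The counts below are hypothetical. The 50% and 90% figures come from the story above; the overlap cases (Dr. House on mild cases, Dr. Gilligan on hopeless ones) and their survival rates are pure assumptions.

```python
from collections import defaultdict

# Hypothetical counts: (doctor, case type, patients, survivors).
records = [
    ("House", "hopeless", 950, 475),
    ("House", "mild", 50, 49),
    ("Gilligan", "hopeless", 50, 5),
    ("Gilligan", "mild", 950, 855),
]


def survival_rates(key):
    # Pool patients and survivors by the given grouping, then take the ratio.
    totals = defaultdict(lambda: [0, 0])
    for doctor, case, patients, survivors in records:
        group = key(doctor, case)
        totals[group][0] += patients
        totals[group][1] += survivors
    return {group: s / n for group, (n, s) in totals.items()}


naive = survival_rates(lambda doctor, case: doctor)
stratified = survival_rates(lambda doctor, case: (case, doctor))

print("naive:", naive)
for group, value in sorted(stratified.items()):
    print(group, round(value, 2))
```

With these counts the naive comparison favors Dr. Gilligan (86% versus about 52%), but Dr. House comes out ahead within both the hopeless cases and the mild cases.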

People tend to take aspirin when they have severe headaches, but is it ever the case that a person takes aspirin for a mild headache (or no headache)? Is it ever the case that a person with a severe headache does not take aspirin? Then we can compare the outcomes just for severe headaches, and, separately, just for mild headaches. These are more valid comparisons.

Cunningham suggests using propensity scores not just to facilitate better comparisons (which has received some criticism recently), but also to detect when comparisons are invalid. We fit a model (such as a logistic regression model or a neural network) that predicts the probability an individual will take a particular course of action, and plot histograms of these probabilities for the individuals associated with each course of action. If the histograms are radically different, it suggests endogeneity is playing a big role, rendering naive comparisons invalid. If the histograms are similar, that doesn’t prove the comparison is valid (perhaps there are other factors not considered), but it does provide greater assurance in our decisions.
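Here is a minimal sketch of that diagnostic on simulated data. It is not Cunningham’s code. The data are hypothetical (a single covariate `x` drives who takes the action), and a hand-rolled gradient descent stands in for whatever off-the-shelf logistic regression you would normally reach for. The histograms are printed as counts per bin rather than plotted.

```python
import math
import random

random.seed(2)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical data: one pre-treatment covariate x drives who takes the action.
xs = [random.gauss(0, 1) for _ in range(1000)]
took_action = [1 if random.random() < sigmoid(2.0 * x) else 0 for x in xs]


def fit_logistic(xs, ys, steps=300, lr=0.5):
    """One-covariate logistic regression fitted by plain gradient descent."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


b0, b1 = fit_logistic(xs, took_action)
scores = [sigmoid(b0 + b1 * x) for x in xs]  # estimated propensity scores


def histogram(values, bins=10):
    # Count values falling in equal-width bins over [0, 1].
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return counts


hist_took = histogram([s for s, t in zip(scores, took_action) if t == 1])
hist_skipped = histogram([s for s, t in zip(scores, took_action) if t == 0])

for i, (a, b) in enumerate(zip(hist_took, hist_skipped)):
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f}  took={a:4d}  skipped={b:4d}")
```

In this simulation the two histograms pile up at opposite ends of the scale, which is the warning sign: the people who took the action don’t look like the people who didn’t.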

Here’s one more example to illustrate these ideas. A few years ago, I was working at a company that released a new feature. The CEO asked whether the new feature improved retention. He suggested comparing the retention of people who used the feature with people who didn’t. This isn’t necessarily a valid comparison since perhaps some people are planning on quitting the app soon and therefore don’t bother trying the new feature. Others are very engaged with the app and love trying all the new features. The suggested comparison says less about the new feature than it does about the people who chose to use it.

The first thing we can do is fit a model that predicts how likely someone is to use the new feature (based on demographics or behaviors exhibited before the feature was released). We can then apply this model to the people who did use the feature, giving one histogram, and apply it to the people who did not use the feature, giving another histogram. Comparing these histograms, we might see dramatic differences, demonstrating that the naive comparison is invalid. The sorts of people who use the feature just aren’t comparable to the sorts of people who don’t use it. But then we can look for overlap: is it ever the case that the sort of person who would normally be expected to use the feature just didn’t for some reason? Is it ever the case that someone who normally wouldn’t be expected to use the feature, does in fact try it? Concentrate the comparisons just on those situations. Again, this isn’t perfect, but it is vastly more defensible than the naive comparison.
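To illustrate concentrating on the overlap, here is a simulated sketch. This is not the analysis I actually ran at that company. The engagement score, the propensity function, the retention formula, and the assumed 5 percentage point lift are all invented. To keep it short, the simulation uses the true propensity directly where in practice you would use a fitted model’s estimate. The 0.1 to 0.9 cutoff is just one common rule of thumb.

```python
import random
from collections import defaultdict

random.seed(3)
TRUE_EFFECT = 0.05  # assumed lift in retention probability from the feature


def propensity(engagement):
    # Hypothetical: the least engaged never try the feature, the most always do.
    return min(1.0, max(0.0, 1.25 * engagement - 0.125))


def rate(flags):
    return sum(flags) / len(flags)


used_all, skipped_all = [], []
bins = defaultdict(lambda: ([], []))  # score bin -> (used, skipped) outcomes
for _ in range(50000):
    engagement = random.random()
    score = propensity(engagement)  # in practice, a fitted model's estimate
    used = random.random() < score
    retained = int(random.random() < 0.2 + 0.6 * engagement + TRUE_EFFECT * used)
    (used_all if used else skipped_all).append(retained)
    if 0.1 <= score < 0.9:  # keep only the region where both groups appear
        bins[int(score * 10)][0 if used else 1].append(retained)

naive = rate(used_all) - rate(skipped_all)

# Compare like with like inside each score bin, then average by bin size.
total = sum(len(u) + len(s) for u, s in bins.values())
adjusted = sum(
    (rate(u) - rate(s)) * (len(u) + len(s)) / total for u, s in bins.values()
)

print(f"naive: {naive:.3f}  overlap-only: {adjusted:.3f}  true: {TRUE_EFFECT}")
```

In this simulation the naive difference is several times larger than the assumed lift, while the overlap-only estimate should land much closer to it. The estimate only describes the people in the overlap region, and it only works here because the simulation has no hidden factors driving both feature use and retention.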

Bob Wilson
Marketing Data Scientist

The views expressed on this blog are Bob’s alone and do not necessarily reflect the positions of current or previous employers.