User Segmentation from Heterogeneous Treatment Effects
Imagine we are attempting to identify segments within an audience, perhaps so we can market to them more effectively through personalization. A common approach to doing so is to apply some kind of clustering algorithm (such as K-means) based on various user covariates. I have never been especially happy with this approach: the resulting clusters seem pretty arbitrary.
If you have the resources, it may make more sense to talk to a few dozen of your potential customers and look for patterns. For example, if you’re a concert ticketing company, you might notice that some people like particular artists, some like a particular type of music, some just enjoy the energy of a big concert venue, and so forth. Based on these insights you may wish to develop different advertisements that speak to different segments.
Heterogeneous treatment effects offer one approach to predicting who will respond best to different creative strategies. Given $m$ segments, we might develop a creative for each, randomly assign different people in a marketing campaign to different creatives, and fit a model predicting the response rate (click-through rate or conversion rate, for example) to each creative. Specifically, suppose: $$ \begin{align} y_k^{(i)} &\sim \mathrm{Bern}(\pi_k^{(i)}) \\ g\left(\pi_k^{(i)}\right) &= \eta_k\left(x^{(i)}\right) = \beta_k^T x^{(i)}, \end{align} $$ where $y_k^{(i)}$ is 1 if person $i$ responds positively to creative $k$ and 0 otherwise, and $g$ is a suitable link function, such as the logit. We can fit $m$ models, one for each creative, and since the creatives are randomly assigned, the model has a causal interpretation. We might then assign a person with covariates $x^{(i)}$ to segment $\mathrm{arg\,max}_k \{ \beta_k^T x^{(i)} \}$; since $g$ is monotone, maximizing the linear predictor also maximizes the predicted response probability.
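As a concrete sketch of this baseline (not code from the original; the simulated data and names like `creative`, `X`, and `beta_hat` are illustrative), we might fit one logistic regression per creative on the randomly assigned traffic and then take the argmax of the linear predictors:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
m, d, n = 3, 5, 3000                       # creatives, covariates, people (toy sizes)
X = rng.normal(size=(n, d))                # covariates x^(i)
creative = rng.integers(0, m, size=n)      # random assignment to creatives
true_beta = rng.normal(size=(m, d))
eta = np.einsum("ij,ij->i", X, true_beta[creative])
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))  # simulated responses

# Fit one model per creative, using only the people who saw that creative.
beta_hat = np.zeros((m, d))
for k in range(m):
    mask = creative == k
    model = LogisticRegression(fit_intercept=False)  # no intercept, matching the model above
    model.fit(X[mask], y[mask])
    beta_hat[k] = model.coef_.ravel()

# Assign each person to the creative with the largest linear predictor beta_k^T x^(i).
segment = np.argmax(X @ beta_hat.T, axis=1)
```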
What this model misses is that a person might respond best to a particular creative for any number of reasons, not just because of the intention behind the creative. So instead imagine we develop $n_j$ creatives for segment $j$. The model becomes: $$ \begin{align} y_{jk}^{(i)} &\sim \mathrm{Bern}(\pi_{jk}^{(i)}) \\ g\left(\pi_{jk}^{(i)}\right) &= \eta_{jk}\left(x^{(i)}\right) = \beta_{jk}^T x^{(i)}, \end{align} $$ where $j$ indicates the segment and $k$ represents the creative within the segment. Note that $\beta_{jk}$ is a vector, of length equal to the number of covariates.
Now, if person $i$ really belongs in segment $j$, then we expect $\beta_{jk}^T x^{(i)}$ to be high for $k=1, \ldots, n_j$ (and low for the other $\beta$s). This is accomplished if $\beta_{j1} \approx \beta_{j2} \approx \cdots \approx \beta_{jn_j} \approx \gamma_j$, where $\gamma_j$ acts as a representative for all creatives in segment $j$.
That motivates the optimization problem: $$ \begin{array}{cl} \mbox{minimize} & -\ell(\beta; y, x) + \lambda \sum_{j=1}^m \frac{1}{n_j} \sum_{k=1}^{n_j} \| \beta_{jk} - \gamma_j \|_2, \end{array} $$ where we minimize jointly over the creative-level coefficients $\beta_{jk}$ and the segment-level representatives $\gamma_j$, and $\ell$ is the log-likelihood of the model above. This regularization strategy is the group lasso, applied here to the differences $\beta_{jk} - \gamma_j$. We normalize by $n_j$ in case there are different numbers of creatives in different segments, so as not to penalize a segment more just because it has more creatives.
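Since the objective is convex, one way to prototype the fit is with a generic convex solver. Here is a minimal sketch using cvxpy, assuming a logit link; the data layout (one row per impression, recording which segment and creative was shown) and names like `beta`, `gamma`, and `lam` are illustrative, not from the original:

```python
import numpy as np
import cvxpy as cp

# Toy data: n people, d covariates, m segments with n_j creatives each.
rng = np.random.default_rng(0)
n, d, m, n_j = 2000, 5, 3, 2
X = rng.normal(size=(n, d))
seg = rng.integers(0, m, size=n)       # segment of the creative shown to person i
crt = rng.integers(0, n_j, size=n)     # creative within that segment
y = rng.binomial(1, 0.3, size=n)       # placeholder outcomes (use observed responses in practice)

lam = 1.0
beta = {(j, k): cp.Variable(d) for j in range(m) for k in range(n_j)}
gamma = {j: cp.Variable(d) for j in range(m)}

# Negative Bernoulli log-likelihood under a logit link: sum_i log(1 + exp(eta_i)) - y_i * eta_i.
nll = 0
for j in range(m):
    for k in range(n_j):
        mask = (seg == j) & (crt == k)
        eta = X[mask] @ beta[(j, k)]
        nll += cp.sum(cp.logistic(eta) - cp.multiply(y[mask], eta))

# Group lasso penalty pulling each beta_jk toward its segment representative gamma_j.
penalty = sum(
    (1.0 / n_j) * cp.norm(beta[(j, k)] - gamma[j], 2)
    for j in range(m) for k in range(n_j)
)

problem = cp.Problem(cp.Minimize(nll + lam * penalty))
problem.solve()
```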
As $\lambda$ becomes larger, the group lasso drives more of the creative-specific vectors $\beta_{jk}$ into exact agreement with their segment-specific $\gamma_j$. But if a particular $\beta_{jk}$ does not match $\gamma_j$, then in general all of its entries will disagree (recall that $\beta_{jk}$ is a vector); this all-or-nothing behavior is characteristic of the group lasso. We then assign people to segments based on $\gamma$, not $\beta$, assigning person $i$ to segment $\mathrm{arg\,max}_j \{ \gamma_j^T x^{(i)} \}$.
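Continuing the same hypothetical variables from the sketch above, the assignment step is then a single argmax over the fitted segment representatives:

```python
# Stack the fitted segment representatives gamma_j into an (m, d) array.
gamma_hat = np.vstack([gamma[j].value for j in range(m)])

# Person i is assigned to argmax_j gamma_j^T x^(i).
assigned_segment = np.argmax(X @ gamma_hat.T, axis=1)
```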
When the number of covariates is large, we may wish to impose sparsity-inducing regularization on $\gamma$ as well, which can help with the interpretability of the segments. We therefore propose: $$ \begin{array}{cl} \mbox{minimize} & -\ell(\beta; y, x) + \lambda \sum_{j=1}^m \left( \frac{1-\alpha}{n_j} \sum_{k=1}^{n_j} \| \beta_{jk} - \gamma_j \|_2 + \alpha \| \gamma_j \|_1 \right), \end{array} $$ in the spirit of the sparse group lasso, where $0 \leq \alpha \leq 1$ trades off between a sparser, more interpretable $\gamma$ and a better fit to the different creatives.
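In the cvxpy sketch above only the penalty changes; `alpha` here is an illustrative value, tuned together with `lam` in practice:

```python
alpha = 0.5  # hypothetical setting; see the cross-validation discussion below

penalty = sum(
    (1 - alpha) / n_j * cp.norm(beta[(j, k)] - gamma[j], 2)
    for j in range(m) for k in range(n_j)
) + sum(alpha * cp.norm(gamma[j], 1) for j in range(m))

problem = cp.Problem(cp.Minimize(nll + lam * penalty))
problem.solve()
```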
We can use cross-validation to tune $\lambda$ and $\alpha$. Having done so, we can inspect which $\beta$s exactly match their corresponding $\gamma$s; if a particular segment exhibits consistency across creatives, we can be confident that other creatives designed for that segment will perform similarly. If, on the other hand, different creatives within a segment perform very differently, that may be a clue that the segment is not capturing how people respond.
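One way to carry out that inspection, continuing the same hypothetical variable names (solvers return approximate solutions, so "exact" matches are judged against a small numerical tolerance):

```python
tol = 1e-6  # tolerance for declaring beta_jk fused to gamma_j

for j in range(m):
    for k in range(n_j):
        gap = np.linalg.norm(beta[(j, k)].value - gamma[j].value)
        status = "fused" if gap <= tol else "distinct"
        print(f"segment {j}, creative {k}: ||beta_jk - gamma_j|| = {gap:.2e} ({status})")
```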
Note that the objective is not differentiable, so plain gradient methods do not apply directly; we can instead use a sub-gradient method or a proximal algorithm to solve the problem. The objective is, however, convex, so we can still solve it efficiently, even when millions of people are involved and the model includes thousands of covariates.
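For readers who prefer a custom solver over a generic one, the key ingredient is the proximal operator of the group penalty, which is block soft-thresholding. The sketch below, and the update rule in the comment (a proximal-gradient step on $\beta_{jk}$ with $\gamma_j$ held fixed), are an assumed implementation strategy rather than a method prescribed above:

```python
import numpy as np

def block_soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_2 (block soft-thresholding).

    Shrinks v toward zero, returning exactly zero when ||v||_2 <= tau; this is
    what produces exact agreement beta_jk == gamma_j at the optimum.
    """
    norm = np.linalg.norm(v)
    if norm <= tau:
        return np.zeros_like(v)
    return (1.0 - tau / norm) * v

# A proximal-gradient step on beta_jk (step size t, gamma_j fixed) might look like:
#   v = beta_jk - t * (gradient of the negative log-likelihood w.r.t. beta_jk)
#   beta_jk = gamma_j + block_soft_threshold(v - gamma_j, t * lam * (1 - alpha) / n_j)
```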