# Thoughts on Models with Regularization

My very first data science project was a failure. I had just started at Tinder and I had been tasked with understanding the conversion behavior of Tinder’s recently launched subscription product, Tinder Plus. So I built a model that predicted whether a person would convert based on age, gender, how many swipes they had made, their “swipe-right ratio”, etc.

The model didn’t work at all. It predicted that *no one* would convert. It
quickly became clear why: the overall conversion rate was quite low. The
algorithm could achieve very high accuracy just by predicting that no one
would convert, and it was a struggle to get it to change its mind.

The problem was that I was treating this as a classification problem, when I
should have been treating it as a regression problem. (Actually, I should
have been treating it as a causal inference problem, but it would be several
years before I could appreciate that.) So I built a model that predicted the
*probability* that a person would convert, based on various characteristics
of that person. I used a simple logistic regression model because that’s
what I had learned in school. (I would still use logistic regression today,
at least as a starting point, because it is fast to fit and easy to
interpret. Only once I felt I had squeezed every bit of performance out of
logistic regression would I turn to something like XGBoost.)

At least this model didn’t issue ridiculous predictions like “everyone has a 0% chance of converting”. But it did make roughly the same prediction for everyone: the overall average conversion rate. I had hoped it would say, “oh, this person has an 80% chance of converting, and this other person has a 0.01% chance of converting”.

It got worse when I started making plots. The model said that the 80-year-olds on Tinder were the most likely to convert. (That was before I learned how to clean the data; some of the ages were definitely bogus!) It took me a long time to realize that, in their simplest form, logistic regression models are monotonic in each feature. If the model finds a positive association between age and conversion rate, say because a 28-year-old is more likely to convert than a 21-year-old, then it must predict that an 80-year-old is even more likely to convert!
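The monotonicity is easy to demonstrate. Here is a minimal sketch with synthetic data (not Tinder’s): the true conversion probability peaks around age 35, but a plain logistic regression, being a monotone function of a linear score, still has to push one end of the age range to the extreme.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: true conversion probability is hump-shaped in age,
# peaking around 35 and falling off toward 80.
ages = rng.uniform(18, 80, size=5000)
true_p = 0.05 + 0.25 * np.exp(-((ages - 35) ** 2) / 200)
converted = rng.random(5000) < true_p

model = LogisticRegression().fit(ages.reshape(-1, 1), converted)

# With a single linear term, the predicted probability is sigmoid(a + b*age),
# which is monotonic in age: whatever the sign of b, the 80-year-olds land
# at one extreme, even though the true curve peaks in the middle.
probs = model.predict_proba(np.array([[21.0], [28.0], [80.0]]))[:, 1]
```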

Intuitively, it seemed more plausible that the relationship between age and
conversion rate was non-monotonic: it might increase as people got a bit older
and had more money to burn, but then decrease past a certain point. But simple
logistic regression models can’t capture this. I was lucky enough to take Rob
Tibshirani’s class at Stanford which covered Generalized Additive Models, so I
decided to try smoothing splines. And *wow* did that make a difference. The
pictures looked way more plausible (conversion rate increased with age up to a
certain point and then decreased – those 80 year olds weren’t converting), and
more importantly the predictions were much more varied: some people were much
more likely to convert than others.

I had included state (e.g. California or Nebraska) as a feature, but the predictions for the less populated states were all over the place. I needed regularization. I found a paper by Stephen Boyd on the Network Lasso and it seemed like exactly what I was looking for (it seems like whenever I’m stuck I’m able to find something, either in the Convex Optimization book or on Prof. Boyd’s list of publications, that gets me unstuck). The basic idea is that neighboring states would have their predictions smoothed together. A state like California had enough data that whatever the observed conversion rate was, that’s what would be predicted, but a state like Nebraska would inherit the conversion rate from neighboring states. I made some really pretty choropleth maps back then!
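A toy version of that graph-smoothing idea, with entirely made-up states, rates, and edges. (This uses squared differences between neighbors, a ridge-flavored cousin of the Network Lasso’s absolute-difference penalty, because the squared version has a closed-form solution; the qualitative behavior is the same: high-data states keep their observed rate, low-data states borrow from their neighbors.)

```python
import numpy as np

# Made-up observation counts and raw conversion rates per state.
states = ["CA", "NV", "NE", "IA"]
n = np.array([10000.0, 500.0, 40.0, 35.0])  # observations per state
y = np.array([0.12, 0.10, 0.45, 0.02])      # raw observed conversion rates

# Edges connect "neighboring" states; each edge contributes a penalty
# lam * (theta_i - theta_j)^2.
edges = [(0, 1), (1, 2), (2, 3)]
lam = 200.0

# Minimize sum_i n_i*(theta_i - y_i)^2 + lam * sum_edges (theta_i - theta_j)^2.
# Setting the gradient to zero gives the linear system
# (diag(n) + lam * L) theta = n * y, where L is the graph Laplacian.
L = np.zeros((4, 4))
for i, j in edges:
    L[i, i] += 1
    L[j, j] += 1
    L[i, j] -= 1
    L[j, i] -= 1
theta = np.linalg.solve(np.diag(n) + lam * L, n * y)

# CA barely moves (it has plenty of data); the noisy small-sample estimates
# for NE and IA are pulled toward each other and toward the regional average.
```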

I had started to learn more about statistical inference, and I knew point estimates weren’t enough. I also needed some way of capturing the uncertainty associated with models. I needed confidence intervals, so I implemented the bootstrap (which I also learned about in Rob Tibshirani’s class!). I noticed that the more regularization I used, the narrower the confidence intervals became. That didn’t seem right: why should there be less uncertainty associated with the predictions just because I was using more regularization?
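The effect is mechanical, and a toy example makes it obvious. Here a ridge-style estimator shrinks the sample mean toward zero by a factor of n/(n + λ); since every bootstrap replicate gets scaled by that same factor, the whole percentile interval shrinks with it (all numbers here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=2.0, size=100)

def shrunk_mean(x, lam):
    # Toy ridge-style estimator: shrink the sample mean toward zero
    # by the factor n / (n + lam).
    return len(x) * x.mean() / (len(x) + lam)

def percentile_ci(x, lam, n_boot=2000):
    # Percentile bootstrap: resample with replacement, re-estimate,
    # take the empirical 2.5% and 97.5% quantiles.
    boots = [shrunk_mean(rng.choice(x, size=len(x), replace=True), lam)
             for _ in range(n_boot)]
    return np.percentile(boots, [2.5, 97.5])

lo0, hi0 = percentile_ci(data, lam=0.0)    # no regularization
lo9, hi9 = percentile_ci(data, lam=900.0)  # heavy regularization

# The heavily regularized interval is much narrower, not because there is
# less uncertainty, but because shrinkage compresses the bootstrap
# distribution itself.
```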

Notably, there *were no* implementations of the Network Lasso in python at the
time. Also I was just starting my career as a data scientist and was eager to
prove myself. So I implemented my own. It used the Alternating Direction Method
of Multipliers, as described in the paper, *A Distributed Algorithm
for Fitting Generalized Additive Models* by E. Chu, A. Keshavarz, and S.
Boyd, to fit generalized additive models with both continuous and
categorical features. It was my first python library!

I called it *gamdist* because it fit Generalized Additive Models in a
DISTributed way. And then when I left Tinder, they agreed to open source it. I continued to
work on it for a bit after I left, but I just wasn’t sure what to *do* with it.

I had learned a lot about Machine Learning and Convex Optimization in my
Master’s Program, but not much about classical statistics. This whole project
made me realize how badly I needed to learn this stuff. I started reading
textbook after textbook (the first one I read was *All of Statistics* by Larry
Wasserman; it’s a good first book!). I slowly developed a mastery of
statistical hypothesis testing, power analysis, and interval estimation.

And recently I’ve been thinking about regularization again. As I wrote in my post on Empirical Bayes, it’s clear to me that regularization improves point estimates. Even in my A/B tests, I think some flavor of regularization (such as Empirical Bayes) is called for. I’m disinclined to change the way I calculate confidence intervals, but improved point estimates are welcome. To be explicit, I used to think if I was using regularization for my point estimates, I also needed to use regularization in my confidence intervals. But I don’t think that’s true. This bypasses the problem I had previously where my models seemed to have too little uncertainty.
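My reading of that idea, as a sketch with made-up numbers and a hypothetical Empirical Bayes prior: shrink the point estimate, but leave the confidence interval computed from the raw, unregularized statistic.

```python
import numpy as np

# Made-up data: 12 conversions out of 100 trials.
successes, trials = 12, 100

# Hypothetical Empirical Bayes prior (in practice, estimated from
# related experiments): prior mean 0.05 with the weight of 50 trials.
prior_mean, prior_strength = 0.05, 50

raw_rate = successes / trials

# Regularized point estimate: a weighted blend of data and prior.
eb_point = (successes + prior_strength * prior_mean) / (trials + prior_strength)

# Unregularized Wald interval around the raw rate: the interval's width
# reflects the data alone, untouched by the shrinkage.
se = np.sqrt(raw_rate * (1 - raw_rate) / trials)
ci = (raw_rate - 1.96 * se, raw_rate + 1.96 * se)
```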

I also think that the sparse flavors of regularization, as discussed in
*Statistical Learning with Sparsity* by Trevor Hastie, Robert Tibshirani, and
Martin Wainwright, are especially valuable as a method of model selection.
This addresses much the same question as a p-value. A p-value is a disciplined
way of answering the question: do I really think this feature is associated
with the response? For many years I was persuaded that a p-value is the One
True Way of answering that question. But in high dimensions, a lasso-type
estimator has strong theoretical guarantees, too. Why shouldn’t that be just
as rigorous as a p-value?
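A minimal illustration of the lasso as a model selector, on synthetic data where only 5 of 50 features matter: instead of testing each coefficient, the ℓ1 penalty zeroes out the features it doesn’t believe in.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data: 5 informative features out of 50.
n, p = 200, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, 2.5, -1.0]
y = X @ beta + rng.normal(scale=1.0, size=n)

fit = Lasso(alpha=0.1).fit(X, y)

# The lasso answers the model-selection question directly: features with
# nonzero coefficients are "in", the rest are zeroed out rather than tested.
selected = np.flatnonzero(fit.coef_)
```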

More recently my work has shifted away from traditional A/B testing. I’m incorporating more techniques from observational causal inference for things like heterogeneous treatment effect estimation (still within the context of A/B testing, just much more interesting than a simple t-test). Model selection is much more important here than in simple A/B tests. So I’m excited to get caught up on the latest developments in regularization and start playing with generalized additive models again!

I’ve started to explore the python ecosystem for these types of models and came across yaglm. I’ve reached out to the author, Iain, who shares my enthusiasm for this type of modeling. I’m hoping to migrate the best parts of gamdist to yaglm when I get a chance.

I spent the last several years building the foundational statistics knowledge to complement the Machine Learning and Convex Optimization I learned in school. But now I’m excited to turn back to some of the modern developments in statistics, especially regarding high-dimensional models, with, of course, a causal interpretation.