Statistics and Machine Learning: Better Together!

I got my master's degree in Electrical Engineering from Stanford in 2013. Electrical Engineering is a big field! At Stanford you have to select one depth area (where you take at least 3 courses) and three breadth areas (where you take at least 2 courses). My depth area was Machine Learning, and I forget what my breadth areas were, but one of them was Dynamic Systems and Optimization, which is a pretty good complement to Machine Learning.

When I started working at Tinder, my first job as a data scientist, I quickly realized there was a lot I still needed to learn about statistics. If there is one thing my career has taught me since then, it’s that there is a lot more to data science than import library; library.fit().

I spent the next several years desperately trying to fill the gaps in my statistics knowledge before anyone spotted me as the impostor I felt like I was! I feel so much more comfortable with statistics today than I did back then: calculating confidence intervals and statistical power, and scrutinizing regression models for violations of their assumptions.

In 2020 I turned back to Machine Learning. It has changed so much since I finished my master's degree! After I was furloughed from Ticketmaster, I completed the Deep Learning Specialization on Coursera and took the first three courses in the Natural Language Processing Specialization (I still hope to complete the last course someday).

I recently applied a Machine Learning technique for a project at Facebook (the technique itself doesn’t matter for this post). My boss asked me if the results were statistically significant. I was actually taken aback by the question!

You see, data scientists who do Machine Learning (I’ll call them MLDS’s to distinguish them from other types of data scientists) don’t really pay any attention to statistical significance. Most of them couldn’t actually define what it really means. They certainly act like it doesn’t matter. Instead, they focus on a related concept called “overfitting”. MLDS’s work with complex models, often with thousands, millions, or even billions of parameters, fitting them on enormous data sets. The risk with such complex models is that, rather than learning some true underlying pattern that can be exploited on new data, they instead fit the haphazard quirks of the training data, which are useless or even counterproductive for future applications. This is called “overfitting”, but I think of it as a type of superstition: some invalid belief based on coincidental past experiences.
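To make that concrete, here is a toy sketch (entirely simulated data, with scikit-learn used for convenience): a high-degree polynomial drives the training error toward zero but does worse than a simple line on fresh data drawn from the same process.

```python
# Toy illustration of overfitting: the flexible model memorizes training noise.
# All data here is simulated purely for illustration.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 30).reshape(-1, 1)
y_train = x_train.ravel() + rng.normal(0, 0.3, 30)       # true pattern: y = x plus noise
x_new = rng.uniform(-1, 1, 1000).reshape(-1, 1)
y_new = x_new.ravel() + rng.normal(0, 0.3, 1000)          # fresh data from the same process

for degree in [1, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    print(f"degree {degree:2d}: "
          f"train MSE = {mean_squared_error(y_train, model.predict(x_train)):.3f}, "
          f"new-data MSE = {mean_squared_error(y_new, model.predict(x_new)):.3f}")
```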

MLDS’s use techniques like regularization to reduce overfitting. These techniques make training performance worse in order to improve future performance. Deciding on the amount and type of regularization is as much art as science, and it’s where MLDS’s spend much of their effort. Once the amount of regularization is decided, performance is assessed on a test set: new data not used to train the model or to decide on the amount of regularization. This strategy ensures we are evaluating performance in a way (hopefully) representative of how the model will actually be used.
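As a rough sketch of that workflow (hypothetical data, with an L2-regularized logistic regression standing in for whatever model an MLDS might actually use): choose the regularization strength on a validation split, and touch the test set only once, at the very end.

```python
# Sketch of the train/validation/test workflow described above.
# X and y are placeholder arrays; the model and metric are stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 20)                  # hypothetical features
y = np.random.randint(0, 2, 1000)             # hypothetical labels

# Hold out a test set that plays no role in fitting or model selection.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

# Try several regularization strengths; keep the one that does best on validation data.
best_model, best_score = None, -np.inf
for C in [0.01, 0.1, 1.0, 10.0]:              # smaller C means stronger L2 regularization
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = accuracy_score(y_val, model.predict(X_val))
    if score > best_score:
        best_model, best_score = model, score

# The test set is used exactly once, to estimate performance on genuinely new data.
print("test accuracy:", accuracy_score(y_test, best_model.predict(X_test)))
```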

But at no point in this process do MLDS’s assess whether their model is statistically significant. One reason is that the standard equations for p-values assume the model is fixed in advance; selecting model details like regularization as part of the fitting process invalidates those equations. I think that’s the number one reason MLDS’s don’t do it: they weren’t taught it, because there’s no equation to teach, so they just evaluate model performance in other ways.

MLDS’s tend to focus on prediction problems anyway, where performance is fairly straightforward to assess. The great thing about prediction problems is that we eventually find out what the real answer is. At that point we can compare our prediction to what actually happened and adjust accordingly. If the predictions are good enough, whether the model is statistically significant is irrelevant.

But I’m starting to see Machine Learning techniques being applied to inference problems, where we are trying to estimate a quantity, rather than making a simple prediction. These are typically low signal-to-noise ratio scenarios, where quantifying the uncertainty associated with the estimate is just as important as the estimate itself. Assessing the performance of these models is then really tricky.

As an example, consider predicting the decisions that a human will make. If you believe in free will (or at least the convincing illusion of free will), then no algorithm can do an especially good job no matter how much data or how many features are realistically available. To use the ML jargon, the Bayes error rate is high. It would be silly to make a prediction like, “this person is going to buy a car when shown this ad”. But it is perfectly sensible to estimate, say, a 0.1% chance that person will buy a car, with a 0.2% chance for someone else. Evaluating these models is more complicated than calculating an F1 score.
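One concrete alternative, sketched below with made-up numbers, is a proper scoring rule like log loss or the Brier score, which rewards probabilities that are sharp and well calibrated rather than asking whether any single prediction was right or wrong.

```python
# Sketch: scoring probabilistic predictions with proper scoring rules
# instead of a hard right/wrong metric like F1. The numbers are made up.
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

y_true = np.array([0, 0, 1, 0, 1, 0, 0, 0])         # did each person buy the car?
p_pred = np.array([0.001, 0.002, 0.15, 0.01,        # the model's predicted probabilities
                   0.30, 0.05, 0.002, 0.08])

# Proper scoring rules are minimized (in expectation) by the true probabilities,
# so a model that says 0.1% for events that happen 0.1% of the time scores well
# on average, even though no single prediction is simply "right" or "wrong".
print("log loss:   ", log_loss(y_true, p_pred))
print("Brier score:", brier_score_loss(y_true, p_pred))
```

Scoring rules are only part of the story, though.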

We have to answer questions like: what do we mean when we say there is a 0.1% chance the person will buy the car? If we make such a prediction, and then the person buys, was our prediction right or wrong? What does it mean for a probabilistic prediction to be right or wrong? These are the questions that statistics attempts to answer. I think it often makes sense to calculate p-values in this context, if only we knew how.

In certain situations, we can calculate p-values. We can split the data in half, perform model selection and fitting on one half and then evaluate performance (including calculating p-values) on the second half. This is really the same way ML performance is already evaluated, on a held-out test set.
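Here is a rough sketch of that sample-splitting recipe (placeholder data, with a gradient boosting model standing in for whatever was actually fit): train on one half, then on the held-out half test whether the model’s scores actually predict the outcome, say via the coefficient in a simple logistic regression.

```python
# Sketch of sample splitting: fit on one half, compute a valid p-value on the other.
# Data is a placeholder; the model and the particular test are stand-ins.
import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X = np.random.rand(2000, 10)                  # placeholder features
y = np.random.randint(0, 2, 2000)             # placeholder binary outcome

X_fit, X_holdout, y_fit, y_holdout = train_test_split(X, y, test_size=0.5, random_state=0)

# All model selection and fitting happens on the first half only.
model = GradientBoostingClassifier().fit(X_fit, y_fit)
scores = model.predict_proba(X_holdout)[:, 1]

# On the second half, test whether the model's scores are associated with the outcome.
# Because this half played no role in fitting, the usual p-value is valid.
logit = sm.Logit(y_holdout, sm.add_constant(scores)).fit(disp=0)
print(logit.summary())   # the p-value on the score coefficient asks: signal or noise?
```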

In the last ten years, there have been some exciting advancements in “post-selection inference” that allow us to calculate p-values on the same data set that was used for fitting models, for certain model types. These methods really are at the intersection of classical statistics and modern machine learning.

Whether it is a classical statistics technique or a modern machine learning technique, whether we are doing prediction or inference, all data scientists must ask themselves if “the answer” they have found is signal or noise. P-values, confidence intervals, checking for overfitting: these are all attempts to address that question. No matter what kind of data scientist you are, you will be well-served by a strong foundation in both areas!

