Distribution of Local Minima in Deep Neural Networks
The “unreasonable effectiveness of deep learning” has been much discussed. The puzzle is that the cost function is non-convex, so any optimization procedure will in general find a local, non-global, minimum.
Actually, algorithms like gradient descent will terminate (perhaps because of early stopping) before even reaching a local minimum. To many experts in optimization, this seems like a bad thing. Concretely, it suggests that the performance of networks trained this way should be much worse than that of other optimization-based systems where we actually can find the global minimum, such as logistic regression.
In practice, deep learning dramatically outperforms logistic regression in many situations, especially where the true relationship between features and labels is too complex to be captured by such simple models. I don’t think anyone has a full understanding of why deep learning works so well, though some parts of the story are clear.
First, optimization algorithms minimize a cost function on the training data, but performance is judged on a separate test set. Even if we magically found the global minimum on the training data, there is no guarantee that the network would have the best performance on the test set. In fact, we often intentionally stop training before even finding a local minimum, as a form of regularization.
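To make “stopping early as regularization” concrete, here is a minimal sketch on a toy linear-regression problem. Everything in it is my own illustrative choice (the synthetic data, the gradient-descent loop, the “patience” rule), not anything specific to deep learning: we monitor a held-out validation loss and deliberately stop before the training loss is fully minimized.

```python
import numpy as np

# Minimal early-stopping sketch on a linear model trained by gradient descent.
# The data, model, and "patience" rule are illustrative assumptions.

rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = X @ true_w + rng.normal(scale=3.0, size=n)   # noisy labels

# Split into training and validation sets.
X_tr, y_tr = X[:150], y[:150]
X_va, y_va = X[150:], y[150:]

w = np.zeros(d)
lr, patience = 1e-3, 10
best_val, best_w, since_best = np.inf, w.copy(), 0

for step in range(10_000):
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)   # gradient of training MSE
    w -= lr * grad
    val = np.mean((X_va @ w - y_va) ** 2)                # validation MSE
    if val < best_val:
        best_val, best_w, since_best = val, w.copy(), 0
    else:
        since_best += 1
    if since_best >= patience:   # stop before the training loss is fully minimized
        break

w = best_w   # roll back to the best weights seen on the validation set
print(f"stopped at step {step}, validation MSE {best_val:.3f}")
```

The point of the sketch is only that the training procedure is deliberately halted by a criterion that has nothing to do with reaching a minimum of the training cost.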
This distinction between training set performance and test set performance isn’t entirely satisfactory, however. In simpler (convex) models like logistic regression, the goal is to find an appropriate kind and amount of regularization so that the global minimum attained during training does in fact provide the best test set performance. We don’t say, “test set performance is different than training set performance so we don’t care about optimizing too much”. We say, “test set performance is different than training set performance, so let’s use regularization to make sure the two problems are as closely aligned as possible”. In other words, the way to build the best model has two ingredients: aligning the optimization problem with test set performance using regularization, and finding the global minimum on the training set. That suggests that if we are using the right kind and amount of regularization, finding the global minimum on the training set really will translate to better test set performance.
Another argument that gets raised is that in high dimensions, most equilibrium points are saddle points, not local minima or maxima. In order for an equilibrium point to be a local minimum (or maximum), all of the eigenvalues of the Hessian there would need to be positive (or negative). As the dimension increases, there are more and more ways for the eigenvalues to have mixed signs, so saddle points become increasingly common. This is offered to explain why second-order methods like Newton’s method don’t work well for deep learning: they get stuck at saddle points, whereas gradient methods seem not to. (This doesn’t seem like a serious problem to me: once at an equilibrium point, try moving in random directions and see if anything improves the cost function; if something does, we have escaped the saddle point. The high computational burden of second-order methods is a more compelling reason to avoid them.)
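Here is a tiny sketch of that picture on a toy two-dimensional cost function of my own choosing (not the cost surface of any neural network): at the equilibrium point the gradient vanishes, the Hessian has one positive and one negative eigenvalue, and a small random perturbation followed by plain gradient descent escapes to a point with lower cost.

```python
import numpy as np

# Toy cost with two local minima at w = (+/-1, 0) and a saddle point at the origin.
# Purely illustrative; not the cost surface of any neural network.
def cost(w):
    return (w[0] ** 2 - 1) ** 2 + w[1] ** 2

def grad(w):
    return np.array([4 * w[0] * (w[0] ** 2 - 1), 2 * w[1]])

# At the origin the gradient vanishes, but the Hessian has mixed-sign eigenvalues,
# so the origin is a saddle point rather than a local minimum or maximum.
hessian_at_origin = np.array([[-4.0, 0.0], [0.0, 2.0]])
print("gradient at origin:", grad(np.zeros(2)))                        # [0. 0.]
print("Hessian eigenvalues:", np.linalg.eigvalsh(hessian_at_origin))   # [-4.  2.]

# A small random perturbation followed by gradient descent escapes the saddle
# and lands in one of the minima, where the cost is lower than at the saddle.
rng = np.random.default_rng(0)
w = 1e-3 * rng.normal(size=2)
for _ in range(200):
    w -= 0.1 * grad(w)
print("cost at saddle:", cost(np.zeros(2)))   # 1.0
print("cost after escaping:", cost(w))        # ~0.0
```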
But this seems like a strange argument, too: “Oh, you think we shouldn’t be satisfied with local minima instead of the global minimum? That’s cute: we couldn’t even find a local minimum if we wanted to, which we don’t, because training set performance is not the same as test set performance!”
A much more compelling explanation would be if the true global minimum were not much lower than the cost at whatever point we happen to stop training. To explore this idea, I started thinking about how many local minima there might be in a deep learning model, and how their cost compares to the global minimum. For example, suppose the number of local minima increases at least quadratically with the number of layers, or hidden units, or training examples, or something like that. Then for modern networks, there are effectively infinitely many local minima (and even more saddle points, of course). Suppose further that most of those local minima have a cost only slightly higher than the global minimum. Then if we pick a random set of initial weights and run gradient descent until convergence, the random local minimum we find will, in all probability, be nearly as good as the global minimum.
Finally, suppose that the cost function tends to be fairly flat in the neighborhood of each local minimum. Then stopping before we reach it leads to performance that is almost as good as the local minimum, which in turn is almost as good as the global minimum. There are three assumptions I’ve made here: there are lots of local minima in realistic deep learning models, most local minima are nearly as good as the global minimum, and the cost function is pretty flat near local minima. If these three things were true, that would explain, to me, why deep learning models work so well.
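I don’t know how to test these assumptions in realistic models, but here is a rough sketch of how one might poke at them on a toy problem. The tiny tanh network, the synthetic data, and every hyperparameter below are my own illustrative choices: train from many random initializations and compare the final training losses (the first two assumptions), then see how much the loss changes under small random perturbations of the weights where we stopped (the third).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression task (illustrative; any small dataset would do).
N, D, H = 100, 5, 8
X = rng.normal(size=(D, N))
y = np.sin(X.sum(axis=0, keepdims=True))          # targets, shape (1, N)

def loss_and_grads(params, X, y):
    """Mean-squared error and its gradients for a one-hidden-layer tanh network."""
    W1, b1, W2, b2 = params
    h = np.tanh(W1 @ X + b1)                      # (H, N)
    err = (W2 @ h + b2) - y                       # (1, N)
    loss = np.mean(err ** 2)
    d_pred = 2 * err / y.shape[1]
    dW2 = d_pred @ h.T
    db2 = d_pred.sum(axis=1, keepdims=True)
    dz1 = (W2.T @ d_pred) * (1 - h ** 2)
    dW1 = dz1 @ X.T
    db1 = dz1.sum(axis=1, keepdims=True)
    return loss, (dW1, db1, dW2, db2)

def train(seed, steps=5000, lr=0.05):
    """Full-batch gradient descent from a random initialization."""
    r = np.random.default_rng(seed)
    params = [0.5 * r.normal(size=s) for s in [(H, D), (H, 1), (1, H), (1, 1)]]
    for _ in range(steps):
        _, grads = loss_and_grads(params, X, y)
        params = [p - lr * g for p, g in zip(params, grads)]
    return loss_and_grads(params, X, y)[0], params

# Assumptions 1 and 2: do different random starts end up at similarly good points?
final_losses, final_params = zip(*(train(seed) for seed in range(20)))
print("final training losses:", np.round(sorted(final_losses), 4))

# Assumption 3: is the loss roughly flat around the point where we stopped?
p = final_params[0]
base = loss_and_grads(p, X, y)[0]
perturbed = [loss_and_grads([w + 1e-2 * rng.normal(size=w.shape) for w in p], X, y)[0]
             for _ in range(10)]
print(f"loss at stopping point {base:.4f}, "
      f"mean loss after small perturbations {np.mean(perturbed):.4f}")
```

Of course, a toy experiment like this says nothing about what happens at the scale of modern networks; it only illustrates the kind of evidence I would want to see.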
I have no idea whether any of those assumptions are true, but I have a feeling people smarter than I am have already thought about these things, so my intention is to find out who else is thinking about “the topology of deep learning”.
Here are some papers I found:
- 2016
- 2018
- 2019