You are absolutely right! The choice of basis is quite important and can affect the ‘implicit bias’ (in this case, the minimum-norm solution). Muthukumar et al. have some interesting discussion of this for minimum-norm solutions in https://arxiv.org/pdf/1903.09139.pdf (Figure 5)

We implemented the normalization you suggested in this notebook https://colab.research.google.com/drive/1g8-oFZspHfKubT-8kNY6UfzLEBzVdtks to see what it looks like. As you pointed out, the function at d=10K does look quite different from the one obtained with the unnormalized basis.

This raises the question: “do operations such as rescaling of features break double descent?” We don’t know for sure, but analysis of the linear regression problem suggests that they do not. Advani & Saxe 2017 https://arxiv.org/pdf/1710.03667.pdf, Mei & Montanari 2019 https://arxiv.org/pdf/1908.05355.pdf (and others) show that the test error diverges to infinity at d=N (in the limit where both d and N approach infinity, under some additional assumptions). This is also visible in this particular example with normalization. As you said, both d=20 and d=10K overfit, but they overfit differently: d=10K is still a (relatively) smooth function plus localized spikes, whereas d=20 behaves poorly more globally, resulting in much higher test error.
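For concreteness, here is a minimal, self-contained sketch of that linear regression picture (a made-up teacher and noise level, not the exact setup of those papers): a minimum-l2-norm least-squares fit on the first d features of a random design, where the test error spikes near the interpolation threshold d = N and descends again past it.

```python
import numpy as np

# Hypothetical toy model: a linear teacher with d_max features,
# observed through only its first d features.
rng = np.random.default_rng(0)
N, d_max, n_test = 20, 40, 2000
w_star = rng.normal(size=d_max)
X_tr = rng.normal(size=(N, d_max))
X_te = rng.normal(size=(n_test, d_max))
y_tr = X_tr @ w_star + 0.5 * rng.normal(size=N)
y_te = X_te @ w_star

errs = {}
for d in range(1, d_max + 1):
    # pinv gives the minimum-l2-norm least-squares solution
    w = np.linalg.pinv(X_tr[:, :d]) @ y_tr
    errs[d] = np.mean((X_te[:, :d] @ w - y_te) ** 2)

# Test error peaks near the interpolation threshold d = N,
# then comes back down in the overparameterized regime.
peak = max(errs[d] for d in range(N - 2, N + 3))
assert peak > errs[d_max]
```

This only illustrates the qualitative shape of the curve; the cited papers make the d=N divergence precise in the proportional limit.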

– Specifying the model involves not only choosing the degree of the polynomial used to fit the data (3 vs. 1000), but also choosing a basis of polynomials, and the choice of basis appears to be very important.

– In particular, the fitted curve is sensitive to how the basis vectors are normalized. While Legendre polynomials are orthogonal, they are not orthonormal: the n-th polynomial has l2 norm sqrt(2/(2*n+1)) (see https://en.wikipedia.org/wiki/Legendre_polynomials). This means that the gradient-descent (or minimum-l2-norm) solution in the standard Legendre basis has a natural preference for lower-degree basis vectors (which have larger norms), and this inherent “regularization” appears to be very important for avoiding overfitting in the high-dimensional case (i.e., once you’re completely interpolating the in-sample data).

– Indeed, if I replace your definition of “G” (the Legendre basis) with an orthonormal version, a seemingly natural alternative:

import numpy as np

def G(x, d):
    # Vandermonde-style matrix of Legendre polynomials up to degree d
    B = np.polynomial.legendre.legvander(x, d)
    # rescale column n by sqrt(2n+1) to equalize the column norms
    B = B * np.sqrt(2 * np.arange(0, B.shape[1], 1) + 1)
    return B

then the result for large degree grossly overfits the in-sample data (although in a very different way from the degree-20 case).
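A small self-contained sketch of both points above (with made-up toy data, not the notebook’s code): it numerically checks the sqrt(2/(2n+1)) norm formula, then shows that the minimum-l2-norm fit in the rescaled basis still interpolates the training data exactly, so the “overfitting” lives entirely off the training points.

```python
import numpy as np

def G(x, d):
    # Rescaled Legendre basis, as in the definition above
    B = np.polynomial.legendre.legvander(x, d)
    return B * np.sqrt(2 * np.arange(B.shape[1]) + 1)

# Check: the n-th Legendre polynomial has l2 norm sqrt(2/(2n+1)) on [-1, 1]
grid = np.linspace(-1.0, 1.0, 200001)
dx = grid[1] - grid[0]
for n in range(6):
    c = np.zeros(n + 1)
    c[n] = 1.0
    P_n = np.polynomial.legendre.legval(grid, c)
    norm = np.sqrt(np.sum(P_n ** 2) * dx)
    assert np.isclose(norm, np.sqrt(2.0 / (2 * n + 1)), atol=1e-3)

# Hypothetical toy data: 20 points, degree-1000 rescaled basis
rng = np.random.default_rng(0)
x_tr = np.linspace(-0.99, 0.99, 20)
y_tr = np.sin(np.pi * x_tr) + 0.1 * rng.normal(size=20)

# lstsq returns the minimum-norm solution of this underdetermined system
w = np.linalg.lstsq(G(x_tr, 1000), y_tr, rcond=None)[0]
assert np.allclose(G(x_tr, 1000) @ w, y_tr, atol=1e-5)  # exact interpolation
```

Plotting the fitted curve on a dense grid between the training points is what reveals the gross overfitting described above.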

Presumably something similar is happening in other models that exhibit “double descent”, but in a way that’s probably much messier to analyze than polynomial interpolation!

I agree. Our understanding of these issues is far better now than 20 years ago. In particular, the question about the effective number of parameters is no longer very relevant. I am not quite sure what Breiman meant by “poor” local minima; maybe those that do not generalize?

It seems that a reasonably complete theoretical analysis is now within reach.

It is perhaps slightly disconcerting that we needed the practical success of deep learning to point us to something which has been there all along.

Misha

While we might not have found answers yet to Breiman’s questions, I do think we now know we should phrase the questions a little differently. For example, the “effective number of parameters” is a useful measure, but it won’t, on its own, be an explanation of generalization performance, in the sense that, no matter how you measure, the large networks are truly over-parameterized. Similarly, I think the question is not really why SGD arrives at a good local minimum (i.e., one that has small loss for the function it is optimizing) but rather why it arrives at a solution that generalizes well; the latter property is correlated with, but not identical to, minimizing the loss function.

Thanks for summarizing some important questions of modern machine learning.

From a historical perspective, one may say that these foundational fissures have been present in machine learning all along (or at least for a long time), yet the practical success of deep learning has forced us to face the real nature of the underlying phenomena.

One historical reference is Leo Breiman’s note “Reflections After Refereeing Papers for NIPS” from 1995, where he discussed the state of understanding of neural networks at the time. In particular, he asked the following questions:

“1. Why don’t heavily parameterized neural networks overfit the data?

2. What is the effective number of parameters?

3. Why does not backpropagation head for a poor local minimum?

4. When should one stop backpropagation and use the current parameters?”

All of these questions seem fresh 24 years later.

A similar set of observations was made about the boosting algorithms. In particular, the paper “Boosting the margin: a new explanation for the effectiveness of voting methods” (https://projecteuclid.org/euclid.aos/1024691352) starts with “One of the surprising recurring phenomena observed in experiments with boosting is that the test error of the generated classifier usually does not increase as its size becomes very large, and often is observed to decrease even after the training error reaches zero.” While the theoretical analysis there does not seem to apply in broader settings (e.g., to interpolation of noisy data), the proposed idea that increasing the complexity of the hypothesis space allows for better choice of classifiers is very insightful.

Thanks to deep learning, it now looks like we can view all of these phenomena in the same light, leading to a hope that a unified theory for supervised learning may be possible.

My optimistic view is that all of these puzzles will be addressed and usefully understood mathematically, once we find the right analytical tools.

Misha
