ECON622: Computational Economics with Data Science Applications

Overparameterization, Implicit Bias, and Double-Descent

Jesse Perla

University of British Columbia

Overview

Summary

  • Given the previous lectures on deep learning, we can now come back to optimization methods and regularization
  • At this point you should be convinced that:
    1. It is possible to fit very flexible approximations with high-dimensional parameters, far more than the number of data points (i.e., overparameterized)
    2. Having flexible functional forms in our \(\mathcal{H}\) lets us use representations which may help us in many ways

Remain Skeptical!

  1. Given our understanding of the classic bias-variance tradeoff

    • What about overfitting?
  2. The ERM process for complicated problems is not globally convex

    • First-order methods only take me to (locally convex) local optima?
    • With overparameterization, are local optima even convex?

Overview

Warning: active, incomplete literature in math, CS, and statistics.

But we can still give some intuition:

  1. First explore why the bias-variance tradeoff breaks down and give a mental model for how we can solve problems in high dimensions
  2. Then explore the optimization process in high dimensions and explain why with enough dimensions (and randomness) it becomes easier
  3. Finally, explain different sources of regularization (implicit and explicit) that occur during optimization, which helps explain why double descent can occur

Double-Descent and the Bias-Variance Tradeoff

Recall the Decomposition of Errors from ERM

  • Without repeating the entire notation, we established our goal as minimizing the expected gap between the best fit within \(\mathcal{H}\) and the ideal over \(\mathcal{F}\): \[ \mathbb{E}_{\mathcal{D} \sim p^*}\left[\min_{f \in \mathcal{H}} R(f, \mathcal{D}) - \min_{f \in \mathcal{F}} R(f, p^*)\right] = \varepsilon_{app}(\mathcal{H}) + \varepsilon_{est}(\mathcal{H}) \]

  • That is, a classic tradeoff between approximation error and estimation error, given some true distribution \(p^*\) and some samples \(\mathcal{D}\) (the two terms are written out below)
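  • For reference, one standard way to write the two terms (notation may differ slightly from the earlier lecture; \(\hat{f}_{\mathcal{D}}\) denotes the ERM solution on the sample) is \[ \varepsilon_{app}(\mathcal{H}) = \min_{f \in \mathcal{H}} R(f, p^*) - \min_{f \in \mathcal{F}} R(f, p^*), \qquad \varepsilon_{est}(\mathcal{H}) = \mathbb{E}_{\mathcal{D} \sim p^*}\left[R(\hat{f}_{\mathcal{D}}, p^*)\right] - \min_{f \in \mathcal{H}} R(f, p^*) \] The first term shrinks as \(\mathcal{H}\) grows richer, while the second typically grows with it for a fixed amount of data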

The Bias-Variance Tradeoff

  • Richer, deeper models enable better representations
    • Can we find them given the massive number of parameters?
    • Or do they overfit, find spurious patterns which fail to generalize?
  • Classic Bias-Variance tradeoff in statistics suggests a sweet spot
    • Too few parameters and you get too much “bias”, high \(\varepsilon_{app}(\mathcal{H})\)
    • As the number of parameters approaches the number of data points, you get too much “variance”, high \(\varepsilon_{est}(\mathcal{H})\)
    • At the “interpolation threshold” it fits perfectly (i.e., the training error is zero) but the generalization is terrible
  • So classic statistics suggests looking for the sweet spot below the interpolation threshold

Classic Statistics is (Often) Wrong!

The solution to overfitting is to keep adding more parameters

  • The evaluation criterion here is the correct one: generalization performance
  • Your intuition should be that this requires some sort of regularization
  • In most of these methods you will “interpolate” the training data, achieving zero training loss (see the simulation sketch below)
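  • A minimal simulation sketch of this claim (the sizes, data-generating function, and random-feature construction are illustrative choices, not from the lecture): fit random Fourier features by min-norm least squares and watch the test error spike near the interpolation threshold and then fall again as the parameter count keeps growing

```python
# Illustrative double-descent sketch: min-norm least squares on random
# Fourier features.  Training error hits ~0 once P >= N (interpolation);
# test error typically peaks near P ~ N and then descends again.
import numpy as np

rng = np.random.default_rng(0)
N, N_test = 40, 500
x = rng.uniform(-1, 1, size=(N, 1))
x_test = rng.uniform(-1, 1, size=(N_test, 1))
f_true = lambda z: np.sin(4 * z).ravel()        # assumed "true" function
y = f_true(x) + 0.1 * rng.standard_normal(N)    # noisy training labels
y_test = f_true(x_test)

def features(z, W, b):
    return np.cos(z @ W + b)                    # random Fourier features

for P in [5, 20, 40, 80, 400, 2000]:            # P = number of parameters
    W = 4.0 * rng.standard_normal((1, P))
    b = rng.uniform(0, 2 * np.pi, size=P)
    Phi, Phi_test = features(x, W, b), features(x_test, W, b)
    theta = np.linalg.pinv(Phi) @ y             # min-norm (ridgeless) solution
    train_mse = np.mean((Phi @ theta - y) ** 2)
    test_mse = np.mean((Phi_test @ theta - y_test) ** 2)
    print(f"P={P:5d}  train_mse={train_mse:.2e}  test_mse={test_mse:.3f}")
```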

Some References

Classic Statistics: Single Descent

from https://arxiv.org/abs/1812.11118

Double Descent

from https://arxiv.org/abs/1812.11118

Example Fitting Small Number of Data Points

from https://arxiv.org/pdf/2105.14368.pdf

Example with Increasing “Width” of Neural Network

from https://arxiv.org/pdf/1912.02292.pdf

Overparameterized Models Interpolate

  • One interpretation is that ML optimization methods with sufficient parameters find a min-norm interpolating solution when minimizing the ERM objective \[ \min_{f \in \mathcal{H}} \frac{1}{N}\sum_{n=1}^N \ell(f, x_n, y_n) \]

  • With large enough \(\mathcal{H}\) this interpolates, i.e. \(\ell(f, x_n, y_n) = 0\) for all \(n\)

  • Remember underdetermined LLS and ridgeless regression (see the sketch below)

    • This did not require enormous amounts of data
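  • A minimal sketch of the interpolation point above (sizes are illustrative): with more parameters than data points the linear system is underdetermined, and the pseudoinverse returns the minimum-norm interpolating (ridgeless) solution

```python
# Underdetermined linear least squares: P > N means infinitely many exact
# fits exist; np.linalg.pinv picks the minimum-norm one (ridgeless).
import numpy as np

rng = np.random.default_rng(0)
N, P = 10, 50                        # N data points, P > N parameters
X = rng.standard_normal((N, P))
y = rng.standard_normal(N)

theta = np.linalg.pinv(X) @ y        # min-norm interpolating solution
print("training mse:", np.mean((X @ theta - y) ** 2))    # ~0: interpolates
print("parameter norm:", np.linalg.norm(theta))          # smallest among all interpolators
```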

Min-Norm Interpretation

  • With first-order optimizers, we can often interpret the solution as solving

\[ \begin{aligned} \min_{f \in \mathcal{H}} &||f||_S\\ &\text{ s.t. } \ell(f, x, y) = 0,\,\text{ for all } (x, y) \in \mathcal{D} \end{aligned} \]

  • Where \(S\) is some (semi-)norm we generally cannot characterize
    • In some cases we can prove this is the Sobolev \(W^{1,2}\) (semi-)norm (e.g., with LLS)
    • In practice, the implicit norm selects smooth functions (Occam’s Razor); see the gradient-descent sketch below for the LLS case
  • Whether these are the correct (i.e., generalizable) solutions depends on whether the \(S\) aligns with the properties of the true \(p^*(x,y)\)
    • Having lots of data is neither necessary nor sufficient
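  • The LLS case is one where the implicit bias can be characterized exactly; a rough sketch (step size and iteration count are illustrative) showing that gradient descent started from zero on an underdetermined least-squares problem converges to the minimum Euclidean-norm (ridgeless) coefficient vector

```python
# Implicit bias of gradient descent on underdetermined least squares:
# starting from zero, GD stays in the row space of X and converges to the
# minimum-norm interpolating solution (the same one pinv returns).
import numpy as np

rng = np.random.default_rng(1)
N, P = 20, 100                          # fewer data points than parameters
X = rng.standard_normal((N, P))
y = rng.standard_normal(N)

theta = np.zeros(P)                     # zero initialization matters here
lr = 1e-2
for _ in range(50_000):
    grad = X.T @ (X @ theta - y) / N    # gradient of (1/2N)||X theta - y||^2
    theta -= lr * grad

theta_min_norm = np.linalg.pinv(X) @ y  # ridgeless / min-norm solution
print("training residual:", np.linalg.norm(X @ theta - y))                  # ~0
print("gap to min-norm solution:", np.linalg.norm(theta - theta_min_norm))  # ~0
```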

Loss Landscapes in Overparameterized Models

Overparameterized ERM Has Many Global Minima

See https://www.cs.ubc.ca/~schmidtm/Courses/440-W22/L7.pdf

  • With overparameterization, there are many global minima
  • In high dimensions the global minima become increasingly connected

Not All Global Minima are Alike

See https://www.cs.ubc.ca/~schmidtm/Courses/440-W22/L7.pdf

  • Some parameter configurations generalize better than others (as measured by test error)
  • Intuition: minima surrounded by a larger “volume” of parameters with low test error tend to generalize better

Flat vs. Sharp Geometry

from ProbML Book 1 Section 13.5.6

  • We often think of flatter geometry near minima as worse (e.g., not identified)
  • With overparameterization this is the wrong intuition.
    • Flat minima generalize better because they have similar loss over a larger volume of parameter space, making them less dependent on individual data points (see the toy sketch below)
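  • A toy sketch of the volume intuition (the loss function and perturbation scale are made up for illustration): both minima below have zero loss, but averaging the loss over small parameter perturbations shows the flat one is far less sensitive

```python
# Two zero-loss minima in one dimension: a sharp basin at w = -4 and a flat
# basin at w = +4.  Perturbing the parameter and averaging the loss shows
# the flat minimum's loss barely moves -- the "volume" argument for why
# flat minima are less dependent on the particular data.
import numpy as np

def loss(w):
    # sharp basin (curvature 10) vs. flat basin (curvature 0.4)
    return np.minimum(5.0 * (w + 4.0) ** 2, 0.2 * (w - 4.0) ** 2)

rng = np.random.default_rng(0)
eps = 0.3 * rng.standard_normal(100_000)      # small random perturbations
print("sharp minimum, avg perturbed loss:", loss(-4.0 + eps).mean())  # ~ 5.0 * 0.3**2
print("flat minimum,  avg perturbed loss:", loss(+4.0 + eps).mean())  # ~ 0.2 * 0.3**2
```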

SGD Finds Flat Minima

from ProbML Book 1 Section 13.5.6

  • SGD finds more stable (flatter) local minima, less sensitive to the underlying data

Overparameterized Loss Landscapes

from https://arxiv.org/pdf/2003.00307.pdf

  • Intuition of optimization in low dimensions often breaks down in high dimensions and with overparameterization
  • Many classic examples where adding in more parameters makes optimization easier (e.g., more degrees of freedom to move)

Loose Intuition from this Literature

  • Warning: this literature is a work in progress, and hard to prove things formally for general problems
  • However, some loose intuition comes from Loss landscapes and optimization in over-parameterized non-linear systems and neural networks, among others
  • As you move from low-dimensional, underparameterized to high-dimensional overparameterized problems:
    • All local minima increasingly become global minima (i.e., interpolation)
    • Global minima are increasingly connected
    • Nothing is convex, but increasingly the loss seems to fulfill a related condition (\(PL^*\), a local Polyak-Łojasiewicz condition), which is all we need (sketched below)
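  • As a rough reference (conventions on constants vary across papers), the \(PL^*\) condition says a non-negative loss \(L\) satisfies \(\mu\)-\(PL^*\) on a set \(S\) if \[ \|\nabla L(w)\|^2 \geq \mu\, L(w) \quad \text{for all } w \in S \] This does not require convexity, but with smoothness and a small enough step size \(\eta\) it is enough for gradient descent to drive the loss to zero geometrically, roughly \(L(w_t) \leq (1 - \eta \mu / 2)^t L(w_0)\), i.e., to reach an interpolating global minimum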

Interpretations on the Importance of Flatness

  • Bayesian Interpretations:
    • Flat corresponds to more posterior uncertainty (see ProbML Book 1 Section 13.5.6 and ProbML Book 2 Section 17.4.1)
    • Larger region in parameter space with similar loss
    • Less sensitive to small changes in the data (i.e., overfitting)

Regularization

Implicit/Inductive Regularization

  • The theme of much of this literature is on the idea of “hidden/implicit/inductive” bias/regularization that occurs during optimization of ML models
  • We saw this already in the LLS section where we showed how underdetermined LLS solved with most algorithms ends up at the ridgeless regression solution
  • This does not mean that implicit regularization is always enough, but it helps us understand its role (and whether it fights against us).

Sources of Regularization

  • We saw this with plain gradient descent on LLS, so it is not specific to nonlinearity, large parameter counts, or stochastic optimization
  • SGD itself has regularizing effects
  • Sometimes manual regularization is needed
    • Early stopping, weight decay (L2 regularization), etc. (a small sketch follows this list)
  • See ProbML Book 1 Section 13.5.6
  • See Mark Schmidt’s CPSC440 Notes on Double-Descent Curves for more on regularization
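  • A minimal sketch of the two manual regularizers just listed (the penalty weight, step size, and iteration cap are illustrative, not tuned), reusing the underdetermined least-squares setup from earlier

```python
# Explicit regularization on an underdetermined least-squares problem:
#   - weight decay: add lam * theta to the gradient (an L2 / ridge penalty)
#   - early stopping: halt the loop well before the training loss reaches zero
import numpy as np

rng = np.random.default_rng(2)
N, P = 20, 100
X = rng.standard_normal((N, P))
y = rng.standard_normal(N)

theta, lr, lam = np.zeros(P), 1e-2, 0.1
for t in range(500):                                # early stopping: only 500 steps
    grad = X.T @ (X @ theta - y) / N + lam * theta  # weight-decay term in the gradient
    theta -= lr * grad

print("training residual (no longer ~0):", np.linalg.norm(X @ theta - y))
print("parameter norm (shrunk by the penalty):", np.linalg.norm(theta))
```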

Stochastic Gradients

  • See ProbML Book 1 Section 13.5.6
  • As we saw before, randomness in SGD and its variants can help us find flatter minima
  • The key is that it is the right sort of noise, which helps us avoid sharp minima and concentrate on flatter minima (the sketch below shows what this noise looks like)
  • This is true even if all local minima are global minima and connected
    • The connected minima might even make it easier to traverse to the flatter regions
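  • A small sketch of what the SGD noise is (problem size and batch size are illustrative): minibatch gradients are unbiased but noisy estimates of the full gradient, and that noise is the ingredient that perturbs iterates away from sharp minima

```python
# Minibatch gradients on a least-squares loss: unbiased around the full
# gradient but with substantial per-batch noise.
import numpy as np

rng = np.random.default_rng(3)
N, P, B = 200, 50, 10
X = rng.standard_normal((N, P))
y = rng.standard_normal(N)
theta = rng.standard_normal(P)                   # some current iterate

full_grad = X.T @ (X @ theta - y) / N
batch_grads = []
for _ in range(1000):
    idx = rng.choice(N, size=B, replace=False)   # random minibatch
    batch_grads.append(X[idx].T @ (X[idx] @ theta - y[idx]) / B)
batch_grads = np.array(batch_grads)

print("mean minibatch gradient vs. full gradient:",
      np.linalg.norm(batch_grads.mean(axis=0) - full_grad))
print("typical size of the minibatch noise:",
      np.linalg.norm(batch_grads - full_grad, axis=1).mean())
```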

No Free Lunch!

Choose Algorithms for the Problem Class

  • It is worth ending with a restatement of the no free lunch theorems
    • No optimization algorithm is universally superior; any algorithm’s performance is problem-dependent, such that its average performance across all possible problems is the same as that of any other algorithm.
  • In practice, this means that we will need to carefully choose the optimization methods, hypothesis classes, and regularization with a particular problem in mind by understanding the problem class well
  • Luckily, many problems in economics seem very similar, so we can use priors to choose methods, and hyperparameter optimization (HPO) software lets us experiment