Overparameterization, Implicit Bias, and Double-Descent
Given our understanding of the classic bias-variance tradeoff, where does overparameterization fit in?
For complicated problems the ERM objective is not globally convex, so the classical analysis does not carry over directly
Warning: active, incomplete literature in math, CS, and statistics.
But we can still give some intuition:
Without repeating the entire notation, we established our goal as minimizing the gap between what ERM over \(\mathcal{H}\) achieves from samples \(\mathcal{D}\) and the ideal risk over all functions \(\mathcal{F}\): \[ \mathbb{E}_{\mathcal{D} \sim p^*}\left[\min_{f \in \mathcal{H}} R(f, \mathcal{D}) - \min_{f \in \mathcal{F}} R(f, p^*)\right] = \varepsilon_{app}(\mathcal{H}) + \varepsilon_{est}(\mathcal{H}) \]
That is, a classic tradeoff between approximation error and estimation error given some true distribution \(p^*\) and some samples \(\mathcal{D}\)
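Spelling the two terms out in the same notation (a standard split, written so that the two pieces sum to the left-hand side above): \[ \varepsilon_{app}(\mathcal{H}) = \min_{f \in \mathcal{H}} R(f, p^*) - \min_{f \in \mathcal{F}} R(f, p^*), \qquad \varepsilon_{est}(\mathcal{H}) = \mathbb{E}_{\mathcal{D} \sim p^*}\left[\min_{f \in \mathcal{H}} R(f, \mathcal{D})\right] - \min_{f \in \mathcal{H}} R(f, p^*) \] The first term depends only on how expressive \(\mathcal{H}\) is; the second captures the cost of working from finite samples \(\mathcal{D}\) rather than from \(p^*\) itself.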
Counterintuitively, in practice the "solution" to overfitting is often to keep adding more parameters: past the interpolation threshold, test error can start descending again (double descent), as in the figures below
[Figures from https://arxiv.org/abs/1812.11118 (Belkin et al., "Reconciling modern machine learning practice and the bias-variance trade-off")]
[Figure from https://arxiv.org/pdf/2105.14368.pdf (Belkin, "Fit without fear")]
[Figure from https://arxiv.org/pdf/1912.02292.pdf (Nakkiran et al., "Deep Double Descent")]
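The double-descent behaviour in the figures above can be reproduced in a few lines. The sketch below is illustrative only (the data sizes, the random-ReLU feature map, the \(\sin\) target, and the noise level are all assumptions, not taken from the slides or the cited papers): it fits a minimum-norm least-squares model on a growing number of random features and prints train/test error, which typically dips, spikes near the interpolation threshold \(p \approx N_{train}\), and then descends again.

```python
# Illustrative double-descent sketch; all sizes and the feature map are assumptions.
import numpy as np

rng = np.random.default_rng(0)
N_train, N_test, d = 40, 500, 5

def make_data(n):
    X = rng.normal(size=(n, d))
    y = np.sin(X @ np.ones(d)) + 0.1 * rng.normal(size=n)  # smooth target + label noise
    return X, y

X_tr, y_tr = make_data(N_train)
X_te, y_te = make_data(N_test)

W = rng.normal(size=(d, 2000))            # one fixed random projection, shared across widths

def features(X, p):
    return np.maximum(X @ W[:, :p], 0.0)  # first p random ReLU features

print(f"{'p':>6} {'train MSE':>12} {'test MSE':>12}")
for p in [5, 10, 20, 35, 40, 45, 60, 100, 400, 2000]:
    Phi_tr, Phi_te = features(X_tr, p), features(X_te, p)
    # pinv returns the minimum-norm least-squares solution, so for p > N_train
    # this is exactly the min-norm interpolator discussed below.
    w = np.linalg.pinv(Phi_tr) @ y_tr
    tr = np.mean((Phi_tr @ w - y_tr) ** 2)
    te = np.mean((Phi_te @ w - y_te) ** 2)
    print(f"{p:>6d} {tr:>12.4f} {te:>12.4f}")
# Typical pattern: test MSE dips, peaks near p ~ N_train (the interpolation
# threshold), then falls again in the heavily overparameterized regime.
```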
One interpretation is that, given sufficiently many parameters, standard ML optimization methods find a minimum-norm interpolating solution. Minimizing the empirical risk: \[ \min_{f \in \mathcal{H}} \frac{1}{N}\sum_{n=1}^N \ell(f, x_n, y_n) \]
With a large enough \(\mathcal{H}\), the minimizer interpolates the training data, i.e. \(\ell(f, x_n, y_n) = 0\) for all \(n\)
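A quick sanity check of the interpolation claim (a sketch with assumed sizes and a generic Gaussian feature matrix, not from the slides): once the number of features reaches the number of training points, least squares drives the training loss to numerical zero, even when the labels are pure noise.

```python
# Minimal interpolation check; sizes and features are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(0)
N = 30
y = rng.normal(size=N)                 # random labels: nothing real to learn

for p in [5, 15, 30, 60, 120]:
    Phi = rng.normal(size=(N, p))      # generic feature matrix with p features
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    print(f"p={p:4d}  train MSE={np.mean((Phi @ w - y) ** 2):.2e}")
# Once p >= N the train MSE is ~1e-30: the fit interpolates the training data.
```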
Recall underdetermined LLS and ridgeless regression: among all interpolating solutions, the one of minimum norm is selected:
\[ \begin{aligned} \min_{f \in \mathcal{H}}\;& \|f\|_S\\ \text{s.t. }& \ell(f, x, y) = 0 \quad \text{for all } (x, y) \in \mathcal{D} \end{aligned} \]
See https://www.cs.ubc.ca/~schmidtm/Courses/440-W22/L7.pdf
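A minimal numerical check of this picture (assumed setup, not taken from the linked notes): in an underdetermined linear problem the pseudoinverse produces an interpolating solution of minimum \(\ell_2\) norm, and the ridge solution converges to it as the regularization strength goes to zero, which is why it is called "ridgeless" regression.

```python
# Min-norm interpolation as the ridgeless limit; sizes here are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(1)
N, p = 20, 100                          # fewer equations than unknowns (underdetermined)
X = rng.normal(size=(N, p))
y = rng.normal(size=N)

# Minimum-ell2-norm interpolator: w = X^+ y
w_min_norm = np.linalg.pinv(X) @ y
print("train residual:", np.linalg.norm(X @ w_min_norm - y))      # ~0: interpolation

# Ridge solutions w_lam = argmin ||Xw - y||^2 + lam ||w||^2 approach the
# min-norm interpolator as lam -> 0.
for lam in [1.0, 1e-2, 1e-4, 1e-6]:
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    print(f"lam={lam:8.0e}  ||w_ridge - w_min_norm|| = {np.linalg.norm(w_ridge - w_min_norm):.2e}")
```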
[Figures from ProbML Book 1, Section 13.5.6]
[Figure from https://arxiv.org/pdf/2003.00307.pdf]