Min-Norm Solutions and Sobolev Spaces
Min-Norm Solution
\[
\min_x \| x \|_2^2 \quad \text{s.t.}\, Ax = b
\]
- Or, can think of as solving the “ridgeless regression”
\[
\lim_{\lambda \to 0}\left[\min_x \| Ax -b \|_2^2 + \lambda \| x \|_2^2\right]
\]
- Will become crucial in deep learning where the number of parameters \(\gg\) data
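A minimal numerical sketch, assuming NumPy (not prescribed by these notes): with more unknowns than equations there are infinitely many interpolating solutions, and SVD-based solvers return the one with the smallest Euclidean norm.
```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 10))   # 3 equations, 10 unknowns: underdetermined
b = rng.normal(size=3)

x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]  # SVD-based least squares
x_pinv = np.linalg.pinv(A) @ b                  # Moore-Penrose pseudoinverse

assert np.allclose(A @ x_lstsq, b)              # it interpolates the data exactly
assert np.allclose(x_lstsq, x_pinv)             # and matches the min-norm solution

# any other solution (min-norm plus a nullspace direction) is strictly longer
x_other = x_lstsq + np.linalg.svd(A)[2][-1]     # last right singular vector is in null(A)
assert np.allclose(A @ x_other, b)
print(np.linalg.norm(x_lstsq), np.linalg.norm(x_other))
```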
Algorithms + Optimization Problems
- A key requirement to make the switch to ML thinking is to remember that just seeing the optimization problem, e.g. \(\max_{x} f(x)\) may not be enough
- The algorithm itself will be important if there is multiplicity of solutions, if things are not numerically stable, etc.
- In the case above, we saw that using QR decomposition delivered the min norm
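One way a QR factorization can deliver the min-norm solution, sketched assuming NumPy/SciPy: for an underdetermined system with full row rank, take a thin QR of \(A^\top\); the min-norm solution lies in the row space of \(A\), i.e., the range of \(Q\).
```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 8))                 # full row rank, underdetermined
b = rng.normal(size=4)

Q, R = np.linalg.qr(A.T)                    # thin QR: Q is 8x4, R is 4x4
y = solve_triangular(R.T, b, lower=True)    # solve R' y = b
x_qr = Q @ y                                # min-norm solution lies in range(A') = range(Q)

assert np.allclose(A @ x_qr, b)
assert np.allclose(x_qr, np.linalg.pinv(A) @ b)
```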
Linear Operators, not Matrices
- Recall: for \(x \in \mathbb{R}^N\) we should think of a \(f(x) = A x\) for \(A\in\mathbb{R}^{M \times N}\) as a linear transformation from \(\mathbb{R}^N\) to \(\mathbb{R}^M\)
- Definition of Linear: \(f(a x_1 + b x_2) = a f(x_1) + b f(x_2)\) for scalar \(a,b\)
- Many algorithms might be implementable using only the matrix-vector product \(v \mapsto A v\) and the transposed product \(w \mapsto A^\top w\)
- Maybe we don’t actually need to create a matrix? Can compose operations together?
- This will be related to a lot of ML algorithms and autodifferentiation. Hint: Jacobian \(\nabla f(x) = A\)
- The key to iterative methods will be the spectral properties of the Jacobian, which is related to the eigenvalues of \(A\)
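A sketch, assuming SciPy: many iterative solvers only need the two actions \(v \mapsto A v\) and \(w \mapsto A^\top w\), so we can pass a matrix-free `LinearOperator` and never form or store the matrix.
```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, lsqr

n = 1_000

def matvec(v):     # apply a scaled first-difference operator
    return 2.0 * np.diff(v, prepend=0.0)

def rmatvec(w):    # apply its transpose
    return -np.diff(2.0 * w, append=0.0)

A = LinearOperator((n, n), matvec=matvec, rmatvec=rmatvec)
b = np.random.default_rng(2).normal(size=n)

x = lsqr(A, b)[0]  # iterative least squares using only the two products
print(np.linalg.norm(A.matvec(x) - b))
```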
\(L^p\) space
Let \(\Omega\) be an open subset of \(\mathbb{R}^n\).
\(L^p(\Omega)\) space: A function \(f: \Omega \rightarrow \mathbb{R}\) is in \(L^p(\Omega)\) if:
\[
\int_\Omega |f|^p \, dx < \infty
\]
- Useful in a lot of cases, but will be especially important when considering norms on a function space and whether a particular function solves a particular problem
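A small numerical illustration, assuming SciPy: \(f(x) = x^{-1/2}\) on \(\Omega = (0,1)\) is in \(L^1(\Omega)\) but not in \(L^2(\Omega)\), since \(\int_\Omega |f|^2 \, dx\) diverges at the origin.
```python
import numpy as np
from scipy.integrate import quad

f = lambda x: 1.0 / np.sqrt(x)

int_p1, _ = quad(f, 0.0, 1.0)                       # finite: equals 2, so f is in L^1
int_p2, _ = quad(lambda x: f(x) ** 2, 1e-12, 1.0)   # grows without bound as the cutoff shrinks
print(int_p1, int_p2)
```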
Sobolev Space
A function \(u\) belongs to the Sobolev space \(W^{k,p}(\Omega)\) if:
\[
u \in L^p(\Omega)
\]
and all its weak derivatives up to order \(k\) are also in \(L^p(\Omega)\).
A function \(v\) is said to be the \(\alpha\)-th weak derivative of \(u\), written \(D^\alpha u = v\), if:
\[
\int_\Omega u\, D^\alpha \phi \, dx = (-1)^{|\alpha|} \int_\Omega v\, \phi \, dx
\]
for all smooth test functions \(\phi\) with compact support in \(\Omega\).
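A numerical sanity check, assuming SciPy: \(v(x) = \operatorname{sign}(x)\) acts as the weak derivative of \(u(x) = |x|\) on \(\Omega = (-1, 1)\), since the integration-by-parts identity holds against a smooth test function that vanishes at the boundary.
```python
import numpy as np
from scipy.integrate import quad

u = lambda x: abs(x)
v = lambda x: np.sign(x)                   # candidate weak derivative of u
phi = lambda x: x * (1 - x**2) ** 2        # test function, zero at x = -1 and x = 1
dphi = lambda x: 1 - 6 * x**2 + 5 * x**4   # its derivative

lhs, _ = quad(lambda x: u(x) * dphi(x), -1, 1)    # integral of u * phi'
rhs, _ = quad(lambda x: -v(x) * phi(x), -1, 1)    # -(integral of v * phi)
print(lhs, rhs)                            # both equal -1/3
```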
Sobolev Norm
The Sobolev Norm for a function \(u \in W^{k,p}(\Omega)\) is defined as some variation on:
\[
\| u \|_{W^{k,p}(\Omega)} = \left( \sum_{|\alpha| \leq k} \int_\Omega |D^\alpha u|^p \, dx \right)^{1/p}
\]
where \(\alpha\) is a multi-index and \(D^\alpha u\) represents the weak derivative of \(u\).
Different variations of the norm keep different subsets of the multi-indices \(\alpha\) in the sum (see the sketch below)
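For concreteness, a sketch assuming SciPy: the \(W^{1,2}(0,1)\) norm of \(u(x) = \sin(\pi x)\), computed by quadrature over the \(|\alpha| = 0\) and \(|\alpha| = 1\) terms, alongside the variation that keeps only the derivative term.
```python
import numpy as np
from scipy.integrate import quad

u = lambda x: np.sin(np.pi * x)
du = lambda x: np.pi * np.cos(np.pi * x)

term0, _ = quad(lambda x: u(x) ** 2, 0, 1)    # |alpha| = 0 term, equals 1/2
term1, _ = quad(lambda x: du(x) ** 2, 0, 1)   # |alpha| = 1 term, equals pi^2 / 2
print(np.sqrt(term0 + term1))                 # full W^{1,2} norm
print(np.sqrt(term1))                         # keep only the derivative term (a semi-norm)
```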
A Key Sobolev Semi-Norm
The key one to keep in mind is \(W^{1,2}(\Omega)\), the space of functions whose values and first derivatives are square-integrable. The corresponding semi-norm keeps only the gradient term:
\[
\| u \|_{W^{1,2}(\Omega)} = \left( \int_\Omega |\nabla u|^2 \, dx \right)^{1/2}
\]
- Note that we have the option to include or not include the \(|u|^2\) term itself when we define a particular norm
- This is only a semi-norm because it is positive semi-definite: there are multiple \(u\) (e.g., any constant function) with \(\|u\| = 0\)
- Semi-norms serve two key purposes: they establish equivalence classes, and they provide a way to control length, which will come up with algorithms
Sobolev Semi-Norms and Equivalence Classes
- A Sobolev semi-norm \(S\) defines equivalence classes of functions
- i.e., any \(u_1\) and \(u_2\) such that \(\|u_1 - u_2\|_S = 0\) are in the same equivalence class
- In general, when we move to nonlinear and highly parameterized models there will be many solutions that are equivalent
- But if they are in approximately the same equivalence class, then who cares?
- Multiplicity of “parameters” doesn’t really matter if the functions do the same thing
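A tiny check, assuming NumPy: \(u_2 = u_1 + 3\) differs from \(u_1\), yet the gradient semi-norm of the difference is zero, so the two functions sit in the same equivalence class.
```python
import numpy as np

x = np.linspace(0, 1, 1001)
u1 = np.sin(x)
u2 = np.sin(x) + 3.0                      # same shape, shifted by a constant

d = u2 - u1
grad_d = np.gradient(d, x)                # numerical derivative of the difference
print(np.sqrt(np.mean(grad_d ** 2)))      # ~0: zero distance in the gradient semi-norm
print(np.sqrt(np.mean(d ** 2)))           # 3: nonzero distance in the plain L^2 norm
```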
Sobolev Semi-Norms and Occam’s Razor
- The other purpose is to give some sense of length, \(\|u_1\|_S < \|u_2\|_S\)
- This will come up with regularization since we may want to bias algorithms towards particular functions or interpret inherent bias in the algorithms
- The key interpretation here is that for Sobolev Norms we can think of variations on the \(W^{1,2}\) as determining how simple a function is
- If \(\|u_1\|_S < \|u_2\|_S\) then \(u_1\) has smaller gradients and fewer “wiggles”
- If both interpolate the same data, then we should prefer the simpler one. Occam’s Razor
- We won’t always be able to know the precise semi-norm when working with ML, but this is useful intuition
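A sketch of that intuition, assuming NumPy/SciPy: two functions interpolate the same data, but the “wiggly” one has a far larger \(W^{1,2}\) semi-norm, so Occam’s Razor prefers the flatter one.
```python
import numpy as np
from scipy.integrate import quad

xs = np.linspace(0, 1, 5)                             # data locations
u1 = lambda x: 2 * x                                  # flat interpolant of y = 2x
u2 = lambda x: 2 * x + np.sin(8 * np.pi * x)          # also hits the data exactly
assert np.allclose(u1(xs), u2(xs))                    # same values at every data point

du1 = lambda x: 2.0
du2 = lambda x: 2.0 + 8 * np.pi * np.cos(8 * np.pi * x)

s1 = np.sqrt(quad(lambda x: du1(x) ** 2, 0, 1)[0])    # semi-norm of the flat interpolant
s2 = np.sqrt(quad(lambda x: du2(x) ** 2, 0, 1)[0])    # semi-norm of the wiggly one
print(s1, s2)                                         # s2 is far larger
```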
Sobolev Norms and Linear Functions
- Now, think of linear functions \(f(x) = \beta x\) where \(\beta\) is a vector, matrix, or scalar
- Consider this on a bounded domain \(\Omega\) so the integrals are well defined
- Then our \(W^{1,2}\) semi-norm here is simple:
\[
\|f\|_{W^{1,2}} = \|\beta\|_2
\]
where \(\|\beta\|_2\) is the Euclidean norm of the vector (or the Frobenius norm of the matrix), up to a constant that depends only on the volume of \(\Omega\)
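A quick check, assuming NumPy: for \(f(x) = \beta \cdot x\) on the unit cube \(\Omega = (0,1)^4\), the gradient is the constant vector \(\beta\), so a Monte Carlo estimate of the semi-norm recovers \(\|\beta\|_2\) (here the volume of \(\Omega\) is 1).
```python
import numpy as np

rng = np.random.default_rng(3)
beta = rng.normal(size=4)
f = lambda x: x @ beta

xs = rng.uniform(size=(10_000, 4))        # random points in the unit cube
eps = 1e-6                                # finite-difference step
grads = np.stack([(f(xs + eps * np.eye(4)[j]) - f(xs)) / eps for j in range(4)], axis=1)

seminorm = np.sqrt(np.mean(np.sum(grads ** 2, axis=1)))  # Monte Carlo estimate of the integral
print(seminorm, np.linalg.norm(beta))                    # the two agree
```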
Back to the Ridgeless Regression
Now let’s reinterpret our “ridgeless regression”:
\[
\lim_{\lambda \to 0}\left[\min_{\beta} \| A\beta -b \|_2^2 + \lambda \| \beta \|_2^2\right]
\]
- This says that we are “penalizing” the norm of \(f(x) = \beta x\) in the \(W^{1,2}\) sense
- The limit \(\lambda \to 0\) means we are asymptotically dropping this penalty, but there is still this “bias” which makes solutions unique
- Normally unique only up to an equivalence class within \(W^{1,2}\), but with linear functions the solution itself is unique. Why?
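A sketch, assuming NumPy: the ridge solution converges to the min-norm (pseudoinverse) solution as \(\lambda \to 0\).
```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(5, 12))         # more parameters than data points
b = rng.normal(size=5)

beta_minnorm = np.linalg.pinv(A) @ b
for lam in [1e-1, 1e-4, 1e-8]:
    beta_ridge = np.linalg.solve(A.T @ A + lam * np.eye(12), A.T @ b)
    print(lam, np.linalg.norm(beta_ridge - beta_minnorm))   # distance shrinks with lambda
```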
Min-Norm Solution as Occam’s Razor
Recall that this is also equivalent to
\[
\min_{\beta} \| \beta \|_2^2 \quad \text{s.t.}\, A\beta = b
\]
- i.e., we are finding the minimum norm solution that interpolates our data
- And we can interpret the minimum norm through Occam’s Razor
- This general principle will apply when we think about nonlinear approximations as well, though we don’t need to fully interpolate (i.e., if \(\lambda > 0\) then we don’t need to interpolate perfectly)
- Suggests the crucial role of regularization. Discuss!
Singular Linear Systems of Equations
This isn’t just least squares
- Consider the case of finding solutions to \(A \beta = b\) where \(A\) is singular. Either no solution exists or infinitely many do
- If you solve such a linear system with the SVD or iterative methods, you still get an answer: the min-norm solution
\[
\begin{aligned}
\min_{\beta} & \| \beta \|_2^2\\
&\text{s.t.}\, A\beta = b
\end{aligned}
\]
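A sketch, assuming NumPy/SciPy: a singular square system still “solves” via the SVD (pseudoinverse) or an iterative method such as LSQR started from zero, and both return the same min-norm solution.
```python
import numpy as np
from scipy.sparse.linalg import lsqr

A = np.array([[1.0, 2.0], [2.0, 4.0]])   # rank 1: infinitely many solutions
b = np.array([1.0, 2.0])                 # consistent right-hand side

beta_svd = np.linalg.pinv(A) @ b         # SVD-based pseudoinverse solution
beta_lsqr = lsqr(A, b)[0]                # Krylov iterations stay in the row space of A

print(beta_svd, beta_lsqr)               # both give the min-norm solution [0.2, 0.4]
```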
Is the Min-Norm Solution Special?
For LLS, the fit \(A\hat{\beta}\) is the projection of \(b\) onto the column space of the data; among all \(\beta\) achieving that fit, the min-norm solution is the unique one lying in the row space of \(A\)
For a given norm, it is the unique solution to a well-specified problem, which can often be interpreted as an appeal to simplicity
It is also the unique “most stable” solution for a given norm. Loosely,
- Take \(b + \delta b\) for some small \(\delta b\) and/or \(A + \delta A\) for some small \(\delta A\)
- Then the min-norm solution is (loosely) the one where the resulting change \(\delta \beta\) is smallest
Another interpretation we will apply to ML and nonlinear models: min-norm solutions are the ones least sensitive to data perturbations
This will also come up in Bayesian statistics: if we apply a Gaussian prior and let it become asymptotically non-informative, the MAP solution converges to the min-norm solution
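A loose empirical sketch, assuming NumPy, with a hypothetical “basic” solution rule (solve using only the first \(m\) columns) introduced purely for contrast: perturbing \(b\) moves the min-norm solution less, on average, than it moves the basic solution.
```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 4, 10
A = rng.normal(size=(m, n))
b = rng.normal(size=m)

def minnorm(bb):                    # min-norm solution via the pseudoinverse
    return np.linalg.pinv(A) @ bb

def basic(bb):                      # hypothetical rule: first m columns, zeros elsewhere
    beta = np.zeros(n)
    beta[:m] = np.linalg.solve(A[:, :m], bb)
    return beta

deltas = 1e-3 * rng.normal(size=(1000, m))
moves_minnorm = [np.linalg.norm(minnorm(b + d) - minnorm(b)) for d in deltas]
moves_basic = [np.linalg.norm(basic(b + d) - basic(b)) for d in deltas]
print(np.mean(moves_minnorm), np.mean(moves_basic))   # min-norm moves less on average
```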
Min-Norm Solution and Conditioning
- As a preview of the next lecture, consider if the matrix is almost, but not quite singular
- Tough to know due to numerical roundoff
- You may have learned from experience that everything works great if you:
- Add a small tweak to the diagonal of the matrix, or to the normal equations for LLS, or to a covariance matrix to make it positive definite
- Consider how that is related to the min-norm solution and L2 penalized LLS (i.e., “ridge”)
- Will turn out to be exactly equivalent in many algorithms
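As a preview, a sketch assuming NumPy: when \(A\) is nearly rank-deficient, adding a small value to the diagonal of the normal equations (ridge) gives nearly the same answer as the SVD-based pseudoinverse with a singular-value cutoff.
```python
import numpy as np

rng = np.random.default_rng(6)
U, _ = np.linalg.qr(rng.normal(size=(50, 3)))
V, _ = np.linalg.qr(rng.normal(size=(3, 3)))
A = U @ np.diag([1.0, 0.5, 1e-10]) @ V.T        # nearly rank 2
b = rng.normal(size=50)

beta_ridge = np.linalg.solve(A.T @ A + 1e-8 * np.eye(3), A.T @ b)   # tweak the diagonal
beta_pinv = np.linalg.pinv(A, rcond=1e-6) @ b   # treat the tiny singular value as zero

print(beta_ridge)
print(beta_pinv)                                # close to the ridge answer
```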