Introduction to Kernel Methods and Gaussian Processes
Let \(x \in \mathbb{R}^M\) be a single observation (feature vector) and \(X \in \mathbb{R}^{N \times M}\) be the “design” matrix with the \(N\) observations stacked as rows
To solve \(\min_{\theta}\left\{||X \theta - y||^2 + \lambda||\theta||^2\right\}\) for \(\lambda \geq 0\), form the normal equations
\[ \theta = (X^{\top}X + \lambda I)^{-1} X^{\top} y \]
Without proof, it turns out you can rearrange this expression to get the “other” normal equations
\[ \theta = X^{\top} (X X^{\top} + \lambda I)^{-1} y \]
Since typically \(M \ll N\), this is often a bigger system to solve (an \(N \times N\) system instead of \(M \times M\))
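As a quick numerical check (a NumPy sketch with made-up data and an arbitrary penalty), the two formulas agree:

import numpy as np

rng = np.random.default_rng(0)
N, M, lam = 100, 5, 0.1              # N observations, M features, ridge penalty
X = rng.normal(size=(N, M))
y = rng.normal(size=N)

# Primal normal equations: solve an M x M system
theta_primal = np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

# "Other" (dual) normal equations: solve an N x N system
theta_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(N), y)

assert np.allclose(theta_primal, theta_dual)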
Define the kernel (Gram) matrix on the training data and the cross-kernel matrix at new points \(\tilde{x}_1, \ldots, \tilde{x}_P\):
\[ K_{ij} \equiv [X X^{\top}]_{ij} = \mathcal{K}(x_i, x_j), \quad i = 1, \ldots, N,\ j = 1, \ldots, N \]
\[ \tilde{K}_{ij} \equiv \mathcal{K}(\tilde{x}_i, x_j), \quad i = 1, \ldots, P,\ j = 1, \ldots, N \]
Then the (ridge) predictions at the new points are
\[ \tilde{y} = \tilde{K} (K + \lambda I)^{-1} y \]
https://members.cbio.mines-paristech.fr/~jvert/svn/kernelcourse/slides/master2017/master2017.pdf
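Continuing the NumPy sketch above, the kernelized prediction \(\tilde{y} = \tilde{K}(K + \lambda I)^{-1} y\) with the linear kernel \(\mathcal{K}(x, x') = x \cdot x'\) reproduces the ridge prediction \(\tilde{X}\theta\) (the new points are made up):

P = 20
X_tilde = rng.normal(size=(P, M))        # new points where we forecast

K = X @ X.T                              # N x N Gram matrix, K_ij = x_i . x_j
K_tilde = X_tilde @ X.T                  # P x N cross-kernel matrix

y_tilde = K_tilde @ np.linalg.solve(K + lam * np.eye(N), y)

assert np.allclose(y_tilde, X_tilde @ theta_primal)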
Generalizing our previous case, \(\mathcal{K}(x,x') \equiv (x \cdot x')^2\), to polynomial order \(p\) with cross-products \[ \mathcal{K}(x,x') \equiv (x \cdot x' + c)^p \]
The dimensionality of the underlying feature map \(\phi(x)\), where \(\mathcal{K}(x,x') = \phi(x) \cdot \phi(x')\), is combinatorial in \(p\)
But now we can simply evaluate \(\mathcal{K}(x,x')\) pairwise: each evaluation is just an inner product in \(\mathbb{R}^M\), with cost independent of \(p\)!
The cost will be that we need to store the \(N \times N\) matrix of pairwise comparisons
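For example, for \(p = 2\) the explicit feature map contains all monomials up to degree 2, but we never need to form it. Continuing the same NumPy sketch (the choice \(c = 1\), \(p = 2\) is illustrative):

def poly_kernel(x, xp, c=1.0, p=2):
    return (x @ xp + c) ** p

def phi(x, c=1.0):
    # Explicit feature map for p = 2: all pairwise products, sqrt(2c) * x, and the constant c
    return np.concatenate([np.outer(x, x).ravel(), np.sqrt(2 * c) * x, [c]])

x, xp = rng.normal(size=M), rng.normal(size=M)
assert np.isclose(poly_kernel(x, xp), phi(x) @ phi(xp))

Here \(\phi(x)\) has dimension \(M^2 + M + 1\), yet evaluating the kernel directly costs only \(O(M)\).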
Our canonical example is \(\mathcal{K}(x,x') = x \cdot x'\), which is a measure of similarity between \(x\) and \(x'\). Note that \[ ||x - x'||_2 = \sqrt{\mathcal{K}(x, x) - 2 \mathcal{K}(x,x') + \mathcal{K}(x', x')} \]
Fix the lengths \(||x||_2 = ||x'||_2 = 1\) then notice that \(||x - x'||_2 = \sqrt{2 (1 - \mathcal{K}(x,x'))}\)
As \(\mathcal{K}(x,x')\) increases, the distance between \(x\) and \(x'\) decreases
Also notice that when \(\mathcal{K}(x,x') = 0\) (i.e., orthogonal, since \(x \cdot x' = 0\)) we get \(||x - x'||_2 = \sqrt{2}\), the largest distance among pairs with nonnegative similarity
\[ \mathcal{K}(x, x';\ell) \equiv \exp(-\frac{||x - x'||_2^2}{2 \ell^2}) \]
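A direct implementation of this squared-exponential (RBF) kernel, as a minimal NumPy sketch (the example points and lengthscale are arbitrary); nearby points get values close to 1, distant points values close to 0:

import numpy as np

def rbf_kernel(x, xp, ell=1.0):
    # K(x, x'; ell) = exp(-||x - x'||_2^2 / (2 * ell^2))
    return np.exp(-np.sum((x - xp) ** 2) / (2 * ell ** 2))

x = np.array([0.0, 0.0])
print(rbf_kernel(x, np.array([0.1, 0.0])))   # ~0.995: nearby points are highly "similar"
print(rbf_kernel(x, np.array([3.0, 0.0])))   # ~0.011: distant points are nearly "dissimilar"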
Unsurprisingly, just as Gaussian random variables are characterized by their first two moments, so are GPs
Let the mean function be
\[ m(x) = \mathbb{E}[f(x)] \]
And the covariance function (kernel) be
\[ \mathcal{K}(x,x') = \mathbb{E}[(f(x) - m(x))(f(x') - m(x'))] \]
Then we can denote a GP as
\[ f(x) \sim GP(m(x), \mathcal{K}(x,x')) \]
\[ \begin{aligned} f_X\,|\,X &\sim \mathcal{N}(\mu_X, K_{XX})\\ \mu_X &\equiv [m(x_i)]_{i=1}^N\\ K_{XX} &\equiv [\mathcal{K}(x_i, x_j)]_{i,j=1}^N \end{aligned} \]
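To make this concrete, a minimal NumPy sketch drawing a few functions from the prior marginal \(\mathcal{N}(\mu_X, K_{XX})\) on a grid, with zero mean and the RBF kernel above (the grid, lengthscale, and jitter are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
x_grid = np.linspace(0, 1, 50)
ell = 0.2

# K_XX from the RBF kernel; m(x) = 0
K_XX = np.exp(-(x_grid[:, None] - x_grid[None, :]) ** 2 / (2 * ell ** 2))
mu_X = np.zeros_like(x_grid)

# Three draws of f_X ~ N(mu_X, K_XX); a small jitter keeps the covariance numerically PSD
f_draws = rng.multivariate_normal(mu_X, K_XX + 1e-8 * np.eye(len(x_grid)), size=3)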
Let \(X\) and \(f_X\) be observable (without noise in the simplest case)
Let \(\tilde{X}\) be new data where we want to forecast the distribution of \(f_{\tilde{X}}\)
See ProbML Book 2 Section 18.3 for more details
\[ \begin{aligned} f_{\tilde{X}}\,|\,X, f_X, \tilde{X} &\sim \mathcal{N}(\mu_{\tilde{X}}, \Sigma_{\tilde{X}})\\ \mu_{\tilde{X}} &\equiv m_{\tilde{X}} + K_{X\tilde{X}}^{\top} K_{XX}^{-1} (f_X - \mu_X)\\ \Sigma_{\tilde{X}} &\equiv K_{\tilde{X}\tilde{X}} - K_{X\tilde{X}}^{\top} K_{XX}^{-1} K_{X\tilde{X}} \end{aligned} \]
where \(m_{\tilde{X}} \equiv [m(\tilde{x}_i)]_{i=1}^P\) is the prior mean at the new points
Can easily adjust for noisy observations \(y = f(x) + \sigma \epsilon\) with \(\epsilon \sim \mathcal{N}(0,I)\): just replace \(K_{XX}\) with \(K_{XX} + \sigma^2 I\) above
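A minimal NumPy sketch of these posterior formulas with noisy observations, i.e., using \(K_{XX} + \sigma^2 I\) in place of \(K_{XX}\) (the training points, noise variance, and kernel lengthscale are made-up values):

import numpy as np

def k(a, b, ell=0.2):
    # RBF kernel matrix between two sets of scalar inputs
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell ** 2))

x_train = np.array([0.1, 0.4, 0.7, 0.9])
y_train = np.sin(2 * np.pi * x_train)        # observed f_X values (here: a sine function)
sigma2 = 1e-2                                # assumed observation-noise variance
x_test = np.linspace(0, 1, 50)

K_noisy = k(x_train, x_train) + sigma2 * np.eye(len(x_train))   # K_XX + sigma^2 I
K_cross = k(x_train, x_test)                                    # K_{X, X~}, shape N x P

# Posterior mean and covariance at the test points (zero prior mean)
mu_post = K_cross.T @ np.linalg.solve(K_noisy, y_train)
Sigma_post = k(x_test, x_test) - K_cross.T @ np.linalg.solve(K_noisy, K_cross)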
Connecting back to regression, consider regularized least squares over a function space \(\mathcal{F}\):
\[ \min_{f \in \mathcal{F}} \left\{\sum_{n=1}^N (y_n - f(x_n))^2 + \frac{\lambda}{2}||f||^2\right\} \]
Taking \(\mathcal{F}\) to be the reproducing kernel Hilbert space (RKHS) \(\mathcal{H}_{\mathcal{K}}\) associated with the kernel \(\mathcal{K}\), this becomes
\[ \min_{f \in \mathcal{H}_{\mathcal{K}}} \left\{\sum_{n=1}^N(y_n - f(x_n))^2 + \frac{\lambda}{2}||f||^2_{\mathcal{H}_{\mathcal{K}}}\right\} \]
See ProbML Book 2 Section 18.3.7.3 for more details
The representer theorem says that any solution to regularized ERM in the RKHS,
\[ f^* = \arg\min_{f \in \mathcal{H}_{\mathcal{K}}} \left\{\sum_{n=1}^N\ell(f, x_n, y_n) + \frac{\lambda}{2}||f||^2_{\mathcal{H}_{\mathcal{K}}}\right\} \]
can be written as a finite kernel expansion over the training points
\[ f^*(x) = \sum_{n=1}^N \alpha_n \mathcal{K}(x, x_n) \]
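For the squared loss (and assuming \(K_{XX}\) is invertible), substituting \(f^*\) into the objective gives a finite-dimensional problem in \(\alpha\) whose solution recovers kernel ridge regression:
\[ \min_{\alpha \in \mathbb{R}^N} \left\{||y - K_{XX}\alpha||_2^2 + \frac{\lambda}{2} \alpha^{\top} K_{XX} \alpha\right\} \quad\Longrightarrow\quad \alpha = \left(K_{XX} + \tfrac{\lambda}{2} I\right)^{-1} y \]
using \(||f||^2_{\mathcal{H}_{\mathcal{K}}} = \alpha^{\top} K_{XX} \alpha\); absorbing the factor of 2 into \(\lambda\) gives the earlier prediction \(\tilde{y} = \tilde{K}(K + \lambda I)^{-1} y\)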
Recall the min-norm, ridgeless OLS: \[ \lim_{\lambda \to 0} \arg\min_{\alpha} \left\{||X \alpha - y||^2_2 + \lambda ||\alpha||^2_2\right\} = \arg\min_{\alpha} ||\alpha||_2^2 \ \text{ s.t. } X \alpha = y \]
For an interpolating problem (i.e., one where the residuals \(\mathbf{r}(\cdot) = 0\) can be attained exactly), take \(\lambda \to 0\) in the representer-theorem objective to get
\[ \begin{aligned} \min_{\mathbf{\alpha}, \eta} &\left\{\mathbf{\alpha}^{\top}K_{XX}(\theta)\mathbf{\alpha}\right\}\\ \text{s.t. }\, & \mathbf{r}(K_{XX}(\theta)\cdot \mathbf{\alpha}+ \mathbf{m}(\eta, \theta);\theta, \eta) = 0 \end{aligned} \]
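Turning to implementation in GPyTorch: the code below assumes imports and training data that are not defined in this section. A minimal assumed setup, in the spirit of the standard GPyTorch regression example (the sine data and noise scale are assumptions), might be:

import math
import torch
import gpytorch
import matplotlib.pyplot as plt

torch.manual_seed(0)

# Assumed training data: 100 noisy observations of a sine wave on [0, 1]
train_x = torch.linspace(0, 1, 100)
train_y = torch.sin(train_x * (2 * math.pi)) + 0.2 * torch.randn(train_x.size())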
We use a constant mean, gpytorch.means.ConstantMean(), but it could be a function of \(x\).

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)
model.train()
likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1) # Includes GaussianLikelihood parameters
# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
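The training loop itself is not shown; a sketch that would produce an iteration log like the one below (100 iterations printing every 10 is inferred from the output; the exact print format is an assumption):

training_iter = 100
for i in range(training_iter):
    optimizer.zero_grad()
    output = model(train_x)              # model output at the training inputs
    loss = -mll(output, train_y)         # negative marginal log likelihood
    loss.backward()
    if i % 10 == 0:
        print(f"Iter={i}, Loss={loss.item()}")
    optimizer.step()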
Iter=0, Loss=0.9302332997322083
Iter=10, Loss=0.5079350471496582
Iter=20, Loss=0.17511649429798126
Iter=30, Loss=-0.02991340681910515
Iter=40, Loss=-0.020419616252183914
Iter=50, Loss=-0.034119490534067154
Iter=60, Loss=-0.03867233172059059
Iter=70, Loss=-0.0389210507273674
Iter=80, Loss=-0.04004982113838196
Iter=90, Loss=-0.039925917983055115
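Inspecting the likelihood module after training (e.g., with print(likelihood)) shows the homoskedastic noise model and its constraint: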
GaussianLikelihood(
(noise_covar): HomoskedasticNoise(
(raw_noise_constraint): GreaterThan(1.000E-04)
)
)
model.eval()        # switch model and likelihood to evaluation (posterior predictive) mode
likelihood.eval()
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    test_x = torch.linspace(0, 1, 51)
    observed_pred = likelihood(model(test_x))
fig, ax = plt.subplots()
lower, upper = observed_pred.confidence_region()
ax.plot(train_x.numpy(), train_y.numpy(), 'k*')
ax.plot(test_x.numpy(), observed_pred.mean.numpy(), 'b')
# Shade between the lower and upper confidence bounds
ax.fill_between(test_x.numpy(), lower.numpy(), upper.numpy(), alpha=0.5)
ax.set_ylim([-3, 3])
ax.legend(['Observed Data', 'Mean', 'Confidence'])
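Note that confidence_region() returns two standard deviations above and below the posterior mean, so the shaded band is roughly a pointwise 95% interval.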