ECON622: Computational Economics with Data Science Applications

Numerical Linear Algebra with Iterative Methods

Jesse Perla

University of British Columbia

Overview

Motivation

  • In preparation for the ML lectures we cover some core numerical linear algebra concepts
  • Many of these are directly useful
    • e.g. solving large linear least squares (LLS) problems and systems of equations, such as those arising in a large-scale two-way fixed effects model
    • Solving systems of equations is useful in itself
  • Others will help build understanding for the ML material
    • Matrix-free and iterative methods
    • What governs complexity and convergence speed
    • Conditioning
    • Regularization

Summary and Material

using LinearAlgebra, Statistics, BenchmarkTools, SparseArrays, Random
using LaTeXStrings, Plots, IterativeSolvers, Preconditioners, IncompleteLU, LinearMaps
using Arpack
Random.seed!(42);  # seed random numbers for reproducibility

Conditioning

Direct Methods and Conditioning

  • Some algorithms and some matrices are more numerically stable than others
    • By “numerically stable” we mean not overly sensitive to accumulated roundoff errors
  • A key issue arises when matrices are close to singular, i.e., have nearly collinear columns. Sometimes this can’t be avoided; other times it can (e.g., choose orthogonal polynomials rather than monomials, as in the sketch below)
  • This will become even more of an issue with iterative methods, but is also the key to rapid convergence. Hint: \(A x = b\) is easy if \(A = I\), even if it is dense.
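  • A minimal sketch (with a hypothetical grid and degree) comparing the conditioning of a monomial basis matrix to a Chebyshev basis matrix on the same nodes
x_nodes = range(-1.0, 1.0, length = 20)
degree = 10
V_monomial = [x^j for x in x_nodes, j in 0:degree]                # columns 1, x, x^2, ...
V_chebyshev = [cos(j * acos(x)) for x in x_nodes, j in 0:degree]  # Chebyshev T_j(x) = cos(j * acos(x))
@show cond(V_monomial)
@show cond(V_chebyshev);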

Condition Numbers of Matrices

  • \(\det(A) \approx 0\) may say it is “almost” singular, but it is not scale-invariant

  • The condition number \(\kappa\), for a given matrix norm \(\|\cdot\|\), is defined as

    \[ \text{cond}(A) \equiv \|A\| \|A^{-1}\|\geq 1 \]

  • Expensive to calculate in general. For the spectral norm and a normal (e.g., symmetric) matrix, it follows from the spectrum

    \[ \text{cond}(A) = \left|\frac{\lambda_{max}}{\lambda_{min}}\right| \]

  • Intuition: if \(\text{cond}(A) = K\), then a perturbation \(b \to b + \Delta b\) can be amplified into a relative error in \(x\) of up to \(K \|\Delta b\| / \|b\|\) when solving \(A x = b\) (see the sketch below)

  • See Matlab Docs on inv for why inv is a bad idea when \(\text{cond}(A)\) is huge
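  • A minimal sketch of this amplification with two made-up \(2 \times 2\) matrices: the same small perturbation of \(b\) barely moves \(x\) for a well-conditioned \(A\), but moves it a lot for a nearly singular one
A_well = [1.0 0.0; 0.0 1.0]
A_ill = [1.0 1.0; 1.0 1.0 + 1e-10]    # nearly collinear columns
b_demo = [1.0, 1.0]
delta_b = 1e-8 * [1.0, -1.0]          # a small perturbation of b
for A_demo in (A_well, A_ill)
    x_base = A_demo \ b_demo
    x_pert = A_demo \ (b_demo + delta_b)
    println("cond(A) = $(cond(A_demo)), relative change in x = $(norm(x_pert - x_base) / norm(x_base))")
end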

Condition Numbers and Matrix Operations

  • The identity matrix is as good as it gets
  • Otherwise, the issue arises when rows or columns are of fundamentally different scales
@show cond(I(2))
epsilon = 1E-6
A2 = [1.0 0.0
     1.0 epsilon]
@show cond(A2);
@show cond(A2');
@show cond(inv(A2));
cond(I(2)) = 1.0
cond(A2) = 2.0000000000005004e6
cond(A2') = 2.0000000000004997e6
cond(inv(A2)) = 2.0000000002323308e6

Conditioning Under Matrix Products

  • Matrix operations can often amplify the condition number, or may leave it invariant
  • Be especially careful with the normal equations: forming \(X^T X\) squares the condition number in exact arithmetic, and roundoff can make it even worse
lauchli(N, epsilon) = [ones(N)';
                       epsilon * I(N)]'
epsilon = 1E-8
L = lauchli(3, epsilon) |> Matrix
@show cond(L)
@show cond(L' * L)
L
cond(L) = 1.732050807568878e8
cond(L' * L) = 2.8104131146758097e32
3×4 Matrix{Float64}:
 1.0  1.0e-8  0.0     0.0
 1.0  0.0     1.0e-8  0.0
 1.0  0.0     0.0     1.0e-8

Stationary Iterative Methods

Direct Methods

  • Direct methods work with a matrix, stored in memory, and typically involve factorizations
    • Can be dense or sparse
    • They can be fast, and solve problems to machine precision
  • They are typically superior until problems get large or have particular structure
  • But always use the right factorizations and matrix structure! (e.g., posdef, sparse, etc.), as in the sketch after this list
  • The key limitations are the sizes of the matrices (or the sparsity)
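  • A minimal sketch (with an assumed random posdef matrix): exploiting structure lets us use a cheaper factorization
M_test = randn(1000, 1000)
A_posdef = M_test' * M_test + 1000 * I            # symmetric positive definite by construction
b_test = randn(1000)
@btime lu($A_posdef) \ $b_test;                   # generic LU factorization
@btime cholesky(Symmetric($A_posdef)) \ $b_test;  # exploits posdef structure, roughly half the work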

Iterative Methods

  • Iterative methods are in the spirit of gradient descent and optimization algorithms
    • They take an initial guess and update until convergence
    • They work on matrix-vector and vector-matrix products, and can be matrix-free, which is a huge advantage for huge problems
    • Rather than waiting until completion like direct methods, you can control stopping
  • The key limitations on performance are geometric (e.g., conditioning), not dimensionality
  • Two rough types: stationary methods and Krylov methods

Example from Previous Lectures

  • Variation on the CTMC example: \(a > 0\) is the rate of gaining (moving up a state), \(b > 0\) the rate of losing (moving down)
  • Solve the Bellman equation for the CTMC
N = 100
a = 0.1
b = 0.05
rho = 0.05
Q = Tridiagonal(fill(b, N-1),
                [-a; fill(-(a + b), N-2); -b],
                fill(a, N-1))

r = range(0.0, 10.0, length = N)
A = rho * I - Q
v_direct = A \ r
mean(v_direct)
101.96306207828795

Diagonal Dominance

  • Stationary iterative methods reorganize the problem into a contraction mapping and then iterate to convergence

  • For matrices that are strictly diagonally dominant \[ |A_{ii}| > \sum_{j\neq i} |A_{ij}| \quad\text{for all } i = 1\ldots N \]

    • i.e., the sum of the absolute values of the off-diagonal elements in each row is strictly less than the absolute value of the diagonal element
  • Note that for our problem the rows of \(Q\) sum to 0, so if \(\rho > 0\) then \(\rho I - Q\) is strictly diagonally dominant (checked in the sketch below)
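  • A minimal sketch checking this row by row for \(A = \rho I - Q\) from the example above (strictly_diagonally_dominant is our own helper)
function strictly_diagonally_dominant(A)
    return all(abs(A[i, i]) > sum(abs(A[i, j]) for j in axes(A, 2) if j != i)
               for i in axes(A, 1))
end
@show strictly_diagonally_dominant(rho * I - Q);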

Jacobi Iteration

  • To solve a system \(A x = b\), split the matrix \(A\) into its diagonal and off-diagonal elements. That is,

\[ A \equiv D + R \]

\[ D \equiv \begin{bmatrix} A_{11} & 0 & \ldots & 0\\ 0 & A_{22} & \ldots & 0\\ \vdots & \vdots & \vdots & \vdots\\ 0 & 0 & \ldots & A_{NN} \end{bmatrix}\,\, R \equiv \begin{bmatrix} 0 & A_{12} & \ldots & A_{1N} \\ A_{21} & 0 & \ldots & A_{2N} \\ \vdots & \vdots & \vdots & \vdots\\ A_{N1} & A_{N2} & \ldots & 0 \end{bmatrix} \]

Jacobi Iteration Algorithm

  • Then we can rewrite \((D + R) x = b\) as \[ \begin{aligned} D x &= b - R x\\ x &= D^{-1} (b - R x) \end{aligned} \]

where \(D^{-1}\) is trivial to apply since \(D\) is diagonal. To solve, iterate on \(x^k\), starting from an initial guess \(x^0\),

\[ x^{k+1} = D^{-1}(b - R x^k) \]
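A minimal sketch of this iteration written by hand, reusing A, r, and v_direct from the CTMC example above (jacobi_by_hand is our own illustrative helper); the packaged routines below are more efficient.
function jacobi_by_hand(A, b; iterations = 100)
    D = Diagonal(A)           # diagonal part of A
    R = A - D                 # off-diagonal part of A
    x = zeros(length(b))
    for _ in 1:iterations
        x = D \ (b - R * x)   # x^{k+1} = D^{-1} (b - R x^k)
    end
    return x
end
v_jacobi = jacobi_by_hand(A, r)
@show norm(v_jacobi - v_direct, Inf);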

Code for Jacobi Iteration

  • Showing Jacobi iteration and a better method, successive over-relaxation (SOR). Many better algorithms exist
v = zeros(N)

jacobi!(v, A, r, maxiter = 40)
@show norm(v - v_direct, Inf)
sor!(v, A, r, 1.1, maxiter = 40)
@show norm(v - v_direct, Inf);
norm(v - v_direct, Inf) = 0.0017762754968373429
norm(v - v_direct, Inf) = 9.052314453583676e-12

Krylov Methods

Krylov Subspaces

  • Krylov methods are a class of iterative methods that use a sequence of subspaces
  • The subspaces are generated by repeated matrix-vector products
    • i.e., given \(A\) and an initial vector \(b\), we can generate the sequence
    • \(b, A b, A^2 b, \ldots, A^k b\) and search for approximate solutions in the space they span (see the sketch below)
  • Note that the only operation we require from our linear operator \(A\) is the matrix-vector product. This is a huge advantage for large problems
  • e.g. the Conjugate Gradient method is a Krylov method for positive definite \(A\)
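  • A minimal sketch (with a made-up \(2 \times 2\) example) of the vectors spanning the Krylov subspace
A_krylov = [2.0 1.0; 1.0 3.0]
b_krylov = [1.0, 0.0]
k = 3
# columns are b, A b, A^2 b; Krylov methods search for approximate solutions in their span
K = hcat([A_krylov^j * b_krylov for j in 0:(k - 1)]...)
@show K;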

Conjugate Gradient

  • Solving this system with the conjugate gradient method
  • Using the matrix here, but we could implement \(A\) as a matrix-free function
N = 100
A = sprand(100, 100, 0.1)
A = A * A'  # easy posdef
b = rand(N)
x_direct = A \ b
@show cond(Matrix(A * A'))
x = zeros(N)
sol = cg!(x, A, b, log = true, maxiter = 1000)
sol[end]
cond(Matrix(A * A')) = 1.0375698828984257e12
Converged after 183 iterations.

Iterative Methods for LLS

  • LSMR is one of several Krylov methods for solving LLS

\[ \min_{\beta} \| X \beta -y \|^2 + \alpha \| \beta\|^2 \]

  • Where \(\alpha \geq 0\). If \(\alpha = 0\) then it delivers the ridgeless regression limit (the minimum-norm solution), even if the system is underdetermined

LSMR Example

M = 1000
N = 10000
sigma = 0.1
beta = rand(M)
# simulate data
X = sprand(N, M, 0.1)
y = X * beta + sigma * randn(N)
beta_direct = X \ y
results = lsmr(X, y, log = true)
beta_lsmr = results[1]
@show norm(beta_direct - beta_lsmr)
println("$(results[end])")
norm(beta_direct - beta_lsmr) = 1.0615184241157149e-5
Converged after 15 iterations.

Matrix-Free LLS

  • To solve LLS problems, we need \(X u\) and \(X^T v\) products
  • We can provide those functions directly (cheating here by just using the matrix itself)
# matrix-vector product
X_func(u) = X * u

# adjoint-vector product
X_T_func(v) = X' * v

X_map = LinearMap(X_func, X_T_func, N, M)
results = lsmr(X_map, y, log = true)
println("$(results[end])")
Converged after 15 iterations.

Eigenvalue Problems

  • Variation on the CTMC example: \(a > 0\) is the rate of gaining (moving up a state), \(b > 0\) the rate of losing (moving down)
N = 4
a = 0.1
b = 0.05
Q = Tridiagonal(fill(b, N-1),
                [-a; fill(-(a + b), N-2); -b],
                fill(a, N-1))
# Find smallest magnitude eigenvalue (i.e. 0)
lambda, phi = eigs( Q', nev = 1, which = :SM, maxiter = 1000)
phi = real(phi) ./ sum(real(phi))
@show lambda
@show mean(phi);
Q'
lambda = ComplexF64[-9.270990685613062e-18 + 0.0im]
mean(phi) = 0.25
4×4 Tridiagonal{Float64, Vector{Float64}}:
 -0.1   0.05    ⋅      ⋅ 
  0.1  -0.15   0.05    ⋅ 
   ⋅    0.1   -0.15   0.05
   ⋅     ⋅     0.1   -0.05

Implementing Matrix-Free Operator for Adjoint

function Q_adj_product(x)
    Q_x = zero(x)
    Q_x[1] = -a * x[1] + b * x[2]
    for i = 2:(N-1)
        Q_x[i] = a * x[i-1] - (a + b) * x[i] + b * x[i+1]
    end
    Q_x[N] = a * x[N-1] - b * x[N]
    return Q_x
end
x_check = rand(N)
norm(Q_adj_product(x_check) - Q' * x_check)
0.0

Solving with a Wrapper for the Matrix-Free Operator

  • The LinearMap wrapper adds features required by the algorithms (e.g. size(Q_adj_map) and Q_adj_map * v overloads)
Q_adj_map = LinearMap(Q_adj_product, N)
# Get smallest magnitude only using the Q'(x) map
lambda, phi = eigs(Q_adj_map, nev = 1, which = :SM, maxiter = 1000)
phi = real(phi) ./ sum(real(phi))
@show lambda
@show mean(phi);
lambda = ComplexF64[1.6525994004046e-17 + 0.0im]
mean(phi) = 0.25

Preconditioning

Changing the Geometry

  • In practice, most Krylov methods are preconditioned, or else direct methods usually dominate. The same holds for large nonlinear systems
  • As discussed, the key issue for the convergence speed of iterative methods is the geometry (e.g. the condition number of the Hessian)
  • Preconditioning changes the geometry, e.g. making level sets more circular or, for eigenvalue problems, spreading out the eigenvalues of interest
  • Choosing a preconditioner for a matrix \(A\) requires art and tradeoffs
    • Want it to be relatively cheap to calculate, and it must be invertible
    • Want to have \(\text{cond}(P A) \ll \text{cond}(A)\) (see the sketch below)
  • The ideal preconditioner for \(A x = b\) is \(P=A^{-1}\) since \(A^{-1} A x = x = A^{-1} b\)
    • \(\text{cond}(A^{-1}A)=1\)! But that is equivalent to solving the original problem
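  • A minimal sketch with an assumed badly scaled matrix: even a diagonal \(P\) can reduce the condition number substantially when rows have very different scales
N_pre = 100
scales = 10.0 .^ range(0, 4, length = N_pre)            # row scales spanning four orders of magnitude
A_scaled = Diagonal(scales) * (sprand(N_pre, N_pre, 0.1) + I)
P = Diagonal(A_scaled)                                  # cheap diagonal preconditioner
@show cond(Matrix(A_scaled))
@show cond(P \ Matrix(A_scaled));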

Right-Preconditioning a Linear System

\[ \begin{aligned} A x &= b\\ A P^{-1} P x &= b\\ A P^{-1} y &= b\\ P x &= y \end{aligned} \] That is, solve \((A P^{-1})y = b\) for \(y\), and then solve \(P x = y\) for \(x\).
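A minimal sketch of these two steps, with an assumed sparse system and a diagonal \(P\) (all names here are illustrative); GMRES is used because \(A P^{-1}\) need not stay symmetric.
N_rp = 100
A_rp = sprand(N_rp, N_rp, 0.1) + 10.0 * I      # comfortably nonsingular sparse matrix
b_rp = rand(N_rp)
P = Diagonal(A_rp)                              # a cheap, invertible preconditioner
AP_inv = LinearMap(y -> A_rp * (P \ y), N_rp)   # matrix-free operator acting as A P^{-1}
y_rp = gmres(AP_inv, b_rp, maxiter = 1000)      # solve (A P^{-1}) y = b for y
x_rp = P \ y_rp                                 # then solve P x = y for x
@show norm(A_rp * x_rp - b_rp);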

Raw Conjugate Gradient

N = 200
A = sprand(N, N, 0.1)   # 10 percent non-zeros
A = A * A'
b = rand(N)
@show cond(Matrix(A))
sol = cg(A, b, log = true, maxiter = 1000)
sol[end]
cond(Matrix(A)) = 972303.1119375983
Converged after 389 iterations.

Diagonal Preconditioner

  • A simple preconditioner is the diagonal of \(A\)
  • This is cheap to calculate, and is invertible if the diagonal has no zeros
  • For some problems this has a huge impact on convergence/condition, for others it does nothing
P = DiagonalPreconditioner(A)
sol = cg(A, b; Pl = P, log = true, maxiter = 1000)
sol[end]
Converged after 367 iterations.

Incomplete LU or Cholesky

  • Compute an approximate (incomplete) LU or Cholesky factorization, dropping small entries (here controlled by the drop tolerance \(\tau\))
  • Not the full inverse, but can make a big difference
P = ilu(A, τ = 0.1)
sol = cg(A, b; Pl = P, log = true, maxiter = 1000)
sol[end]
Converged after 151 iterations.

Others

  • In the examples above we aren’t getting huge gains, but the random matrix also lacks structure
  • If you have problems with multiple scales, as might come from discretizing multiple dimensions in a state space, see multigrid methods
    • An Algebraic Multigrid (AMG) preconditioner is often useful even without multiple scales (see the sketch below)
  • Other preconditioners include ones intended for graph Laplacians, such as approximate Cholesky decompositions and combinatorial multigrid preconditioners
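  • A minimal sketch of an AMG preconditioner for the sparse posdef A above, assuming the AlgebraicMultigrid.jl package is available; the random matrix lacks the structure where AMG shines, so gains may be modest
using AlgebraicMultigrid
ml = ruge_stuben(A)            # build an AMG hierarchy from the sparse matrix
P = aspreconditioner(ml)
sol = cg(A, b; Pl = P, log = true, maxiter = 1000)
sol[end]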