ECON622: Computational Economics with Data Science Applications

Numerical Linear Algebra with Iterative Methods

Jesse Perla

University of British Columbia

Overview

Motivation

  • In preparation for the ML lectures we cover some core numerical linear algebra concepts
  • Many of these are directly useful
    • e.g. solving large linear least squares (LLS) problems and systems of equations, such as those arising in a large-scale two-way fixed effects model
    • Solving systems of equations is useful in itself
  • Others will help build understanding for the ML material
    • Matrix-free and iterative methods
    • What governs complexity and convergence speed
    • Conditioning
    • Regularization

Summary and Material

using LinearAlgebra, Statistics, BenchmarkTools, SparseArrays, Random
using LaTeXStrings, Plots, IterativeSolvers, Preconditioners, IncompleteLU, LinearMaps
using Arpack
Random.seed!(42);  # seed random numbers for reproducibility

Conditioning

Direct Methods and Conditioning

  • Some algorithms and some matrices are more numerically stable than others
    • By “numerically stable” we mean not overly sensitive to accumulated roundoff errors
  • A key issue arises when matrices are close to singular, i.e., have nearly collinear columns. Sometimes this can’t be avoided; other times it can (e.g., choose orthogonal polynomials rather than monomials, as in the sketch below)
  • This will become even more of an issue with iterative methods, but is also the key to rapid convergence. Hint: \(A x = b\) is easy if \(A = I\), even if it is dense.
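  • A minimal sketch (with a hypothetical grid and degree) comparing the conditioning of a monomial basis matrix to a Chebyshev basis matrix on the same nodes
x_nodes = range(-1.0, 1.0, length = 20)
degree = 10
V_monomial = [x^j for x in x_nodes, j in 0:degree]                # columns 1, x, x^2, ...
V_chebyshev = [cos(j * acos(x)) for x in x_nodes, j in 0:degree]  # Chebyshev T_j(x) = cos(j * acos(x))
@show cond(V_monomial)
@show cond(V_chebyshev);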

Condition Numbers of Matrices

  • \(\det(A) \approx 0\) may say it is “almost” singular, but it is not scale-invariant

  • The condition number \(\kappa\), for a given matrix norm \(\|\cdot\|\), is defined as

    \[ \text{cond}(A) \equiv \|A\| \|A^{-1}\|\geq 1 \]

  • Expensive to calculate in general. For the spectral norm and a normal (e.g., symmetric) matrix, it follows from the spectrum

    \[ \text{cond}(A) = \left|\frac{\lambda_{max}}{\lambda_{min}}\right| \]

  • Intuition: if \(\text{cond}(A) = K\), then a perturbation \(b \to b + \Delta b\) can be amplified into a relative error in \(x\) of up to \(K \|\Delta b\| / \|b\|\) when solving \(A x = b\) (see the sketch below)

  • See Matlab Docs on inv for why inv is a bad idea when \(\text{cond}(A)\) is huge
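  • A minimal sketch of this amplification with two made-up \(2 \times 2\) matrices: the same small perturbation of \(b\) barely moves \(x\) for a well-conditioned \(A\), but moves it a lot for a nearly singular one
A_well = [1.0 0.0; 0.0 1.0]
A_ill = [1.0 1.0; 1.0 1.0 + 1e-10]    # nearly collinear columns
b_demo = [1.0, 1.0]
delta_b = 1e-8 * [1.0, -1.0]          # a small perturbation of b
for A_demo in (A_well, A_ill)
    x_base = A_demo \ b_demo
    x_pert = A_demo \ (b_demo + delta_b)
    println("cond(A) = $(cond(A_demo)), relative change in x = $(norm(x_pert - x_base) / norm(x_base))")
end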

Condition Numbers and Matrix Operations

  • The identity matrix is as good as it gets
  • Otherwise, the issue arises when rows or columns are of fundamentally different scales
@show cond(I(2))
epsilon = 1E-6
A2 = [1.0 0.0
     1.0 epsilon]
@show cond(A2);
@show cond(A2');
@show cond(inv(A2));
cond(I(2)) = 1.0
cond(A2) = 2.0000000000005004e6
cond(A2') = 2.0000000000004997e6
cond(inv(A2)) = 2.0000000002323308e6

Conditioning Under Matrix Products

  • Matrix operations can often amplify the condition number, or may leave it invariant
  • Be especially careful with the normal equations: forming \(X^T X\) squares the condition number in exact arithmetic, and roundoff can make it even worse
lauchli(N, epsilon) = [ones(N)';
                       epsilon * I(N)]'
epsilon = 1E-8
L = lauchli(3, epsilon) |> Matrix
@show cond(L)
@show cond(L' * L)
L
cond(L) = 1.732050807568878e8
cond(L' * L) = 2.8104131146758097e32
3×4 Matrix{Float64}:
 1.0  1.0e-8  0.0     0.0
 1.0  0.0     1.0e-8  0.0
 1.0  0.0     0.0     1.0e-8

Stationary Iterative Methods

Direct Methods

  • Direct methods work with a matrix, stored in memory, and typically involve factorizations
    • Can be dense or sparse
    • They can be fast, and solve problems to machine precision
  • They are typically superior until problems get large or have particular structure
  • But always use the right factorizations and matrix structure! (e.g., posdef, sparse, etc.), as in the sketch after this list
  • The key limitations are the sizes of the matrices (or the sparsity)
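  • A minimal sketch (with an assumed random posdef matrix): exploiting structure lets us use a cheaper factorization
M_test = randn(1000, 1000)
A_posdef = M_test' * M_test + 1000 * I            # symmetric positive definite by construction
b_test = randn(1000)
@btime lu($A_posdef) \ $b_test;                   # generic LU factorization
@btime cholesky(Symmetric($A_posdef)) \ $b_test;  # exploits posdef structure, roughly half the work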

Iterative Methods

  • Iterative methods are in the spirit of gradient descent and optimization algorithms
    • They take an initial guess and update until convergence
    • They work on matrix-vector and vector-matrix products, and can be matrix-free, which is a huge advantage for huge problems
    • Rather than waiting until completion like direct methods, you can control stopping
  • The key limitations on performance are geometric (e.g., conditioning), not dimensionality
  • Two rough types: stationary methods and Krylov methods

Example from Previous Lectures

  • Variation on the CTMC example: \(a > 0\) is the rate of gaining (moving up a state), \(b > 0\) the rate of losing (moving down)
  • Solve the Bellman equation for the CTMC
N = 100
a = 0.1
b = 0.05
rho = 0.05
Q = Tridiagonal(fill(b, N-1),
                [-a; fill(-(a + b), N-2); -b],
                fill(a, N-1))

r = range(0.0, 10.0, length = N)
A = rho * I - Q
v_direct = A \ r
mean(v_direct)
101.96306207828795

Diagonal Dominance

  • Stationary iterative methods reorganize the problem into a contraction mapping and then iterate to convergence

  • For matrices that are strictly diagonally dominant \[ |A_{ii}| > \sum_{j\neq i} |A_{ij}| \quad\text{for all } i = 1\ldots N \]

    • i.e., the sum of the absolute values of the off-diagonal elements in each row is strictly less than the absolute value of the diagonal element
  • Note that for our problem the rows of \(Q\) sum to 0, so if \(\rho > 0\) then \(\rho I - Q\) is strictly diagonally dominant (checked in the sketch below)
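  • A minimal sketch checking this row by row for \(A = \rho I - Q\) from the example above (strictly_diagonally_dominant is our own helper)
function strictly_diagonally_dominant(A)
    return all(abs(A[i, i]) > sum(abs(A[i, j]) for j in axes(A, 2) if j != i)
               for i in axes(A, 1))
end
@show strictly_diagonally_dominant(rho * I - Q);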

Jacobi Iteration

  • To solve a system \(A x = b\), split the matrix \(A\) into its diagonal and off-diagonal elements. That is,

\[ A \equiv D + R \]

\[ D \equiv \begin{bmatrix} A_{11} & 0 & \ldots & 0\\ 0 & A_{22} & \ldots & 0\\ \vdots & \vdots & \vdots & \vdots\\ 0 & 0 & \ldots & A_{NN} \end{bmatrix}\,\, R \equiv \begin{bmatrix} 0 & A_{12} & \ldots & A_{1N} \\ A_{21} & 0 & \ldots & A_{2N} \\ \vdots & \vdots & \vdots & \vdots\\ A_{N1} & A_{N2} & \ldots & 0 \end{bmatrix} \]

Jacobi Iteration Algorithm

  • Then we can rewrite \((D + R) x = b\) as \[ \begin{aligned} D x &= b - R x\\ x &= D^{-1} (b - R x) \end{aligned} \]

where \(D^{-1}\) is trivial to apply since \(D\) is diagonal. To solve, iterate on \(x^k\), starting from an initial guess \(x^0\),

\[ x^{k+1} = D^{-1}(b - R x^k) \]
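A minimal sketch of this iteration written by hand, reusing A, r, and v_direct from the CTMC example above (jacobi_by_hand is our own illustrative helper); the packaged routines below are more efficient.
function jacobi_by_hand(A, b; iterations = 100)
    D = Diagonal(A)           # diagonal part of A
    R = A - D                 # off-diagonal part of A
    x = zeros(length(b))
    for _ in 1:iterations
        x = D \ (b - R * x)   # x^{k+1} = D^{-1} (b - R x^k)
    end
    return x
end
v_jacobi = jacobi_by_hand(A, r)
@show norm(v_jacobi - v_direct, Inf);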

Code for Jacobi Iteration

  • Showing Jacobi iteration and a better method, successive over-relaxation (SOR). Many better algorithms exist
v = zeros(N)

jacobi!(v, A, r, maxiter = 40)
@show norm(v - v_direct, Inf)
sor!(v, A, r, 1.1, maxiter = 40)
@show norm(v - v_direct, Inf);
norm(v - v_direct, Inf) = 0.0017762754968373429
norm(v - v_direct, Inf) = 9.052314453583676e-12

Krylov Methods

Krylov Subspaces

  • Krylov methods are a class of iterative methods that use a sequence of subspaces
  • The subspaces are generated by repeated matrix-vector products
    • i.e., given \(A\) and an initial vector \(b\), we can generate the sequence
    • \(b, A b, A^2 b, \ldots, A^k b\) and search for approximate solutions in the space they span (see the sketch below)
  • Note that the only operation we require from our linear operator \(A\) is the matrix-vector product. This is a huge advantage for large problems
  • e.g. the Conjugate Gradient method is a Krylov method for positive definite \(A\)
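  • A minimal sketch (with a made-up \(2 \times 2\) example) of the vectors spanning the Krylov subspace
A_krylov = [2.0 1.0; 1.0 3.0]
b_krylov = [1.0, 0.0]
k = 3
# columns are b, A b, A^2 b; Krylov methods search for approximate solutions in their span
K = hcat([A_krylov^j * b_krylov for j in 0:(k - 1)]...)
@show K;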

Conjugate Gradient

  • Solving this system with the conjugate gradient method
  • Using the matrix here, but we could implement \(A\) as a matrix-free function
N = 100
A = sprand(100, 100, 0.1)
A = A * A'  # easy posdef
b = rand(N)
x_direct = A \ b
@show cond(Matrix(A * A'))
x = zeros(N)
sol = cg!(x, A, b, log = true, maxiter = 1000)
sol[end]
cond(Matrix(A * A')) = 1.0375698828984257e12
Converged after 183 iterations.

Iterative Methods for LLS

  • LSMR is one of several Krylov methods for solving LLS

\[ \min_{\beta} \| X \beta -y \|^2 + \alpha \| \beta\|^2 \]

  • Where \(\alpha \geq 0\). If \(\alpha = 0\) then it delivers the ridgeless regression limit (the minimum-norm solution), even if the system is underdetermined

LSMR Example

M = 1000
N = 10000
sigma = 0.1
beta = rand(M)
# simulate data
X = sprand(N, M, 0.1)
y = X * beta + sigma * randn(N)
beta_direct = X \ y
results = lsmr(X, y, log = true)
beta_lsmr = results[1]
@show norm(beta_direct - beta_lsmr)
println("$(results[end])")
norm(beta_direct - beta_lsmr) = 1.0615184241157149e-5
Converged after 15 iterations.

Matrix-Free LLS

  • To solve LLS problems, we need \(X u\) and \(X^T v\) products
  • We can provide those functions directly (cheating here by just using the matrix itself)
# matrix-vector product
X_func(u) = X * u

# adjoint-vector product
X_T_func(v) = X' * v

X_map = LinearMap(X_func, X_T_func, N, M)
results = lsmr(X_map, y, log = true)
println("$(results[end])")
Converged after 15 iterations.

Eigenvalue Problems

  • Variation on the CTMC example: \(a > 0\) is the rate of gaining (moving up a state), \(b > 0\) the rate of losing (moving down)
N = 4
a = 0.1
b = 0.05
Q = Tridiagonal(fill(b, N-1),
                [-a; fill(-(a + b), N-2); -b],
                fill(a, N-1))
# Find smallest magnitude eigenvalue (i.e. 0)
lambda, phi = eigs( Q', nev = 1, which = :SM, maxiter = 1000)
phi = real(phi) ./ sum(real(phi))
@show lambda
@show mean(phi);
Q'
lambda = ComplexF64[-9.270990685613062e-18 + 0.0im]
mean(phi) = 0.25
4×4 Tridiagonal{Float64, Vector{Float64}}:
 -0.1   0.05    ⋅      ⋅ 
  0.1  -0.15   0.05    ⋅ 
   ⋅    0.1   -0.15   0.05
   ⋅     ⋅     0.1   -0.05

Implementing Matrix-Free Operator for Adjoint

function Q_adj_product(x)
    Q_x = zero(x)
    Q_x[1] = -a * x[1] + b * x[2]
    for i = 2:(N-1)
        Q_x[i] = a * x[i-1] - (a + b) * x[i] + b * x[i+1]
    end
    Q_x[N] = a * x[N-1] - b * x[N]
    return Q_x
end
x_check = rand(N)
norm(Q_adj_product(x_check) - Q' * x_check)
0.0

Solving with a Wrapper for the Matrix-Free Operator

  • The LinearMap wrapper adds features required by the algorithms (e.g. size(Q_adj_map) and Q_adj_map * v overloads)
Q_adj_map = LinearMap(Q_adj_product, N)
# Get smallest magnitude only using the Q'(x) map
lambda, phi = eigs(Q_adj_map, nev = 1, which = :SM, maxiter = 1000)
phi = real(phi) ./ sum(real(phi))
@show lambda
@show mean(phi);
lambda = ComplexF64[1.6525994004046e-17 + 0.0im]
mean(phi) = 0.25

Preconditioning

Changing the Geometry

  • In practice, most Krylov methods are preconditioned, or else direct methods usually dominate. The same holds for large nonlinear systems
  • As discussed, the key issue for the convergence speed of iterative methods is the geometry (e.g. the condition number of the Hessian)
  • Preconditioning changes the geometry, e.g. making level sets more circular or, for eigenvalue problems, spreading out the eigenvalues of interest
  • Choosing a preconditioner for a matrix \(A\) requires art and tradeoffs
    • Want it to be relatively cheap to calculate, and it must be invertible
    • Want to have \(\text{cond}(P A) \ll \text{cond}(A)\) (see the sketch below)
  • The ideal preconditioner for \(A x = b\) is \(P=A^{-1}\) since \(A^{-1} A x = x = A^{-1} b\)
    • \(\text{cond}(A^{-1}A)=1\)! But that is equivalent to solving the original problem
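  • A minimal sketch with an assumed badly scaled matrix: even a diagonal \(P\) can reduce the condition number substantially when rows have very different scales
N_pre = 100
scales = 10.0 .^ range(0, 4, length = N_pre)            # row scales spanning four orders of magnitude
A_scaled = Diagonal(scales) * (sprand(N_pre, N_pre, 0.1) + I)
P = Diagonal(A_scaled)                                  # cheap diagonal preconditioner
@show cond(Matrix(A_scaled))
@show cond(P \ Matrix(A_scaled));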

Right-Preconditioning a Linear System

\[ \begin{aligned} A x &= b\\ A P^{-1} P x &= b\\ A P^{-1} y &= b\\ P x &= y \end{aligned} \] That is, solve \((A P^{-1})y = b\) for \(y\), and then solve \(P x = y\) for \(x\).
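A minimal sketch of these two steps, with an assumed sparse system and a diagonal \(P\) (all names here are illustrative); GMRES is used because \(A P^{-1}\) need not stay symmetric.
N_rp = 100
A_rp = sprand(N_rp, N_rp, 0.1) + 10.0 * I      # comfortably nonsingular sparse matrix
b_rp = rand(N_rp)
P = Diagonal(A_rp)                              # a cheap, invertible preconditioner
AP_inv = LinearMap(y -> A_rp * (P \ y), N_rp)   # matrix-free operator acting as A P^{-1}
y_rp = gmres(AP_inv, b_rp, maxiter = 1000)      # solve (A P^{-1}) y = b for y
x_rp = P \ y_rp                                 # then solve P x = y for x
@show norm(A_rp * x_rp - b_rp);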

Raw Conjugate Gradient

N = 200
A = sprand(N, N, 0.1)   # 10 percent non-zeros
A = A * A'
b = rand(N)
@show cond(Matrix(A))
sol = cg(A, b, log = true, maxiter = 1000)
sol[end]
cond(Matrix(A)) = 972303.1119375983
Converged after 389 iterations.

Diagonal Preconditioner

  • A simple preconditioner is the diagonal of \(A\)
  • This is cheap to calculate, and is invertible if the diagonal has no zeros
  • For some problems this has a huge impact on convergence/condition, for others it does nothing
P = DiagonalPreconditioner(A)
sol = cg(A, b; Pl = P, log = true, maxiter = 1000)
sol[end]
Converged after 367 iterations.

Incomplete LU or Cholesky

  • Compute an approximate (incomplete) LU or Cholesky factorization, dropping small entries (here controlled by the drop tolerance \(\tau\))
  • Not the full inverse, but can make a big difference
P = ilu(A, τ = 0.1)
sol = cg(A, b; Pl = P, log = true, maxiter = 1000)
sol[end]
Converged after 151 iterations.

Others

  • In the examples above we aren’t getting huge gains, but the random matrix also lacks structure
  • If you have problems with multiple scales, as might come from discretizing multiple dimensions in a state space, see multigrid methods
    • An Algebraic Multigrid (AMG) preconditioner is often useful even without multiple scales (see the sketch below)
  • Other preconditioners include ones intended for graph Laplacians, such as approximate Cholesky decompositions and combinatorial multigrid preconditioners
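  • A minimal sketch of an AMG preconditioner for the sparse posdef A above, assuming the AlgebraicMultigrid.jl package is available; the random matrix lacks the structure where AMG shines, so gains may be modest
using AlgebraicMultigrid
ml = ruge_stuben(A)            # build an AMG hierarchy from the sparse matrix
P = aspreconditioner(ml)
sol = cg(A, b; Pl = P, log = true, maxiter = 1000)
sol[end]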