ECON622: Computational Economics with Data Science Applications

Direct Methods and Matrix Factorizations

Jesse Perla

University of British Columbia

Overview

Motivation

  • In preparation for the ML lectures we cover some core numerical linear algebra concepts
  • Many of these are directly useful
    • e.g. solving large systems of equations or as building blocks in bigger algorithms

Packages and Materials

using LinearAlgebra, Statistics, BenchmarkTools, SparseArrays, Random
using Plots
Random.seed!(42);  # seed random numbers for reproducibility

Complexity

Basic Computational Complexity

Big-O Notation

For a function \(f(N)\), we say \(f(N)\) is \(O(g(N))\) if there exist positive constants \(C\) and \(N_0\) such that:

\[ 0 \leq f(N) \leq C \cdot g(N) \quad \text{for all } N \geq N_0 \]

  • Often crucial to know how problems scale asymptotically (as \(N\to\infty\))
  • Caution! This is only an asymptotic limit, and can be misleading for small \(N\)
    • \(f_1(N) = N^3 + N\) is \(O(N^3)\)
    • \(f_2(N) = 1000 N^2 + 3 N\) is \(O(N^2)\)
    • For roughly \(N > 1000\) the \(f_2\) algorithm is faster; for smaller \(N\), use \(f_1\)
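
As a quick check of the crossover, a minimal sketch that simply evaluates the two operation counts (the exact crossover is near \(N = 1000\)):

f_1(N) = N^3 + N
f_2(N) = 1000 * N^2 + 3 * N
# compare the two counts on either side of the crossover
[(N, f_1(N), f_2(N)) for N in (10, 100, 1000, 10_000)]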

Examples of Computational Complexity

  • Simple examples:
    • \(x \cdot y = \sum_{n=1}^N x_n y_n\) is \(O(N)\) since it requires \(N\) multiplications and additions
    • \(A x\) for \(A\in\mathbb{R}^{N\times N},x\in\mathbb{R}^N\) is \(O(N^2)\) since it requires \(N\) dot products, each \(O(N)\)
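
A minimal sketch of checking these scalings empirically (timings are machine-dependent; doubling \(N\) should roughly double the dot-product time and quadruple the matrix-vector time):

using LinearAlgebra, BenchmarkTools  # already loaded in the setup block above
N = 1000
x, y, A = rand(N), rand(N), rand(N, N)
@btime dot($x, $y)  # O(N): N multiplications and additions
@btime $A * $x      # O(N^2): N dot products, each O(N)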

Computational Complexity

Ask yourself whether each of the following operations is computationally expensive as the matrix size increases

  • Multiplying two matrices?
    • Answer: It depends. Multiplying two diagonal matrices is trivial.
  • Solving a linear system of equations?
    • Answer: It depends. If the matrix is the identity, the solution is the vector itself.
  • Finding the eigenvalues of a matrix?
    • Answer: It depends. The eigenvalues of a triangular matrix are the diagonal elements.
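
For instance, a minimal check of the triangular case:

L = LowerTriangular([2.0 0.0 0.0;
                     1.0 3.0 0.0;
                     4.0 5.0 6.0])
eigvals(L)  # [2.0, 3.0, 6.0], just the diagonal entries
diag(L)     # same values, with no eigenvalue computation needed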

Numerical Precision

Machine Epsilon

For a given datatype, \(\epsilon\) is defined as \(\epsilon = \min_{\delta > 0} \left\{ \delta : 1 + \delta > 1 \right\}\)

  • Computers have finite precision: 64-bit floats are typical, but 32-bit is common on GPUs
machine epsilon for float64 = 2.220446049250313e-16
1 + eps/2 == 1? true
machine epsilon for float32 = 1.1920929e-7
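
The output above can be reproduced with the built-in eps function; a minimal sketch:

println("machine epsilon for float64 = ", eps(Float64))
println("1 + eps/2 == 1? ", 1.0 + eps(Float64) / 2 == 1.0)  # true: the sum rounds back to 1.0
println("machine epsilon for float32 = ", eps(Float32))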

Matrix Structure

Matrix Structure

  • A key principle is to ensure you don’t lose “structure”
    • e.g. if sparse, operations should keep it sparse if possible
    • If triangular, then use appropriate algorithms instead of converting back to a dense matrix
  • Key structure is:
    • Symmetry, diagonal, tridiagonal, banded, sparse, positive-definite
  • The worst operations for losing structure are matrix multiplication and inversion

Example Losing Sparsity

  • Here the density increases substantially
A = sprand(10, 10, 0.45)  # random sparse 10x10, 45 percent filled with non-zeros

@show nnz(A)  # counts the number of non-zeros
invA = sparse(inv(Array(A)))  # Julia won't invert sparse, so convert to dense with Array.
@show nnz(invA);
nnz(A) = 46
nnz(invA) = 100

Losing Tridiagonal Structure

  • An even more extreme example: a tridiagonal matrix has roughly \(3N\) nonzeros, while its inverse is dense with \(N^2\) entries
N = 5
A = Tridiagonal([fill(0.1, N - 2); 0.2], fill(0.8, N), [0.2; fill(0.1, N - 2)])
inv(A)
5×5 Matrix{Float64}:
  1.29099      -0.327957     0.0416667  -0.00537634   0.000672043
 -0.163978      1.31183     -0.166667    0.0215054   -0.00268817
  0.0208333    -0.166667     1.29167    -0.166667     0.0208333
 -0.00268817    0.0215054   -0.166667    1.31183     -0.163978
  0.000672043  -0.00537634   0.0416667  -0.327957     1.29099

Forming the Covariance and/or Gram Matrix

  • Another common example is \(A' A\)
A = sprand(20, 21, 0.3)
@show nnz(A) / 20^2
@show nnz(A' * A) / 21^2;
nnz(A) / 20 ^ 2 = 0.34
nnz(A' * A) / 21 ^ 2 = 0.9229024943310657

Specialized Algorithms

Besides sparsity/storage, the real loss is missing out on specialized algorithms. For example, let's set up the benchmarking code

using BenchmarkTools
function benchmark_solve(A, b)
    println("A\\b for typeof(A) = $(string(typeof(A)))")
    @btime $A \ $b
end
benchmark_solve (generic function with 1 method)

Compare Dense vs. Sparse vs. Tridiagonal

N = 1000
b = rand(N)
A = Tridiagonal([fill(0.1, N - 2); 0.2], fill(0.8, N), [0.2; fill(0.1, N - 2)])
A_sparse = sparse(A)  # sparse but losing tridiagonal structure
A_dense = Array(A)    # dropping the sparsity structure, dense 1000x1000

# benchmark solution to system A x = b
benchmark_solve(A, b)
benchmark_solve(A_sparse, b)
benchmark_solve(A_dense, b);
A\b for typeof(A) = Tridiagonal{Float64, Vector{Float64}}
  16.600 μs (8 allocations: 47.72 KiB)
A\b for typeof(A) = SparseMatrixCSC{Float64, Int64}
  439.200 μs (80 allocations: 1.03 MiB)
A\b for typeof(A) = Matrix{Float64}
  14.231 ms (4 allocations: 7.64 MiB)

Triangular Matrices and Back/Forward Substitution

  • A key example of a better algorithm is for triangular matrices
  • Systems with upper or lower triangular matrices can be solved in \(O(N^2)\) instead of \(O(N^3)\)
b = [1.0, 2.0, 3.0]
U = UpperTriangular([1.0 2.0 3.0;
                     0.0 5.0 6.0;
                     0.0 0.0 9.0])
U \ b
3-element Vector{Float64}:
 0.0
 0.0
 0.3333333333333333

Backwards Substitution Example

\[ \begin{aligned} U x &= b\\ U &\equiv \begin{bmatrix} 3 & 1 \\ 0 & 2 \\ \end{bmatrix}, \quad b = \begin{bmatrix} 7 \\ 2 \\ \end{bmatrix} \end{aligned} \]

Solving bottom row for \(x_2\)

\[ 2 x_2 = 2,\quad x_2 = 1 \]

Move up a row, solving for \(x_1\), substituting for \(x_2\)

\[ 3 x_1 + 1 x_2 = 7,\quad 3 x_1 + 1 \times 1 = 7,\quad x_1 = 2 \]

This generalizes to any number of rows. For lower triangular \(L\) it is “forward substitution”
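
A minimal sketch of backward substitution (two nested loops, hence the \(O(N^2)\) count); the function name backward_substitution is just for illustration:

function backward_substitution(U, b)
    N = length(b)
    x = zeros(N)
    for i in N:-1:1       # start at the bottom row and move up
        s = 0.0
        for j in (i+1):N  # substitute the components already solved
            s += U[i, j] * x[j]
        end
        x[i] = (b[i] - s) / U[i, i]
    end
    return x
end
backward_substitution([3.0 1.0; 0.0 2.0], [7.0, 2.0])  # [2.0, 1.0], matching the example above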

Factorizations

Factorizing matrices

  • Just like you can factor \(6 = 2 \cdot 3\), you can factor matrices
  • Unlike integers, you have more choice over the properties of the factors
  • Many operations (e.g., solving systems of equations, finding eigenvalues, inverting, finding determinants) compute a factorization internally
    • Instead, you can often compute the factorization once yourself and reuse it
  • Key factorizations: LU, QR, Cholesky, SVD, Schur, Eigenvalue

LU(P) Decompositions

  • We can “factor” any square \(A\) into \(P A = L U\) for lower triangular \(L\), upper triangular \(U\), and a permutation matrix \(P\). Invertible matrices can have \(A = L U\), called the LU decomposition. The “P” is for partial pivoting
  • Singular matrices may not have full-rank \(L\) or \(U\) matrices
N = 4
A = rand(N, N)
b = rand(N)
# chooses the right factorization based on matrix structure
# LU here
Af = factorize(A)
Af.P * A ≈ Af.L * Af.U
true

Using a Factorization

  • In Julia, factorization objects typically overload \ and functions such as inv
@show Af \ b
@show inv(Af) * b
Af \ b = [1.567631093835083, -1.8670423474177864, -0.7020922312927874, 1.0653095651070625]
inv(Af) * b = [1.5676310938350828, -1.8670423474177873, -0.7020922312927873, 1.0653095651070625]
4-element Vector{Float64}:
  1.5676310938350828
 -1.8670423474177873
 -0.7020922312927873
  1.0653095651070625

LU Decompositions and Systems of Equations

  • Pivoting is typically implied when talking about “LU”
  • Used in the default solve algorithm (without more structure)
  • Solving systems of equations with triangular matrices: for \(A x = L U x = b\)
    1. Define \(y = U x\)
    2. Solve \(L y = b\) for \(y\) and \(U x = y\) for \(x\)
  • Since both solves are triangular, the process is \(O(N^2)\) (though computing the LU factorization itself is \(O(N^3)\)); see the sketch after this list
  • Could be used to find inv
    • If \(A = L U\), then \(A A^{-1} = L U A^{-1} = I\)
    • Solve for \(Y\) in \(L Y = I\), then solve \(U A^{-1} = Y\)
  • Tight connection to textbook Gaussian elimination (including pivoting)
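
A minimal sketch of this two-step solve, reusing the A and b from the earlier slide (wrapping the factors in LowerTriangular/UpperTriangular dispatches to the \(O(N^2)\) substitution algorithms):

F = lu(A)                             # P A = L U, with partial pivoting
y = LowerTriangular(F.L) \ (F.P * b)  # forward substitution: solve L y = P b
x = UpperTriangular(F.U) \ y          # backward substitution: solve U x = y
x ≈ A \ b                             # true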

Cholesky

  • LU is for general invertible matrices, but it doesn’t use positive-definiteness or symmetry
  • The Cholesky is the right factorization for symmetric positive-definite matrices. For general symmetric matrices you can use Bunch-Kaufman or others
  • \(A = L L'\) for lower triangular \(L\), or equivalently \(A = U' U\) for upper triangular \(U\)
N = 500
B = rand(N, N)
A_dense = B' * B  # an easy way to generate a symmetric positive semi-definite matrix
A = Symmetric(A_dense)  # flags the matrix as symmetric
println("A is symmetric? $(issymmetric(A))")
A is symmetric? true
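
A quick check that the factorization recovers \(A = L L'\) (a minimal sketch; A here is almost surely positive definite since the random B has full rank with probability one):

C = cholesky(A)       # succeeds because A is symmetric positive definite
norm(C.L * C.L' - A)  # ≈ 0 up to roundoff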

Comparing Cholesky

  • By default Julia doesn’t know the matrix is positive definite, so factorize can only exploit the symmetric structure; calling cholesky directly does better
b = rand(N)
factorize(A) |> typeof
cholesky(A) \ b  # use the factorization to solve

benchmark_solve(A, b)
benchmark_solve(A_dense, b)
@btime cholesky($A, check = false) \ $b;
A\b for typeof(A) = Symmetric{Float64, Matrix{Float64}}
  4.027 ms (8 allocations: 2.16 MiB)
A\b for typeof(A) = Matrix{Float64}
  8.041 ms (4 allocations: 1.92 MiB)
  3.510 ms (3 allocations: 1.91 MiB)

Eigen Decomposition

  • For a square, diagonalizable matrix \(A\) (e.g., any symmetric matrix) we can factor it into

\[ A = Q \Lambda Q^{-1} \]

  • \(Q\) is a matrix of eigenvectors, \(\Lambda\) is a diagonal matrix of the corresponding eigenvalues
  • For symmetric matrices, the eigenvectors are orthogonal, \(Q^{-1} = Q^T\) (so \(Q^T Q = I\)), and they form an orthonormal basis
  • Orthogonal matrices can be thought of as rotations without stretching
  • More general matrices all have a Singular Value Decomposition (SVD)
  • With symmetric \(A\), an interpretation of \(A x\) is that we can first rotate \(x\) into the \(Q\) basis, then stretch by \(\Lambda\), then rotate back

Calculating the Eigen Decomposition

A = Symmetric(rand(5, 5))  # symmetric matrices have real eigenvectors/eigenvalues
A_eig = eigen(A)
Λ = Diagonal(A_eig.values)
Q = A_eig.vectors
@show norm(Q * Λ * inv(Q) - A)
@show norm(Q * Λ * Q' - A);
norm(Q * Λ * inv(Q) - A) = 3.99347661787232e-15
norm(Q * Λ * Q' - A) = 2.6225347327349657e-15

Eigendecompositions and Matrix Powers

  • Can be used to find \(A^t\) for large \(t\) (e.g. for Markov chains)
    • \(P^t\), i.e. \(P \cdot P \cdot \ldots \cdot P\) for \(t\) times
    • If \(P = Q \Lambda Q^{-1}\) then \(P^t = Q \Lambda^t Q^{-1}\), where \(\Lambda^t\) is just the elementwise power of the diagonal (see the sketch below)
  • Related can find matrix exponential \(e^A\) for square matrices
    • \(e^A = Q e^\Lambda Q^{-1}\) where \(e^\Lambda\) is just the pointwise exponential
    • Useful for solving differential equations, e.g. \(y' = A y\) for \(y(0) = y_0\) is \(y(t) = e^{A t} y_0\)
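
A minimal sketch of both identities, using a small symmetric transition matrix so that \(Q^{-1} = Q'\) (for non-symmetric \(P\) use inv(Q) instead):

P = Symmetric([0.9 0.1; 0.1 0.9])  # a symmetric stochastic matrix
t = 50
E = eigen(P)
Q, Λ = E.vectors, Diagonal(E.values)
Q * Λ^t * Q' ≈ P^t                          # true: Λ^t is the elementwise power of the diagonal
Q * Diagonal(exp.(E.values)) * Q' ≈ exp(P)  # matches the built-in matrix exponential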

More on Factorizations

  • Plenty more used in different circumstances. Start by looking at structure
  • Usually have some connection to textbook algorithms; for example, LU is Gaussian elimination with pivoting and QR is the Gram-Schmidt process
  • Just as shortcuts can be done with sparse matrices in textbook examples, direct sparse methods can be faster given enough sparsity
    • But don’t assume sparsity will be faster. Often slower unless matrices are big and especially sparse
    • Dense algorithms on GPUs can be very fast because of parallelism
  • Keep in mind that barring numerical roundoff issues, these are “exact” methods. They don’t become more accurate with more iterations

Large Scale Systems of Equations

  • Packages that solve BIG problems with “direct methods” include MUMPS, Pardiso, UMFPACK, and many others
  • Sparse solvers are bread-and-butter scientific computing, so they can crush huge problems, parallelize on a cluster, etc.
  • But for smaller problems they may not be ideal. Profile and test, and use them only if you need them
  • In Julia, the SciML package LinearSolve.jl is your best bet there, as it lets you swap out backends to profile
  • In Python it is harder to switch between backends, but SciPy has many built in and many wrappers exist. The same goes for MATLAB

Preview of Conditioning

  • It will turn out that for iterative methods (a different style of algorithm) it is often necessary to multiply by a matrix to transform the problem
  • The ideal transform would be the matrix’s inverse, which requires a full factorization.
  • Instead, you can go only part of the way towards the factorization, e.g., partway through Gaussian elimination
  • These are called “Incomplete Cholesky”, “Incomplete LU”, etc.

Continuous Time Markov Chains

Markov Chain Transitions in Continuous Time

  • With a discrete set of states, we cannot have instantaneous transitions between states or the process ceases to be measurable
  • Instead, define the intensity of switching from state \(i\) to \(j\) as \(q_{ij}\), where

\[ \mathbb P \{ X(t + \Delta) = j \,|\, X(t) = i \} = \begin{cases} q_{ij} \Delta + o(\Delta) & i \neq j\\ 1 + q_{ii} \Delta + o(\Delta) & i = j \end{cases} \]

  • Here \(o(\Delta)\) is little-o notation, i.e. \(\lim_{\Delta\to 0} o(\Delta)/\Delta = 0\).

Intensity Matrix

  • \(Q_{ij} = q_{ij}\) for \(i \neq j\) and \(Q_{ii} = -\sum_{j \neq i} q_{ij}\)
  • Rows sum to 0
  • For example, consider a counting process

\[ Q = \begin{bmatrix} -0.1 & 0.1 & 0 & 0 & 0 & 0\\ 0.1 &-0.2 & 0.1 & 0 & 0 & 0\\ 0 & 0.1 & -0.2 & 0.1 & 0 & 0\\ 0 & 0 & 0.1 & -0.2 & 0.1 & 0\\ 0 & 0 & 0 & 0.1 & -0.2 & 0.1\\ 0 & 0 & 0 & 0 & 0.1 & -0.1\\ \end{bmatrix} \]

Probability Dynamics

  • The matrix \(Q\) is the infinitesimal generator of the stochastic process.

  • Let \(\pi(t) \in \mathbb{R}^N\) with \(\pi_i(t) \equiv \mathbb{P}[X(t) = i\,|\,X(0)]\)

  • Then the evolution of the probability distribution (the Fokker-Planck equation, or KFE), treating \(\pi(t)\) as a row vector, is \[ \frac{d}{dt} \pi(t) = \pi(t) Q,\quad \text{ given }\pi(0) \]

  • Or, often written as \(\frac{d}{dt} \pi(t) = Q^{\top} \cdot \pi(t)\), i.e. in terms of the “adjoint” of the linear operator \(Q\)

  • A steady state is then a solution to \(Q^{\top} \cdot \bar{\pi} = 0\)

    • i.e., \(\bar{\pi}\) is the left eigenvector of \(Q\) associated with eigenvalue 0 (i.e. \(\bar{\pi} Q = 0 \times \bar{\pi}\))

Setting up a Counting Process

alpha = 0.1
N = 6
Q = Tridiagonal(fill(alpha, N - 1),
                [-alpha; fill(-2*alpha, N - 2); -alpha],
                fill(alpha, N - 1))
Q
6×6 Tridiagonal{Float64, Vector{Float64}}:
 -0.1   0.1    ⋅     ⋅     ⋅     ⋅ 
  0.1  -0.2   0.1    ⋅     ⋅     ⋅ 
   ⋅    0.1  -0.2   0.1    ⋅     ⋅ 
   ⋅     ⋅    0.1  -0.2   0.1    ⋅ 
   ⋅     ⋅     ⋅    0.1  -0.2   0.1
   ⋅     ⋅     ⋅     ⋅    0.1  -0.1

Finding the Stationary Distribution

  • There will always be at least one eigenvalue of 0, and the corresponding eigenvector (normalized to sum to one) is the stationary distribution
  • Teaser: Do we really need all of the eigenvectors/eigenvalues?
Lambda, vecs = eigen(Array(Q'))
@show Lambda
vecs[:, N] ./ sum(vecs[:, N])
Lambda = [-0.3732050807568874, -0.29999999999999993, -0.19999999999999998, -0.09999999999999995, -0.026794919243112274, 0.0]
6-element Vector{Float64}:
 0.16666666666666657
 0.16666666666666657
 0.1666666666666667
 0.16666666666666682
 0.16666666666666685
 0.16666666666666663

Using the Generator in a Bellman Equation

  • Let \(r \in \mathbb{R}^N\) be a vector of payoffs in each state, and \(\rho > 0\) a discount rate,
  • Then we can use the generator \(Q\) in a simple Bellman equation (via the Kolmogorov backward equation) to find the value \(v\) in each state,

\[ \rho v = r + Q v \]

  • Rearranging, \((\rho I - Q) v = r\)
  • Teaser: can we just implement \((\rho I - Q)\cdot v\) and avoid factorizing the matrix?

Implementing the Bellman Equation

rho = 0.05
r = range(0.0, 10.0, length=N)
@show typeof(rho*I - Q)

# solve (rho * I - Q) v = r
v = (rho * I - Q) \ r
typeof(rho * I - Q) = Tridiagonal{Float64, Vector{Float64}}
6-element Vector{Float64}:
  38.15384615384615
  57.23076923076923
  84.92307692307693
 115.07692307692311
 142.76923076923077
 161.84615384615384
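
Returning to the teaser: iterative methods only need the action \(v \mapsto (\rho I - Q) v\), which we can apply directly without forming or factorizing \(\rho I - Q\). A minimal sketch (the helper name bellman_operator is just for illustration):

bellman_operator(v) = rho .* v .- Q * v  # applies (rho * I - Q) to v
bellman_operator(v) ≈ r                  # true: the v solved above satisfies (rho * I - Q) v = r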