Automatic Differentiation

Paul Schrimpf

2026-03-11

Introduction

Derivatives

  • Needed for efficient equation solving and optimization
  • Can calculate automatically

Finite Differences

f(x) = sin(x)/(1.0+exp(x))

function dxfin(f,x)
  h = sqrt(eps(x))
  if abs(x) > 1
    h = h*abs(x)
  end
  (f(x+h) - f(x))/h
end

dxfin(f, 2.0)
-0.1450763155960872
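The one-sided difference above has O(h) truncation error; a central difference gets O(h²) at the cost of one extra function evaluation. A minimal sketch (dxcentral is a made-up name, not from the slides):

```julia
# Central difference: O(h^2) truncation error vs O(h) for the forward
# difference above. cbrt(eps()) roughly balances truncation and rounding error.
function dxcentral(f, x)
  h = cbrt(eps(x))
  if abs(x) > 1
    h = h*abs(x)
  end
  (f(x+h) - f(x-h)) / (2h)
end

f(x) = sin(x)/(1.0+exp(x))
dxcentral(f, 2.0)  # ≈ -0.1450763159, closer to the true derivative
```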

Forward Automatic Differentiation

module Forward

struct Dual{T}
  v::T
  dv::T
end

Dual(x::T) where {T} = Dual(x, one(x))

import Base: +, sin, exp, *, /

function (+)(a::T, x::Dual{T}) where {T}
  Dual(a+x.v, x.dv)
end

function (*)(y::Dual, x::Dual)
  Dual(y.v*x.v, x.v*y.dv + x.dv*y.v)
end

function (/)(x::Dual, y::Dual)
  Dual(x.v/y.v, x.dv/y.v - x.v*y.dv/y.v^2)
end

exp(x::Dual) = Dual(exp(x.v), exp(x.v)*x.dv)
sin(x::Dual) = Dual(sin(x.v), cos(x.v)*x.dv)


function fdx(f,x)
  out=f(Dual(x))
  (out.v, out.dv)
end

end

Forward.fdx(f,2.0)
(0.10839091026481387, -0.14507631594729084)

Reverse Automatic Differentiation

  • compute \(f(x)\) in usual forward direction, keep track of each operation and intermediate value
  • compute derivative “backwards”
    • \(f(x) = g(h(x))\)
    • \(f'(x) = g'(h(x)) h'(x)\)
  • scales better for high dimensional \(x\)
  • implementation more complicated
    • Simple-ish example https://simeonschaub.github.io/ReverseModePluto/notebook.html
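To make the backward pass concrete, here is a toy tape (illustration only; Node and backprop! are made-up names, and real packages are much more sophisticated):

```julia
# Toy reverse mode: each Node stores its value, an accumulator for df/dnode,
# and (parent, ∂node/∂parent) pairs recorded during the forward pass.
mutable struct Node
  v::Float64
  grad::Float64
  parents::Vector{Tuple{Node,Float64}}
end
Node(v) = Node(v, 0.0, Tuple{Node,Float64}[])

Base.:*(a::Node, b::Node) = Node(a.v*b.v, 0.0, [(a, b.v), (b, a.v)])
Base.sin(a::Node) = Node(sin(a.v), 0.0, [(a, cos(a.v))])

# Walk backwards from the output, applying the chain rule at each edge.
# (Correct here because the only shared node, x, is a leaf; a real tape
# would process nodes in reverse topological order.)
function backprop!(out::Node)
  out.grad = 1.0
  stack = [out]
  while !isempty(stack)
    n = pop!(stack)
    for (p, ∂) in n.parents
      p.grad += n.grad * ∂
      push!(stack, p)
    end
  end
end

x = Node(2.0)
y = sin(x)*x        # f(x) = x sin(x)
backprop!(y)
x.grad              # sin(2) + 2cos(2) ≈ 0.0770
```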

Julia AD Packages

ForwardDiff

ForwardDiff Example

using Distributions, BenchmarkTools
function simulate_logit(observations, β)
  x = randn(observations, length(β))
  y = (x*β + rand(Logistic(), observations)) .>= 0.0
  return((y=y,x=x))
end

function logit_likelihood(β,y,x)
  p = map(xb -> cdf(Logistic(),xb), x*β)
  sum(log.(ifelse.(y, p, 1.0 .- p)))
end

n = 500
k = 3
β0 = ones(k)
(y,x) = simulate_logit(n,β0)

import ForwardDiff
∇L = @btime ForwardDiff.gradient(b->logit_likelihood(b,y,x),β0)
  7.116 μs (18 allocations: 47.60 KiB)
3-element Vector{Float64}:
 -4.544200516511661
  0.8780749182470913
  6.355796397761692

ForwardDiff Notes

  • For \(f: \mathbb{R}^n \to \mathbb{R}^m\), the computation scales with \(n\)
    • best for moderate \(n\)
    • Throughout the slides, take the timings with a grain of salt: they depend on dimensions and on the implementation of the code being differentiated
  • Code must be generic
    • be careful when allocating arrays
function wontwork(x)
  y = zeros(size(x))
  for i ∈ eachindex(x)
    y[i] += x[i]*i
  end
  return(sum(y))
end

function willwork(x)
  y = zero(x)
  for i ∈ eachindex(x)
    y[i] += x[i]*i
  end
  return(sum(y))
end

betterstyle(x) = sum(v*i for (i,v) in enumerate(x))

Reverse Mode AD

  • Reverse mode AD is a much harder problem
  • Various systems tradeoff complexity, efficiency, and scope of code compatibility

Enzyme

“Enzyme performs automatic differentiation (AD) of statically analyzable LLVM. It is highly-efficient and its ability to perform AD on optimized code allows Enzyme to meet or exceed the performance of state-of-the-art AD tools.”

import Enzyme
import Enzyme: Active, Duplicated, Const, BatchDuplicated

db = zero(β0)
@btime Enzyme.autodiff(Enzyme.ReverseWithPrimal, logit_likelihood, Active, Duplicated(β0,db), Const(y), Const(x))
db
  10.921 μs (20 allocations: 24.06 KiB)
3-element Vector{Float64}:
     -1.4048486780873116e6
 271458.61712383264
      1.9649071679487631e6
using LinearAlgebra
v = Tuple(Float64.(I(length(β0)))[:,j] for j in 1:length(β0))
@btime dbf = Enzyme.autodiff(Enzyme.ForwardWithPrimal, logit_likelihood, BatchDuplicated(β0,v), Const(y), Const(x))
# or
@btime dbf = Enzyme.gradient(Enzyme.ForwardWithPrimal, logit_likelihood, β0, Const(y), Const(x))
  6.364 μs (36 allocations: 48.09 KiB)
  6.830 μs (55 allocations: 48.70 KiB)
(derivs = ([-4.5442005165116655, 0.8780749182470942, 6.35579639776169], nothing, nothing), val = -248.8578363670095)

Enzyme Notes

  • Documentation is not suited to beginners
  • Does not work on all Julia code, but cases where it fails are not well documented. Calling Enzyme.API.runtimeActivity!(true) works around some errors.
  • Cryptic error messages. Enzyme operates on LLVM IR, and error messages often reference the point in the LLVM IR where the error occurred. Figuring out what Julia code the LLVM IR corresponds to is not easy.
  • See Enzyme.jl FAQ if getting strange errors

Above three downsides may be better now than two years ago when I first wrote this slide

  • Working off of LLVM IR means it depends on internals of Julia’s compiler
    • Can break with new Julia versions and take weeks to months to fix

Enzyme + Reactant

  • Reactant.jl compiles Julia to MLIR
    • useful for getting fast code for GPU or TPU
  • Enzyme can act on MLIR instead of LLVM IR
  • Sometimes Reactant + Enzyme is updated more quickly than Enzyme directly on Julia
  • Some limitations in Julia code supported, especially with dynamic control flow

Reactant

import Reactant
using Test
xr = Reactant.to_rarray(x)
yr = Reactant.to_rarray(Int.(y))
βr = Reactant.to_rarray(β0)

function logit_likelihood_r(β,y,x)
    p = map(xb -> 1/(1+exp(-xb)), x*β)
    sum(x->log(x[1]^x[2]*(1.0 - x[1])^(1-x[2])), zip(p,y))
end
@test logit_likelihood_r(β0, y, x) ≈ logit_likelihood(β0, y, x)

∇llreact = Reactant.@compile Enzyme.gradient(Enzyme.ReverseWithPrimal,logit_likelihood_r, βr, Const(yr), Const(xr))
@btime ∇llreact(Enzyme.ReverseWithPrimal, logit_likelihood_r, βr, Const(yr), Const(xr))

Mooncake

  • Mooncake.jl performant reverse and forward mode AD supporting arbitrary Julia code
  • Operates on Julia IR
    • Julia updates can break Mooncake, and fixes often take weeks to months

Mooncake

import Mooncake as MC
ℒ = b->logit_likelihood(b,y,x)
cache = MC.prepare_gradient_cache(ℒ, β0)
@btime val, grad = MC.value_and_gradient!!(cache, ℒ, β0)
  18.175 μs (63 allocations: 41.34 KiB)
(-248.85783636700947, (Mooncake.NoTangent(), [-4.544200516511669, 0.8780749182470928, 6.355796397761697]))

FiniteDiff

  • FiniteDiff computes finite difference gradients. Always test that whatever automatic or manual derivatives you compute are close to the finite difference versions
  • use a package for finite differences to handle rounding error well
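A minimal sketch of checking against the f from the finite-differences slide, assuming FiniteDiff.jl is installed:

```julia
import FiniteDiff

f(x) = sin(x)/(1.0+exp(x))
# FiniteDiff picks step sizes and difference schemes for you
FiniteDiff.finite_difference_derivative(f, 2.0)  # ≈ -0.1450763, matches the AD results above

# the gradient version works the same way for vector inputs, e.g.
# FiniteDiff.finite_difference_gradient(b->logit_likelihood(b,y,x), β0)
```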

Symbolic Differentiation

FastDifferentiation.jl

import FastDifferentiation as FD
βs=FD.make_variables(:β, length(β0))
FD.@variables ys
xs = FD.make_variables(:x, length(β0))
function logit_likei(β,y,x)
    p = cdf(Logistic(),x'*β)
    log(p^(y)*(1-p)^(1-y))
end
logit_likei(βs,ys,xs)
log((((1 / (exp(-((((x1 * β1) + (x2 * β2)) + (x3 * β3)))) + 1)) ^ ys) * ((1 - (1 / (exp(-((((x1 * β1) + (x2 * β2)) + (x3 * β3)))) + 1))) ^ (1 - ys))))
∇logit_likei_ex = FD.jacobian([logit_likei(βs, ys, xs)], βs)
1×3 Matrix{FastDifferentiation.Node}:
 (-(((((1 / (((1 / (exp(-((((x1 * β1) + (x2 * β2)) + (x3 * β3)))) + 1)) ^ ys) * ((1 - (1 / (exp(-((((x1 * β1) + (x2 * β2)) + (x3 * β3)))) + 1))) ^ (1 - ys)))) * ((((1 - (1 / (exp(-((((x1 * β1) + (x2 * β2)) + (x3 * β3)))) + 1))) ^ (1 - ys)) * (ys * ((1 / (exp(-((((x1 * β1) + (x2 * β2)) + (x3 * β3)))) + 1)) ^ (ys - 1)))) + -((((1 / (exp(-((((x1 * β1) + (x2 * β2)) + (x3 * β3)))) + 1)) ^ ys) * ((1 - ys) * ((1 - (1 / (exp(-((((x1 * β1) + (x2 * β2)) + (x3 * β3)))) + 1))) ^ ((1 - ys) - 1))))))) * -(((1 / (exp(-((((x1 * β1) + (x2 * β2)) + (x3 * β3)))) + 1)) / (exp(-((((x1 * β1) + (x2 * β2)) + (x3 * β3)))) + 1)))) * exp(-((((x1 * β1) + (x2 * β2)) + (x3 * β3)))))) * x1)  …  (-(((((1 / (((1 / (exp(-((((x1 * β1) + (x2 * β2)) + (x3 * β3)))) + 1)) ^ ys) * ((1 - (1 / (exp(-((((x1 * β1) + (x2 * β2)) + (x3 * β3)))) + 1))) ^ (1 - ys)))) * ((((1 - (1 / (exp(-((((x1 * β1) + (x2 * β2)) + (x3 * β3)))) + 1))) ^ (1 - ys)) * (ys * ((1 / (exp(-((((x1 * β1) + (x2 * β2)) + (x3 * β3)))) + 1)) ^ (ys - 1)))) + -((((1 / (exp(-((((x1 * β1) + (x2 * β2)) + (x3 * β3)))) + 1)) ^ ys) * ((1 - ys) * ((1 - (1 / (exp(-((((x1 * β1) + (x2 * β2)) + (x3 * β3)))) + 1))) ^ ((1 - ys) - 1))))))) * -(((1 / (exp(-((((x1 * β1) + (x2 * β2)) + (x3 * β3)))) + 1)) / (exp(-((((x1 * β1) + (x2 * β2)) + (x3 * β3)))) + 1)))) * exp(-((((x1 * β1) + (x2 * β2)) + (x3 * β3)))))) * x3)
∇logit_likei = FD.make_function(∇logit_likei_ex, βs, [ys], xs, in_place=false)
function ∇logit_likelihood_FD(β,y,x)
    sum(∇logit_likei(β, [y[i]], x[i,:]) for i in 1:length(y))
end
@btime ∇logit_likelihood_FD(β0,y,x)
  66.225 μs (6497 allocations: 242.06 KiB)
1×3 Matrix{Float64}:
 -4.5442  0.878075  6.3558

DifferentiationInterface

import DifferentiationInterface as DI
DI.gradient(b->logit_likelihood(b,y,x), DI.AutoEnzyme(),β0)
3-element Vector{Float64}:
 -4.544200516511671
  0.8780749182470924
  6.355796397761696
  • improve performance by reusing intermediate variables
backend = DI.AutoEnzyme()
ℒ = b->logit_likelihood(b,y,x)
dcache = DI.prepare_gradient(ℒ, backend, β0)
grad = zero(β0)
DI.gradient!(ℒ, grad, dcache, backend, β0)
3-element Vector{Float64}:
 -4.544200516511671
  0.8780749182470924
  6.355796397761696
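Because the backend is just an argument, cross-checking two AD systems is a one-line change. A sketch assuming ForwardDiff and FiniteDiff are installed (ℒ is redefined so the snippet stands alone):

```julia
using LinearAlgebra
ℒ = b->logit_likelihood(b,y,x)
gfwd = DI.gradient(ℒ, DI.AutoForwardDiff(), β0)
gfin = DI.gradient(ℒ, DI.AutoFiniteDiff(), β0)
norm(gfwd - gfin)   # should be near zero if the two backends agree
```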

Other Packages

Other Packages

ChainRules

  • ChainRules
  • used by many AD packages to define the derivatives of various functions.
  • Useful if you want to define a custom derivative rule for a function.

ReverseDiff.jl

  • ReverseDiff.jl a tape based reverse mode package

  • Long lived and well tested

  • Has limitations: importantly, code must be generic and mutation of arrays is not allowed.

Tracker

Tracker is a tape based reverse mode package. It was the default autodiff package in Flux before being replaced by Zygote. No longer under active development, but solid and still used.

Zygote

  • Zygote.jl
  • Does not allow mutating arrays
  • Hard to develop and maintain, being phased out

Zygote Example

import Zygote
using LinearAlgebra
@time ∇Lz =  Zygote.gradient(b->logit_likelihood(b,y,x),β0)[1]
norm(∇L - ∇Lz)

Diffractor

Diffractor is an automatic differentiation package in development. It was once hoped to be the future of AD in Julia, but has been delayed. It plans to have both forward and reverse mode, but only forward mode is available so far.
