Convergence in Probability

Paul Schrimpf

2025-10-27

Reading

  • Required: Song (2021) chapter 9
  • Independent study: Song (2021) chapters 6-7

Convergence in Probability

\[ \def\Er{{\mathrm{E}}} \def\En{{\mathbb{E}_n}} \def\cov{{\mathrm{Cov}}} \def\var{{\mathrm{Var}}} \def\R{{\mathbb{R}}} \newcommand\norm[1]{\left\lVert#1\right\rVert} \def\rank{{\mathrm{rank}}} \newcommand{\inpr}{ \overset{p^*_{\scriptscriptstyle n}}{\longrightarrow}} \def\inprob{{\,{\buildrel p \over \rightarrow}\,}} \def\indist{\,{\buildrel d \over \rightarrow}\,} \DeclareMathOperator*{\plim}{plim} \]

Convergence in Probability

Definition

Random vectors \(X_1, X_2, ...\) converge in probability to the random vector \(Y\) if for all \(\epsilon>0\) \[ \lim_{n \to \infty} P\left( \norm{X_n - Y} > \epsilon \right) = 0 \] denoted by \(X_n \inprob Y\) or \(\plim_{n \to \infty} X_n = Y\)

  • Typical use: \(X_n = \hat{\theta}_n\), estimator from \(n\) observations, \(Y=\theta_0\), a constant.
  • How to show that \(\hat{\theta}_n \inprob \theta_0\)?

\(L^p\) convergence

Definition

Random vectors \(X_1, X_2, ...\) converge in \(L^p\) to the random vector \(Y\) if \[ \lim_{n \to \infty} \Er\left[ \norm{X_n-Y}^p \right] = 0 \]

  • \(p=2\) called convergence in mean square

Markov’s Inequality

Markov’s Inequality

\(P(|X|>\epsilon) \leq \frac{\Er[|X|^k]}{\epsilon^k}\) \(\forall \epsilon > 0, k > 0\)

  • \(P\left( \norm{X_n - Y} > \epsilon \right) \leq \frac{\Er[ \norm{X_n - Y}^k]} {\epsilon^k}\)
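A small numerical check of Markov's inequality; the Exponential(1) distribution, the cutoff \(\epsilon = 2\), and the exponents \(k\) are my own illustrative choices.
Code
using Distributions, Statistics, Random
Random.seed!(1234)
d = Exponential(1)
x = rand(d, 10^6)                     # simulated draws of X
ϵ = 2.0
tail = mean(abs.(x) .> ϵ)             # simulated P(|X| > ϵ)
for k in (1, 2, 3)
    bound = mean(abs.(x) .^ k) / ϵ^k  # simulated E[|X|^k] / ϵ^k
    println("k = $k: P(|X| > ϵ) ≈ $(round(tail, digits=3)) ≤ $(round(bound, digits=3))")
end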

Convergence in \(L^p\) implies convergence in probability

Theorem 1.1

If \(X_n\) converges in \(L^p\) to \(Y\), then \(X_n \inprob Y\) (apply the inequality above with \(k = p\)).

Application to Estimators

  • An estimator, \(\hat{\theta}\) is consistent if \(\hat{\theta} \inprob \theta_0\)

  • Implication for estimators: \[ \begin{aligned} MSE(\hat{\theta}_n) = & \Er[ \norm{\hat{\theta}_n - \theta_0}^2 ] \\ = & tr[\var(\hat{\theta}_n)] + Bias(\hat{\theta}_n)'Bias(\hat{\theta}_n) \end{aligned} \]

  • If \(MSE(\hat{\theta}_n) \to 0\), then \(\hat{\theta}_n \inprob \theta_0\) (see the sample-mean example below)

  • If \(\lim_{n \to \infty} \Er[\hat{\theta}_n] \neq \theta_0\) and \(\hat{\theta}_n\) is uniformly integrable, then \(\plim \hat{\theta}_n \neq \theta_0\) (without uniform integrability, an estimator can be consistent even though its expectation does not converge to \(\theta_0\))
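For example, for the sample mean \(\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i\) of i.i.d. data with mean \(\mu\) and variance \(\sigma^2 < \infty\), the bias is zero, so \[ MSE(\bar{X}_n) = tr[\var(\bar{X}_n)] = \frac{\sigma^2}{n} \to 0, \] and therefore \(\bar{X}_n \inprob \mu\).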

Consistency of Least-Squares

  • In \(y = X \beta_0 + u\), when does \(\hat{\beta} = (X'X)^{-1} X' y\) \(\inprob \beta_0\)?

  • Sufficient that \(MSE(\hat{\beta}) = tr[\var(\hat{\beta})] + Bias(\hat{\beta})'Bias(\hat{\beta}) \to 0\)

  • \(\var(\hat{\beta}) = \sigma^2 (X'X)^{-1}\) (treating \(X\) as non-stochastic)

  • \(tr((X'X)^{-1}) \leq \frac{k}{\lambda_{min}(X'X)}\), so if \(\Er[u] = 0\) (no bias) and \(\lambda_{min}(X'X) \to \infty\), then \(MSE(\hat{\beta}) \to 0\) and \(\hat{\beta} \inprob \beta_0\)
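A quick numerical check of the trace bound; the simulated design (sample size, number of regressors, standard-normal regressors) is an illustrative assumption, not part of the argument above.
Code
using LinearAlgebra, Random
Random.seed!(1234)
n, k = 200, 3                     # assumed sample size and number of regressors
X = [ones(n) randn(n, k - 1)]     # design matrix with an intercept
S = Symmetric(X'X)
lhs = tr(inv(S))                  # tr((X'X)⁻¹) = Σᵢ 1/λᵢ
rhs = k / eigmin(S)               # k / λ_min(X'X)
println("tr((X'X)⁻¹) = ", round(lhs, sigdigits=4), " ≤ ", round(rhs, sigdigits=4))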

Convergence in Probability of Functions

Theorem 2.2

If \(X_n \inprob X\), and \(f\) is continuous, then \(f(X_n) \inprob f(X)\)

Slutsky’s Lemma

If \(Y_n \inprob c\) and \(W_n \inprob d\), then

  • \(Y_n + W_n \inprob c+ d\)
  • \(Y_n W_n \inprob cd\)
  • \(Y_n / W_n \inprob c/d\) if \(d \neq 0\)
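For example, combining Theorem 2.2 and Slutsky's lemma: if \(\hat{\mu}_n \inprob \mu\) and \(\hat{\sigma}^2_n \inprob \sigma^2 > 0\), then \[ \frac{\hat{\mu}_n}{\sqrt{\hat{\sigma}^2_n}} \inprob \frac{\mu}{\sigma} \] since \(\sqrt{\cdot}\) is continuous and the limit of the denominator is nonzero.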

Weak Law of Large Numbers

Weak Law of Large Numbers

If \(X_1, ..., X_n\) are i.i.d. and \(\Er[X^2]\) exists, then \[ \frac{1}{n} \sum X_i \inprob \Er[X] \]

  • Proof: use Markov’s inequality with \(k = 2\) (sketch below)

  • This is the simplest WLLN to prove, but there are many variants with alternative assumptions that also imply \(\frac{1}{n} \sum X_i \inprob \Er[X]\)
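Sketch of the proof: applying Markov's inequality with \(k = 2\) (Chebyshev's inequality) to the sample mean, and using independence, \[ P\left( \left| \frac{1}{n} \sum_{i=1}^n X_i - \Er[X] \right| > \epsilon \right) \leq \frac{\var\left( \frac{1}{n} \sum_{i=1}^n X_i \right)}{\epsilon^2} = \frac{\var(X)}{n \epsilon^2} \to 0 \] for every \(\epsilon > 0\), where \(\var(X) < \infty\) because \(\Er[X^2]\) exists.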

Consistency of Least Squares Revisited

  • In \(y = X \beta_0 + u\), when does \(\hat{\beta} \inprob \beta_0\)?

  • Treat \(X\) as stochastic

  • \(\hat{\beta} = \left(\frac{1}{n} \sum_{i=1}^n X_i X_i' \right)^{-1} \left(\frac{1}{n} \sum_{i=1}^n X_i y_i \right)\)

  • If a WLLN applies to \(\frac{1}{n} \sum_{i=1}^n X_i X_i'\) and \(\frac{1}{n} \sum_{i=1}^n X_i y_i\) (and \(\Er[X_i X_i']^{-1}\) exists), then by Theorem 2.2, \[ \hat{\beta} \inprob \Er[X_i X_i']^{-1} \Er[X_i y_i] = \beta_0 + \Er[X_i X_i']^{-1} \Er[X_i u_i] = \beta_0 \] when \(\Er[X_i u_i] = 0\)

  • Sufficient conditions: \((X_i, y_i)\) i.i.d., \(\Er[X_i u_i] = 0\), finite fourth moments of \(X_i\), and finite \(\Er[u_i^2]\) (simulation sketch below)
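A minimal simulation sketch of this result; the data-generating process (an intercept plus one standard-normal regressor, \(\beta_0 = (1, 2)'\), homoskedastic standard-normal errors) is an assumed example.
Code
using LinearAlgebra, Random
Random.seed!(1234)
β0 = [1.0, 2.0]                  # assumed true coefficients (intercept, slope)
function ols(n)
    x = randn(n)
    X = [ones(n) x]
    u = randn(n)                 # E[X_i u_i] = 0 by construction
    y = X * β0 + u
    return (X'X) \ (X'y)         # β̂ = (X'X)⁻¹ X'y
end
for n in (100, 1_000, 10_000, 100_000)
    println("n = $n: β̂ = ", round.(ols(n), digits=3))
end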

Convergence Rates

Convergence Rates

Definition

Given a sequence of random variables, \(X_1, X_2, ...\) and constants \(b_1, b_2, ...\), then

  • \(X_n = O_p(b_n)\) if for all \(\epsilon > 0\) there exists \(M_\epsilon\) s.t. \[ \limsup_{n \to \infty} P\left(\frac{\norm{X_n}}{b_n} \geq M_\epsilon \right) < \epsilon \]
  • \(X_n = o_p(b_n)\) if \(\frac{X_n}{b_n} \inprob 0\)

Example: Little \(o_p\)

  • Real valued \(X_1, ..., X_n\) i.i.d., with \(\Er[X] = \mu\), \(\var(X_i) = \sigma^2\)

  • Markov’s inequality \[ P\left( |\overbrace{\En X_i}^{\equiv \frac{1}{n} \sum_{i=1}^n X_i} - \mu | > a \right) \leq \frac{\var(\En X_i - \mu)}{a^2} \leq \frac{\sigma^2}{n a^2} \]

  • Let \(a = \epsilon n^{-\alpha}\), then \[ P\left( \frac{|\En X_i - \mu |}{n^{-\alpha}} > \epsilon \right) \leq \frac{\sigma^2}{n^{1 - 2\alpha}\epsilon^2} \]

  • \(|\En X_i - \mu | = o_p(n^{-\alpha})\) for \(\alpha \in (0, 1/2)\)

Example: Big \(O_p\)

  • Real valued \(X_1, ..., X_n\) i.i.d., with \(\Er[X] = \mu\), \(\var(X_i) = \sigma^2\)

  • Markov’s inequality \[ P\left( |\overbrace{\En X_i}^{\equiv \frac{1}{n} \sum_{i=1}^n X_i} - \mu | > a \right) \leq \frac{\var(\En X_i - \mu)}{a^2} \leq \frac{\sigma^2}{n a^2} \]

  • Let \(a = \sigma \epsilon^{-1/2} n^{-1/2}\), \[ P\left( \frac{|\En X_i - \mu |}{n^{-1/2}} > \underbrace{\sigma \epsilon^{-1/2}}_{M_\epsilon} \right) \leq \epsilon \] so \(|\En X_i - \mu | = O_p(n^{-\alpha})\) for \(\alpha \in (0, 1/2]\)

Simulation of Stochastic Order

  • \(\frac{1}{n} \sum_{i=1}^n x_i = O_p(n^{-1/2})\) means that by scaling the red line by \(M_\epsilon\), we can make the probability of the blue lines being above it less than \(\epsilon\)
Code
using Distributions, Plots, Statistics, Random, LaTeXStrings
Random.seed!(1234)
S = 100                        # number of simulated sample paths
n = 2:400                      # range of sample sizes
distribution = Exponential(1)
dgp(n) = rand(distribution, n)
μ = mean(distribution)         # population mean
fig = plot()
for s in 1:S
    data = dgp(maximum(n))
    # one blue line per simulation: |sample mean - μ| as the sample size grows
    plot!(fig, n, abs.(mean.(data[1:i] for i in n) .- μ), alpha=0.2, label=false, color=:blue)
end
plot!(fig, n, 1 ./ sqrt.(n), label=L"n^{-1/2}", color=:red, lw=3)  # reference rate n^{-1/2}
fig

Simulation of Stochastic Order

  • \(\min_{i} x_i = o_p(n^{-1/2})\) implies that, with probability approaching one, the blue lines fall below any fixed multiple of \(n^{-1/2}\) for \(n\) large enough
Code
using Distributions, Plots, Statistics, Random, LaTeXStrings
Random.seed!(1234)
S = 100                        # number of simulated sample paths
n = 2:400                      # range of sample sizes
distribution = Exponential(1)
dgp(n) = rand(distribution, n)
m₀ = minimum(distribution)     # lower bound of the support (0 for Exponential)
fig = plot()
for s in 1:S
    data = dgp(maximum(n))
    # one blue line per simulation: |sample minimum - m₀| as the sample size grows
    plot!(fig, n, abs.(minimum.(data[1:i] for i in n) .- m₀), alpha=0.2, label=false, color=:blue)
end
plot!(fig, n, 1 ./ sqrt.(n), label=L"n^{-1/2}", lw=3)  # reference rate n^{-1/2}
plot!(fig, n, 1 ./(n), label=L"n^{-1}", lw=3)          # reference rate n^{-1}
ylims!(fig, (0, 1/sqrt(10)))
fig

Big \(O\) and little \(o\) calculus

  • If \(X_n = o_p(a_n)\) and \(Y_n=o_p(b_n)\), then
    • \(X_n + Y_n = o_p(\max\{a_n,b_n\})\)
    • \(|X_n'Y_n| = o_p(a_n b_n)\)
  • If \(X_n = O_p(a_n)\) and \(Y_n=O_p(b_n)\), then
    • \(X_n + Y_n = O_p(\max\{a_n,b_n\})\)
    • \(|X_n'Y_n| = O_p(a_n b_n)\)
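For instance, the first \(o_p\) rule follows from the triangle inequality (taking \(a_n, b_n > 0\)): \[ \frac{\norm{X_n + Y_n}}{\max\{a_n, b_n\}} \leq \frac{\norm{X_n}}{\max\{a_n, b_n\}} + \frac{\norm{Y_n}}{\max\{a_n, b_n\}} \leq \frac{\norm{X_n}}{a_n} + \frac{\norm{Y_n}}{b_n} \inprob 0 \] since each term on the right converges in probability to \(0\).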

Non-Asymptotic Bounds

Non-Asymptotic Bounds

  • Let \(\Er[X] = \mu\)
  • Markov’s inequality: \(P(|X - \mu |>\epsilon) \leq \frac{\Er[|X - \mu|^k]}{\epsilon^k}\) \(\forall \epsilon > 0, k > 0\)
  • Idea: minimize the right-hand side over \(k\) to get a tighter bound

Non-Asymptotic Bounds

  • Markov’s inequality for \(e^{\lambda(X-\mu)}\) \[ P(X-\mu>\epsilon) = P\left( e^{\lambda (X - \mu)} > e^{\lambda \epsilon} \right) \leq e^{-\lambda \epsilon} \Er\left[e^{\lambda (X-\mu)}\right] \]
  • \(M_X(\lambda)=\Er\left[e^{\lambda (X-\mu)}\right]\) is the (centered) moment generating function
    • If \(X \sim N(\mu, \sigma^2)\), then \(\Er\left[e^{\lambda (X-\mu)}\right] = e^{\frac{\lambda^2 \sigma^2}{2}}\)
    • If \(|X| \leq b\), then \(\Er\left[e^{\lambda (X-\mu)}\right] \leq e^{\lambda^2 b^2}\)

Non-Asymptotic Bounds

  • Suppose \(\Er\left[e^{\lambda (X-\mu)}\right] \leq e^{\frac{\lambda^2 \sigma^2}{2}}\), then \[ P(X-\mu>\epsilon) \leq \inf_{\lambda \geq 0} e^{-\lambda \epsilon + \lambda^2 \sigma^2/2} = e^{-\frac{\epsilon^2}{2 \sigma^2}} \] (the infimum is attained at \(\lambda = \epsilon/\sigma^2\))

  • Suppose \(\Er\left[e^{\lambda (X_i-\mu)}\right] \leq e^{\frac{\lambda^2 \sigma^2}{2}}\) and \(X_i\) are independent, then \[ P\left(\frac{1}{n} \sum_{i=1}^n X_i-\mu >\epsilon\right) \leq e^{-\frac{\epsilon^2 n}{2 \sigma^2}} \]
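A small simulation sketch comparing this bound with an actual tail probability. The example distribution (Uniform(0,1), so \(|X_i| \leq b = 1\) and, by the previous slide, \(\sigma^2 = 2b^2\) works) and the choices of \(n\), \(\epsilon\), and number of replications are my own assumptions.
Code
using Random, Statistics
Random.seed!(1234)
S = 10_000                       # number of simulated sample means
n = 200                          # sample size
ϵ = 0.05
μ = 0.5                          # mean of Uniform(0,1)
σ2 = 2.0                         # sub-Gaussian parameter: σ² = 2b² with b = 1
tail = mean(mean(rand(n)) - μ > ϵ for _ in 1:S)   # simulated P(sample mean - μ > ϵ)
bound = exp(-n * ϵ^2 / (2 * σ2))                  # e^{-nϵ²/(2σ²)}
println("simulated P(X̄ - μ > ϵ) ≈ $tail ≤ bound = $(round(bound, digits=3))")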

Application: Generalization Bound

  • Given \(\hat{\beta} = \mathrm{arg}\min \frac{1}{n} \sum_{i=1}^n \ell( y_i, x_i; \beta)\), can we bound \[ \Er[\ell(y_i,x_i;\hat{\beta})]? \]

  • Yes, under reasonable assumptions, we can show \[ P\left( \Er[\ell(Y,X;\beta)] \geq \frac{1}{n} \sum_{i=1}^n \ell \left(y_i, x_i;\beta \right) + \sqrt{\frac{C_0 + C_1 \log(1/\delta) + (k/2)\log(n)}{n}} \text{ for some } \beta \in \mathscr{B} \right) \leq \delta \] i.e., with probability at least \(1-\delta\), the bound holds simultaneously for every \(\beta \in \mathscr{B}\), including \(\hat{\beta}\)

  • To show this, use:

    • The non-asymptotic bound from above for each fixed \(\beta\)
    • A union-bound argument over finitely many different values of \(\beta\) (sketched below)
    • Compactness of \(\mathscr{B}\) to approximate the infinitely many values of \(\beta\) by a finite set
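A minimal sketch of the union-bound step, for a finite grid \(\beta^{(1)}, \dots, \beta^{(J)}\): applying the average bound from the previous slide to \(-\ell(y_i, x_i; \beta^{(j)})\) (assuming each loss is sub-Gaussian with parameter \(\sigma\)), \[ P\left( \exists j \leq J: \Er[\ell(Y,X;\beta^{(j)})] - \frac{1}{n} \sum_{i=1}^n \ell(y_i,x_i;\beta^{(j)}) > \epsilon \right) \leq \sum_{j=1}^J P\left( \Er[\ell(Y,X;\beta^{(j)})] - \frac{1}{n} \sum_{i=1}^n \ell(y_i,x_i;\beta^{(j)}) > \epsilon \right) \leq J e^{-\frac{n \epsilon^2}{2\sigma^2}}. \] Setting the right side equal to \(\delta\) and solving gives \(\epsilon = \sqrt{2\sigma^2\left(\log J + \log(1/\delta)\right)/n}\); a grid with \(J\) proportional to \(n^{k/2}\) points (from the compactness step) is one way the \((k/2)\log(n)\) term in the displayed bound can arise.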

References

Song, Kyunchul. 2021. “Introduction to Econometrics.”