ECON526: Quantitative Economics with Data Science Applications

Probability and Uncertainty

Jesse Perla

University of British Columbia

Overview

Summary

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import scipy.stats
import seaborn as sns
from matplotlib.animation import FuncAnimation
import IPython.display

Probability

Definitions

To formalize probability, always be careful to separate

  1. Events, i.e., the probability space.
  2. The probability that an event occurs, i.e., the probability measure.
  3. The value or implications of an event, i.e., random variables.

Probability Space

A probability space is a pair \((\Omega, \mathcal{A})\):

  • Set \(\Omega\) of possible outcomes and \(\omega \in \Omega\) is a particular outcome

    • e.g. \(\Omega = \{U, E, R, D\}\) for unemployed, employed, retired, or dead
  • Subsets \(A \subseteq \Omega\) are events

    • e.g. \(A = \{U, E\}\) is the event of being employed or unemployed
    • \(\Omega \setminus A = \{R, D\}\) (the set difference) is the event of being neither
  • The collection of all possible events is \(\mathcal{A}\) where \(A \in \mathcal{A}\)

    • \(\Omega \in \mathcal{A}\), i.e. we can consider the event of any outcome occurring
    • \(\emptyset \in \mathcal{A}\), i.e. we can consider the event of nothing occurring

Probability Measure

A Probability Measure is a function which assigns a numerical value to the likelihood of an event

  • For us, \(\mathbb{P} : \mathcal{A} \rightarrow [0,1]\)

    • e.g. \(\mathbb{P}(\{U, E\}) = 0.7\) is probability either \(U\) or \(E\)
    • \(\mathbb{P}(\Omega \setminus \{U, E\}) = 0.3\)
  • You will also see it denoted as a measure, \(\mu(A)\), when working with integrals in advanced uses

    • Overkill for probability spaces with a finite, discrete number of elements
    • Important for probability spaces with a continuous number of elements
    • Essential for stochastic processes (e.g., flipping a coin until heads)

Random Variables

Random Variable: \(X(\omega)\) assigns a numerical value to a particular outcome

  • \(X : \Omega \rightarrow \mathbb{R}\), but could be vector or matrix valued

    • e.g. \(X(\omega = E) = 1\) if employed, \(X(\omega = U) = 0\) if unemployed. Useful for doing counts
  • Or \(X(\omega = E) = \$40,000\) if employed, \(X(\omega = U) = \$15,000\) if unemployed. Useful for finding average incomes

  • Random variables defined on \(\Omega\), and inherit the probability measure

    • So we can query values like \(\mathbb{P}(X = \$40,000)\)

Axioms of Probability

Probability measure \(\mathbb{P}\) on probability space \((\Omega, \mathcal{A})\) must satisfy axioms:

  • Non-negativity: \(\mathbb{P}(A) \geq 0\)
  • Normalization: \(\mathbb{P}(\Omega) = 1\)
  • Additivity: If \(A \cap B = \emptyset\), then \(\mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B)\)

These imply other results such as:

  • \(\mathbb{P}(\emptyset) = 0\)
  • \(\mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B) - \mathbb{P}(A \cap B)\)
  • \(\mathbb{P}(\Omega \setminus A) = 1 - \mathbb{P}(A)\)
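
These implied results are easy to verify numerically. A minimal sketch with assumed (illustrative) probabilities for the \(\Omega = \{U, E, R, D\}\) example:

# Assumed (illustrative) probabilities for the outcomes in Omega = {U, E, R, D}
P_outcomes = {"U": 0.2, "E": 0.5, "R": 0.2, "D": 0.1}
def prob(event):  # additivity: P(A) is the sum of outcome probabilities in A
    return sum(P_outcomes[w] for w in event)
A = {"U", "E"}
B = {"E", "R"}
print(prob(set(P_outcomes)))  # normalization: P(Omega) = 1
print(np.isclose(prob(A | B), prob(A) + prob(B) - prob(A & B)))  # inclusion-exclusion
print(np.isclose(prob(set(P_outcomes) - A), 1 - prob(A)))  # complement rule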

Discrete Distributions

Discrete Distributions

  • A discrete probability space has a finite (or countable) number of outcomes

  • When convenient, we can number the outcomes arbitrarily as \(n = 1,\ldots N\) (or \(\infty\)) and then work with \(\Omega = \{1, \ldots N\}\) and \(\omega \in \Omega\)

  • The axioms are especially simple because we can write \(\mathbb{P}(\omega = n) = p_n\):

    • Non-negativity: \(p_n \geq 0\)
    • Normalization: \(\sum\limits_{n=1}^{N} p_n = 1\)
    • Additivity: \(\mathbb{P}(A) = \sum\limits_{n \in A} p_n\)

Random Variables

  • Notation can become a little confusing because we will sometimes use the same index number for the event and for the random value, but they are separate!

  • Frequently we will assign the random variable as just that index

    • \(X(\omega = n) = n\) and then denote \(\mathbb{P}(X = n) = p_n\)
  • Other times we may want to associate a value with each outcome

    • \(X(\omega = n) = x_n\) and then denote \(\mathbb{P}(X = x_n) = p_n\)

PMF and CDF

  • Probability Mass Function (PMF) is the probability of a single outcome for random variable \(X\). Will assume \(X\) itself has discrete values \[ p_n \equiv \mathbb{P}(X = n) \]

  • Cumulative Distribution Function (CDF) is the probability of all outcomes less than or equal to a particular outcome. \[ \mathbb{P}(X \leq n) = \sum\limits_{i=1}^{n} p_i \]

Expectation

  • Expectation of a random variable is the sum of the values weighted by the probabilities. Continuous \(\Omega\) uses integrals, or measure theory if “weird”
  • Especially easy to compute for discrete random variables \[ \mathbb{E}[X] = \sum_{n=1}^N x_n \mathbb{P}(X = x_n) \]
  • Generalized to functions of a random variable \[ \mathbb{E}[f(X)] = \sum_{n=1}^N f(x_n) \mathbb{P}(X = x_n) \]

Expectations and Linear Algebra

Vectors can help with the accounting and notation of expectations. Let

  • \(x \equiv \begin{bmatrix} x_1 & x_2 & \ldots & x_N \end{bmatrix}^{\top}\) be the list of values for the random variable \(X\)
  • \(p \equiv \begin{bmatrix} p_1 & p_2 & \ldots & p_N \end{bmatrix}^{\top}\) be the list of probabilities
  • Then the expectation is (broadcasting \(f(\cdot)\) across \(x\) as required)

\[ \begin{aligned} \mathbb{E}[X] &= \sum_{n=1}^N x_n \mathbb{P}(X = x_n) = p \cdot x = p^{\top} x\\ \mathbb{E}[f(X)] &= \sum_{n=1}^N f(x_n) \mathbb{P}(X = x_n) = p \cdot f(x) = p^{\top} f(x) \end{aligned} \]

Example with a Discrete Distribution

  • \(\Omega = \{U, E, R\}\)
  • \(\mathbb{P}(U) = 0.1, \mathbb{P}(E) = 0.8, \mathbb{P}(R) = 0.1\)
  • \(X(U) = 15000, X(E) = 40000, X(R) = 10000\)
  • \(\mathbb{E}[X]\) and \(\mathbb{E}[\sqrt{X}]\)
p = np.array([0.1, 0.8, 0.1])
x = np.array([15000, 40000, 10000])
def f(x):
    return np.sqrt(x)
print(f"E(X) = {p @ x}")
print(f"E(f(X)) = {p @ f(x)}")
print(f"CDF(X) = {np.cumsum(p)}")
E(X) = 34500.0
E(f(X)) = 182.2474487139159
CDF(X) = [0.1 0.9 1. ]

Note that the CDF was easy to calculate with cumulative sums, but since the \(x\) values are not sorted here, is it interpretable?

Using rv_discrete

  • scipy.stats has an rv_discrete type with built-in functions
  • Might need to start with sorted x values
  • Useful for working with discrete random variables
p = np.array([0.1, 0.1, 0.8])
x = np.array([10000, 15000, 40000])
u = scipy.stats.rv_discrete(
  values=(x, p))
samples = u.rvs(size=5)
print(f"E(X) = {u.mean()}")
print(f"E(f(X)) = {u.expect(f)}")
print(f"CDF(X) = {u.cdf(x)}")
print(f"Samples of X = {samples}")
E(X) = 34500.0
E(f(X)) = 182.2474487139159
CDF(X) = [0.1 0.2 1. ]
Samples of X = [15000 40000 40000 40000 40000]

Histogram \(N=50\)

N = 50
samples = u.rvs(size=N)
ax = sns.histplot(samples,
  stat="proportion")

# Alternative
# Density doesn't add up
# plt.hist(samples,
#   density=True)
# Or must build barchart

Histogram \(N=500\)

N = 500
samples = u.rvs(size=N)
ax = sns.histplot(samples,
  stat="proportion")

The Binomial Distribution

For \(n = 0, 1, \ldots, N\), the binomial distribution is defined by the PMF

\[ \displaystyle \begin{aligned} \mathbb{P}(X = n) &= \binom{N}{n} \theta^n (1-\theta)^{N-n}\\ \mathbb{E}(X) &= \sum_{n=0}^N n \binom{N}{n} \theta^n (1-\theta)^{N-n} = N \theta \end{aligned} \]

N = 10
θ = 0.5
u = scipy.stats.binom(N, θ)
print(f"Mean: {u.mean():.2f}")
print(f"Variance: {u.var():.2f}")
print(f"Draws of u: {u.rvs(5)}")
Mean: 5.00
Variance: 2.50
Draws of u: [3 4 7 5 5]

The Binomial Probability Mass Function

grid = np.arange(N+1)
u_pmf = u.pmf(grid)

fig, ax = plt.subplots()
ax.plot(grid, u_pmf,
  linestyle='',
  marker='o',
  alpha=0.8, ms=4)
ax.vlines(grid, 0,
  u_pmf,
  lw=0.2)
ax.set_xticks(grid)
plt.show()

The Binomial Cumulative Distribution Function

\[ \mathbb{P}(X \leq n) = \sum_{i=0}^n \binom{N}{i} \theta^i (1-\theta)^{N-i} \]

grid = np.arange(N+1)
u_cdf = u.cdf(grid)

fig, ax = plt.subplots()
ax.step(grid, u_cdf)
ax.vlines(grid, 0, u_cdf,
  lw=0.2)
ax.set_xticks(grid)
plt.show()

Histogram \(N=50\)

N = 50
u = scipy.stats.binom(10, 0.5)
plt.hist(u.rvs(size=N),
  density=True,
  align='left')
plt.show()

Histogram \(N=5000\)

N = 5000
u = scipy.stats.binom(10, 0.5)
plt.hist(u.rvs(size=N),
  density=True,
  align='left')
plt.show()

LLN and CLT

Law of Large Numbers (LLN)

  • A classic LLN result is the Weak Law of Large Numbers

  • Take a sequence of independent, identically distributed random variables \(X_1, X_2, \ldots\) with \(\mathbb{E}[X_i] = \mu\) and \(\mathbb{V}[X_i] = \sigma^2 < \infty\). Then,

    • Since each \(X_i\) is a random variable, \(\bar{X}_N \equiv \frac{1}{N}\sum_{i=1}^N X_i\) is also an RV
    • The law says for any \(\epsilon > 0\), \(\lim_{N\to\infty} \mathbb{P}(|\bar{X}_N - \mu| > \epsilon) = 0\)
    • Sometimes denoted \(\bar{X}_N \xrightarrow{p} \mu\) for “convergence in probability”
  • Powerful and frequently used, but remember assumptions!
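
As a quick numeric sketch (assumed standard normal draws, so \(\mu = 0\)), the error of the sample mean shrinks as \(N\) grows:

np.random.seed(42)
mu, sigma = 0, 1
for N in [10, 1_000, 100_000]:
    x_bar = np.random.normal(mu, sigma, N).mean()
    print(f"N = {N:>6}: |sample mean - mu| = {abs(x_bar - mu):.4f}")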

Visualizing the LLN with Gaussian RVs \(N=20\)

N = 20  # Number of samples
mu, sigma = 0, 1
np.random.seed(42)
samples = np.random.normal(mu, sigma, N)
sample_means = np.cumsum(samples) / np.arange(1, N + 1)
plt.scatter(range(1, N + 1), samples, label='Individual Samples', alpha=0.6, s=10)
plt.plot(range(1, N + 1), sample_means, label='Sample Mean', linewidth=2)
plt.axhline(mu, color='r', linestyle='--', label='True Mean')
for n in range(N): # add lines to samples from sample mean
    plt.plot([n + 1, n + 1], [sample_means[n], samples[n]], color='gray', linewidth=0.5, alpha=0.6)
plt.xlabel('Sample Number $N$')
plt.legend()
plt.show()

Visualizing the LLN with Gaussian RVs \(N=100\)

N = 100  # Number of samples
mu, sigma = 0, 1
np.random.seed(42)
samples = np.random.normal(mu, sigma, N)
sample_means = np.cumsum(samples) / np.arange(1, N + 1)
plt.scatter(range(1, N + 1), samples, label='Individual Samples', alpha=0.6, s=10)
plt.plot(range(1, N + 1), sample_means, label='Sample Mean', linewidth=2)
plt.axhline(mu, color='r', linestyle='--', label='True Mean')
for n in range(N): # add lines to samples from sample mean
    plt.plot([n + 1, n + 1], [sample_means[n], samples[n]], color='gray', linewidth=0.5, alpha=0.6)
plt.xlabel('Sample Number $N$')
plt.legend()
plt.show()

Pareto Distributions

  • Pareto distributions are a family of distributions with a power-law tail

  • Parameterized by \((x_m, \alpha)\) with the PDF \[ p(x) = \frac{\alpha x_m^{\alpha}}{x^{\alpha+1}}, \quad x \geq x_m \]

  • The mean is \(\mathbb{E}[X] = \frac{\alpha x_m}{\alpha - 1}\) for \(\alpha > 1\)

  • The variance is \(\mathbb{V}[X] = \frac{\alpha x_m^2}{(\alpha - 1)^2(\alpha - 2)}\) for \(\alpha > 2\)
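
As a sanity check on these formulas, they can be compared against scipy.stats.pareto, where the shape parameter is \(\alpha\) and \(x_m\) enters through scale (a minimal sketch with \(\alpha = 3\) so that both moments exist):

alpha, x_m = 3.0, 1.0
dist = scipy.stats.pareto(alpha, scale=x_m)  # scipy: shape = alpha, scale = x_m
mean_analytic = alpha * x_m / (alpha - 1)
var_analytic = alpha * x_m**2 / ((alpha - 1)**2 * (alpha - 2))
print(np.isclose(dist.mean(), mean_analytic))  # True
print(np.isclose(dist.var(), var_analytic))    # True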

Visualizing the Sample Means for a Pareto Distribution

N = 100 # Number of samples
alpha = 1.5
np.random.seed(42)
dist = scipy.stats.pareto(alpha)
samples = dist.rvs(size=N)
sample_means = np.cumsum(samples) / np.arange(1, N + 1)
plt.scatter(range(1, N + 1), samples, label='Individual Samples', alpha=0.6, s=10)
plt.plot(range(1, N + 1), sample_means, label='Sample Mean', linewidth=2)
plt.axhline(dist.mean(), color='r', linestyle='--', label='True Mean')
for n in range(N): # add lines to samples from sample mean
    plt.plot([n + 1, n + 1], [sample_means[n], samples[n]], color='gray', linewidth=0.5, alpha=0.6)
plt.xlabel('Sample Number $N$')
plt.legend()
plt.show()

Central Limit Theorem (CLT)

  • The Central Limit Theorem (CLT) is a classic result in statistics

  • Again, let’s assume we have IID observations with \(\mathbb{E}[X_i] = \mu\) and \(\mathbb{V}[X_i] = \sigma^2 < \infty\)

  • Define the sample mean \(\bar{X}_N \equiv \frac{1}{N} \sum_{i=1}^N X_i\)

  • Then the CLT is

    \[ \sqrt{N} \left( \bar{X}_N - \mu \right) \xrightarrow{d} \mathcal{N}(0, \sigma^2) \]

    • That notation means converges in distribution, which roughly means that as \(N\to\infty\) the CDFs get closer and closer to the CDF of the limiting distribution
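
Before the animated visualizations below, a minimal static check (assumed exponential draws): across many simulated samples, the standard deviation of \(\sqrt{N}(\bar{X}_N - \mu)\) should be close to \(\sigma\).

np.random.seed(42)
dist = scipy.stats.expon(scale=2)  # an exponential with mean = std = 2
mu, sigma = dist.mean(), dist.std()
N, reps = 1_000, 5_000
sample_means = dist.rvs((reps, N)).mean(axis=1)
Y = np.sqrt(N) * (sample_means - mu)
print(f"std of sqrt(N)(sample mean - mu) = {Y.std():.3f}, sigma = {sigma:.3f}")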

Visualizing the CLT with Exponential Distributions

  • See QuantEcon CLT lecture for the source.
  • Exponential distributions \(p(x) = \lambda e^{-\lambda x}\) for \(\lambda = 0.5\)
distribution = scipy.stats.expon(scale=1/0.5)  # scale = 1/lambda
mu, s = distribution.mean(), distribution.std()
x = np.linspace(0, 15.0, 1000)
y = distribution.pdf(x)
plt.plot(x, y)
plt.xlabel('Value')
plt.ylabel('Density')
plt.show()

CLT of Exponential to \(N=100\)

fig, ax = plt.subplots()
def update(n):
    ax.clear()
    data = distribution.rvs((5000, n))
    sample_means = data.mean(axis=1)
    Y = np.sqrt(n) * (sample_means - mu)
    ax.set_xlim(-3 * s, 3 * s)
    ax.set_ylim(0, 0.5)
    ax.hist(Y, bins=60, alpha=0.5,
      density=True)
    ax.set_title(f"CLT for $N = {n}$")
ani = FuncAnimation(fig,update,
  frames=range(1, 100, 5),
  interval=500,blit=False, repeat=False)
plt.close()
IPython.display.HTML(ani.to_html5_video())

CLT of Exponential to \(N=2000\)

fig, ax = plt.subplots()
def update(n):
    ax.clear()
    data = distribution.rvs((10000, n))
    sample_means = data.mean(axis=1)
    Y = np.sqrt(n) * (sample_means - mu)
    ax.set_xlim(-3 * s, 3 * s)
    ax.set_ylim(0, 0.5)
    ax.hist(Y, bins=60, alpha=0.5, density=True)
    ax.set_title(f"CLT for $N = {n}$")
    x = np.linspace(-3 * s, 3 * s, 100)
    ax.plot(x, scipy.stats.norm.pdf(x, 0, s), 'r-', lw=2, alpha=0.7, label=r'$N(0, \sigma^2)$')
ani = FuncAnimation(fig,update,frames=range(1, 2000, 250), interval=500,blit=False, repeat=False)
plt.close()
IPython.display.HTML(ani.to_html5_video())

Joint Distributions

Joint Probability Distributions

  • Key concepts are marginal distributions, conditional distributions, independence, and conditional expectations
  • Will demonstrate with bivariate discretely valued distributions
    • Similar for multivariate distributions, except we replace sums with sums over multiple indices
    • Similar for continuous or mixed discrete-continuous distributions, except we replace sums with integrals
  • The interpretation of the joint distribution of \(X\) and \(Y\) is the probability that each pair of outcomes occurs
    • e.g., prob you get a cash transfer and are unemployed, don’t get a cash transfer and are unemployed, get a cash transfer and are employed, etc.

Bivariate Probability Distributions

  • Let \(X,Y\) be two discrete random variables that take values:

    \[ X\in\{1,\ldots,I\},\quad Y\in\{1,\ldots,J\} \]

  • Then their joint distribution is described by a matrix

    \[ P\equiv[\mathbb{P}(X=i,Y=j)]_{i=1\ldots I, j=1,\ldots J}\in \mathbb{R}^{I\times J} \]

  • Which fulfills the key axioms of probability

\[ \begin{aligned} p_{ij}\equiv\mathbb{P}(X=i,Y=j) &\geq 0\\ \sum_{i=1}^{I}\sum_{j=1}^{J}p_{ij}=1 \end{aligned} \]

Marginal Probability Distributions

  • The joint distribution induces marginal distributions

    \[ \begin{aligned} \mathbb{P}(X=i)& = \sum_{j=1}^{J}p_{ij} = \mu_i, \quad i=1,\ldots,I\\ \mathbb{P}(Y=j)&= \sum_{i=1}^{I}p_{ij} = \nu_j, \quad j=1,\ldots,J \end{aligned} \]

  • The marginal distributions are also probability distributions

    • i.e., \(\mu_i \geq 0\) and \(\sum_{i=1}^{I} \mu_i = 1\)
    • e.g. the probability you were given a conditional cash transfer regardless of your employment status

Conditional Probability

  • Conditional probabilities are defined according to

    \[ \mathbb{P}(A \,|\, B)=\frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)} \]

  • \(A \cap B\) is the event that both \(A\) and \(B\) occur, i.e., the intersection

    • e.g. you were given a conditional cash transfer AND you were unemployed
  • The conditional probability is the probability of \(A\) given \(B\) has occurred

    • e.g. the probability you were given a conditional cash transfer given you were unemployed
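
A minimal sketch of this calculation with hypothetical numbers (not from the slides) for the cash-transfer example:

# Hypothetical joint probabilities (illustrative only):
# rows = {transfer, no transfer}, columns = {unemployed, employed}
P_joint = np.array([[0.08, 0.12],
                    [0.02, 0.78]])
P_B = P_joint[:, 0].sum()  # P(unemployed) = P(B)
P_AB = P_joint[0, 0]       # P(transfer and unemployed) = P(A and B)
print(f"P(transfer | unemployed) = {P_AB / P_B:.2f}")  # 0.08 / 0.10 = 0.80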

Conditional Distributions

For a pair of discrete random variables, we have the conditional distribution

\[ \mathbb{P}(X=i|Y=j)=\frac{p_{ij}}{\sum_{i=1}^{I}p_{ij}} =\frac{\mathbb{P}(X=i, Y=j)}{\mathbb{P}(Y=j)} \]

  • Fix \(Y=j\); then the conditional distribution of \(X \,|\,Y=j\) is a probability distribution. It is trivially non-negative since \(p_{ij} \geq 0\). Verify it sums to \(1\)

\[ \sum_{i=1}^{I}\mathbb{P}(X=i\,|\,Y=j) =\frac{\sum_{i=1}^{I}p_{ij}}{\sum_{i=1}^{I}p_{ij}}=1 \]

Law of Total Probability

  • Law of Total Probability is a useful identity for conditional probabilities
    • Let \(A_1, \ldots, A_N\) be a partition of \(\Omega\)
    • i.e., \(\Omega = \cup_{i=1}^{N} A_i\) and \(A_i \cap A_j = \emptyset\) for \(i \neq j\)
  • Then for any event \(B\), \[ \mathbb{P}(B) = \sum_{i=1}^{N} \mathbb{P}(B \cap A_i) = \sum_{i=1}^{N} \mathbb{P}(B \,|\, A_i) \mathbb{P}(A_i) \]
    • e.g. the probability of being unemployed is the probability of being unemployed and getting a cash transfer plus the probability of being unemployed and not getting a cash transfer
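
A small numeric check of this identity, again with hypothetical numbers:

P_A = np.array([0.10, 0.90])          # P(unemployed), P(employed): a partition
P_B_given_A = np.array([0.80, 0.05])  # P(transfer | unemployed), P(transfer | employed)
P_B = P_B_given_A @ P_A               # P(B) = sum_i P(B | A_i) P(A_i)
print(f"P(transfer) = {P_B:.3f}")     # 0.8 * 0.1 + 0.05 * 0.9 = 0.125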

Statistical Independence

  • Random variables \(X \sim p\) and \(Y \sim g\) are statistically independent if

\[ \mathbb{P}(X=i,Y=j)=p_i g_j,\text{ for all } i, j \]

  • i.e., the joint distribution is the product of the marginal distributions
  • e.g., the probability you were given a conditional cash transfer AND you were unemployed is probability you were given a conditional cash transfer \(\times\) the probability you were unemployed
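
Under independence, the joint distribution can be built as the outer product of the marginals; a small sketch with illustrative marginals:

p = np.array([0.1, 0.8, 0.1])  # marginal of X (illustrative)
g = np.array([0.25, 0.75])     # marginal of Y (illustrative)
P_joint = np.outer(p, g)       # P(X=i, Y=j) = p_i * g_j
print(P_joint)
print(np.allclose(P_joint.sum(axis=1), p),
      np.allclose(P_joint.sum(axis=0), g))  # recover both marginals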

Conditional Distributions and Independence

  • When \(X\) and \(Y\) are independent, use the definitions of conditional and marginal distributions

\[ \begin{aligned} \mathbb{P}(X=i\,|\,Y=j) &=\frac{\mathbb{P}(X=i, Y=j)}{\mathbb{P}(Y=j)} =\frac{p_ig_j}{\sum_{i=1}^{I}p_ig_j}=\frac{p_ig_j}{g_j}=p_i \\ \mathbb{P}(Y=j\,|\,X=i) &=\frac{\mathbb{P}(X=i, Y=j)}{\mathbb{P}(X=i)} =\frac{p_ig_j}{\sum_{j=1}^{J}p_ig_j}=\frac{p_ig_j}{p_i}=g_j \end{aligned} \]

  • i.e., independence of \(X\) and \(Y\) implies the conditional distributions are just the marginals

  • \(\mathbb{P}(X=i\,|\,Y=j) = \mathbb{P}(X=i)\) and \(\mathbb{P}(Y=j\,|\,X=i) = \mathbb{P}(Y=j)\)

Notation for (Conditional) Independence

  • Let \(X, Y, Z\) be random variables

  • Common notation for independence is

    \[ \begin{aligned} X &\perp Y\\ \mathbb{P}(X=x, Y=y) &= \mathbb{P}(X=x) \mathbb{P}(Y=y) \end{aligned} \]

  • Common notation for conditional independence

    \[ \begin{aligned} X &\perp\!\!\!\perp Y \,|\, Z\\ \mathbb{P}(X=x, Y=y | Z=z) &= \mathbb{P}(X=x | Z=z) \mathbb{P}(Y=y | Z=z) \end{aligned} \]

  • Central to causal inference and treatment effects
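
A sketch with assumed numbers: build \(\mathbb{P}(X, Y, Z) = \mathbb{P}(Z)\,\mathbb{P}(X|Z)\,\mathbb{P}(Y|Z)\), so \(X \perp\!\!\!\perp Y \,|\, Z\) holds by construction, then check the factorization (and that \(X\) and \(Y\) are typically not unconditionally independent):

P_Z = np.array([0.4, 0.6])
P_X_given_Z = np.array([[0.9, 0.1],   # row z holds P(X = x | Z = z)
                        [0.2, 0.8]])
P_Y_given_Z = np.array([[0.7, 0.3],
                        [0.1, 0.9]])
joint = np.einsum('zx,zy,z->xyz', P_X_given_Z, P_Y_given_Z, P_Z)  # P(X=x, Y=y, Z=z)
for z in range(2):  # conditional on Z = z, the joint of (X, Y) factors
    P_xy_z = joint[:, :, z] / P_Z[z]
    print(np.allclose(P_xy_z, np.outer(P_X_given_Z[z], P_Y_given_Z[z])))  # True
P_XY = joint.sum(axis=2)  # unconditional joint of (X, Y)
print(np.allclose(P_XY, np.outer(P_XY.sum(axis=1), P_XY.sum(axis=0))))    # False here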

Classic Example of Conditional Independence

  • Let \(X\) be the number of cigarettes smoked per day
  • Let \(Y\) be the number of years of life remaining
  • Let \(Z\) be the number of years of smoking
  • Then \(X \perp\!\!\!\perp Y \,|\, Z\)
    • i.e., the number of cigarettes smoked per day is independent of the number of years of life remaining given the number of years of smoking

Simpson’s “Paradox”

  • Simpson’s paradox is a warning on composition effects
  • Recall the law of total probability \(\mathbb{P}(X = x| Y=y) = \sum_{z} \mathbb{P}(X = x| Y=y, Z=z) \mathbb{P}(Z=z| Y=y)\)
  • Let’s say you see \(\mathbb{P}(X| Y=y_1) > \mathbb{P}(X| Y=y_2)\)
    • Might suggest a positive relationship between \(X\) and \(Y\)?
  • If \(\mathbb{P}(X| Y=y_1, Z=z) < \mathbb{P}(X| Y=y_2, Z=z)\) for many \(z\) then
    • \(X\) and \(Y\) may have a negative relationship after conditioning on \(Z\)?

UC Berkeley Gender Bias: Overall Data

Total Applicants Admitted Men Applicants Men Admitted Women Applicants Women Admitted
12,763 41% 8,442 44% 4,321 35%
  • The 1973 UC Berkeley admissions data is a classic example of Simpson’s paradox
  • But if you look at individual departments the results are different
  • Seemed to show that 4 out of 85 departments had significant bias against women and 6 significant bias against men
  • But the biggest difference was in which departments women applied to
  • The following shows the top 6 departments to get a sense of heterogeneity

Conditional Probabilities for 6 Largest Departments

Dept All Applicants Admitted Men Applicants Men Admitted Women Applicants Women Admitted
A 933 64% 825 62% 108 82%
B 585 63% 560 63% 25 68%
C 918 35% 325 37% 593 34%
D 792 34% 417 33% 375 35%
E 584 25% 191 28% 393 24%
F 714 6% 373 6% 341 7%

Explanation Using Conditional Probabilities

Overall, \(\mathbb{P}(\text{Admitted | Men}) = 0.44\) and \(\mathbb{P}(\text{Admitted | Women}) = 0.35\)

But this is different when conditioning on departments!

  • \(\mathbb{P}(\text{Admitted | Men, A}) = 0.62\), \(\mathbb{P}(\text{Admitted | Women, A}) = 0.82\)
  • \(\mathbb{P}(\text{Admitted | Men, B}) = 0.63\), \(\mathbb{P}(\text{Admitted | Women, B}) = 0.68\)
  • “Paradox” because women tend to apply to more competitive departments
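
This composition effect can be checked directly from the department table with the law of total probability; a minimal sketch using the rounded percentages above (so the results are approximate):

men_apps   = np.array([825, 560, 325, 417, 191, 373])
men_rate   = np.array([0.62, 0.63, 0.37, 0.33, 0.28, 0.06])
women_apps = np.array([108, 25, 593, 375, 393, 341])
women_rate = np.array([0.82, 0.68, 0.34, 0.35, 0.24, 0.07])
# P(admit | gender) = sum_dept P(admit | gender, dept) * P(dept | gender)
men_overall = men_rate @ (men_apps / men_apps.sum())
women_overall = women_rate @ (women_apps / women_apps.sum())
print(f"P(admit | men)   = {men_overall:.2f}")    # about 0.45 in these six departments
print(f"P(admit | women) = {women_overall:.2f}")  # about 0.30, despite comparable per-department rates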

Does this Old Data Imply There was No Bias?

  • All data requires assumptions to interpret! Most assumptions are implicit, so you need to reflect on what assumptions you may have made
  • This simply corrected for the mechanical composition effect
  • Interpreting bias better requires reflecting on your “model” and assumptions
    • Is average applicant quality identical conditional on department and gender? Especially in 1973, when there was enormous selection bias?
    • What if bias leads women to apply to the more competitive departments?

Bayes’ Law

Conditional probability is used for Bayes’ Law:

\[ \mathbb{P}(A \,|\, B)=\frac{\mathbb{P}(B \,|\, A)\mathbb{P}(A)}{\mathbb{P}(B)} \]

Sometimes:

  • \(\mathbb{P}(B\,|\,A)\) is called the “likelihood”
  • \(\mathbb{P}(A)\) is called the “prior”
  • \(\mathbb{P}(A\,|\,B)\) is called the “posterior”
  • \(\mathbb{P}(B)\) is called the “marginal likelihood”, which normalizes the expression

Example with Bayes’ Law

\(A\) is the event of being unemployed, \(B\) is the event of getting a cash transfer

  • \(\mathbb{P}(B\,|\,A)\) is the probability of being given a cash transfer given you were unemployed
  • \(\mathbb{P}(A)\) is the probability of being unemployed within the whole distribution
  • \(\mathbb{P}(A\,|\,B)\) is the probability of being unemployed given you were given a cash transfer
  • \(\mathbb{P}(B)\) is the probability of being given a cash transfer within the whole distribution
  • Bayes’ law: probability of being unemployed given you were given a cash transfer \(\propto\) probability of being given cash transfer given you were unemployed \(\times\) probability of being unemployed

Bayes Law with Bivariate Random Variables

  • For discrete bivariate random variables, we can write Bayes’ Law as

    \[ \mathbb{P}(X=i\,|\,Y=j)=\frac{\mathbb{P}(X=i,Y=j)}{\mathbb{P}(Y=j)}=\frac{\mathbb{P}(Y=j\,|\,X=i)\mathbb{P}(X=i)}{\mathbb{P}(Y=j)} \]

  • If \(X\) and \(Y\) are independent

    • \(\mathbb{P}(Y=j\,|\,X=i) = \mathbb{P}(Y=j)\)
    • Bayes’ Law simplifies to just the marginal distribution

    \[ \mathbb{P}(X=i\,|\,Y=j) = \mathbb{P}(X=i) \]

Calculating Marginal Distributions

  • Let’s create a bivariate \(\mathbb{P}(X=i, Y=j)\) with \(I=5\) and \(J=4\)
  • Use matrix \(P\) to calculate \(\mathbb{P}(X=i)\) and \(\mathbb{P}(Y=j)\)
np.set_printoptions(precision=3)
P = np.array([[0.05, 0.07, 0.02, 0.01],
              [0.04, 0.1,  0.06, 0.03],
              [0.08, 0.09, 0.07, 0.04],
              [0.02, 0.03, 0.02, 0.01],
              [0.09, 0.08, 0.04, 0.05]])

print(f"sum = 1? {np.isclose(P.sum(), 1.0)}")
print(f"p_ij >= 0? {np.all(P >= 0)}")
margin_x = P.sum(axis=1) # sum over j
margin_y = P.sum(axis=0) # sum over i
print(f"P(X=i) = {margin_x}")
print(f"Sum_i P(X=i) = {margin_x.sum()}")
print(f"P(Y=j) = {margin_y}")
print(f"Sum_j P(Y=j) = {margin_y.sum()}")
sum = 1? True
p_ij >= 0? True
P(X=i) = [0.15 0.23 0.28 0.08 0.26]
Sum_i P(X=i) = 0.9999999999999999
P(Y=j) = [0.28 0.37 0.21 0.14]
Sum_j P(Y=j) = 1.0

Calculating Conditional Distributions

  • Now use \(P\) to calculate \(\mathbb{P}(X=i\,|\,Y=j)\), etc.
print(f"P(X=i|Y=1)=\n{P[:,0] / margin_y[0]}\,")
cond_x_y = np.row_stack(
  [P[:,i] / margin_y[i] for i in range(4)])
# or (P / margin_y[np.newaxis, :]).T
cond_y_x = np.row_stack(
  [P[j,:] / margin_x[j] for j in range(5)])
# or (P.T / margin_x[np.newaxis, :]).T
print(f"P(X=i|Y=2)=\n{cond_x_y[:, 1]}")
print(f"sum_i P(X=i|Y=2)=\
{cond_x_y[1,:].sum():.2f}")
print(f"P(Y=j|X=1)=\n{cond_y_x[:, 0]}")
P(X=i|Y=1)=
[0.179 0.143 0.286 0.071 0.321]\,
P(X=i|Y=2)=
[0.143 0.27  0.286 0.214]
sum_i P(X=i|Y=2)=1.00
P(Y=j|X=1)=
[0.333 0.174 0.286 0.25  0.346]

Check Bayes’ Law

\[ \mathbb{P}(X=1\,|\,Y=2) = \frac{\mathbb{P}(Y=2\,|\,X=1)\mathbb{P}(X=1)}{\mathbb{P}(Y=2)} \]

x = 1
y = 2
p_y_x = cond_y_x[x-1, y-1]
p_x = margin_x[x-1]
p_y = margin_y[y-1]
p_x_y = cond_x_y[y-1, x-1]
p_bayes = p_y_x * p_x / p_y
print(f"P(Y={y}|X={x}) = {p_y_x:.2g}")
print(f"P(X={x}) = {p_x:.2g}")
print(f"P(Y={y}) = {p_y:.2g}")
print(f"P(X={x}|Y={y})={p_x_y:.2g}")
print(f"P(Y={y}|X={x})P(X={x})\
/P(Y={y})={p_bayes:.2g}")
P(Y=2|X=1) = 0.47
P(X=1) = 0.15
P(Y=2) = 0.37
P(X=1|Y=2)=0.19
P(Y=2|X=1)P(X=1)/P(Y=2)=0.19

Conditional Expectations

Conditional Expectation

  • Recall: for each fixed \(j\), \(\mathbb{P}(X=i\,|\,Y=j)\) is itself a probability distribution as we vary \(i\)

  • A conditional expectation is an expectation using the conditional probability distribution. For a discrete random variable \(X\) and \(Y\), \[ \mathbb{E}[X\,|\,Y=j] = \sum_{i=1}^{I} i\, \mathbb{P}(X=i\,|\,Y=j) \]

  • If \(X\) and \(Y\) are independent then

    • Recall that \(\mathbb{P}(X=i\,|\,Y=j) = \mathbb{P}(X=i)\)
    • Which implies \(\mathbb{E}[X\,|\,Y=j] = \mathbb{E}[X]\)
    • That the expected value of \(X\) is the same regardless of the value of \(Y\)

Key Properties of Expectations

  • Let \(A\) and \(B\) be scalar/vector/matrix constants, and \(X\) and \(Y\) are scalar/vector/matrix random variables

  • Expectations are linear operators, which gives us some useful properties

    • \(\mathbb{E}[A X + B Y] = A \mathbb{E}[X] + B \mathbb{E}[Y]\)
  • \(\mathbb{E}[X Y] \neq \mathbb{E}[X] \mathbb{E}[Y]\) in general

    • But if \(X\) and \(Y\) are independent, then \(\mathbb{E}[X Y] = \mathbb{E}[X] \mathbb{E}[Y]\)
  • \(\mathbb{E}[f(X)] \neq f(\mathbb{E}[X])\) in general

    • Unless \(f(\cdot)\) is linear or if \(X\) is degenerate (i.e., a constant)
  • Jensen’s Inequality: If \(f(\cdot)\) is a convex function, then \(\mathbb{E}[f(X)] \geq f(\mathbb{E}[X])\)
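
These properties are easy to check numerically; a quick sketch reusing the income example values from earlier:

p = np.array([0.1, 0.8, 0.1])
x = np.array([15000, 40000, 10000])
E_x = p @ x
print(np.isclose(p @ (2 * x + 5), 2 * E_x + 5))  # linearity: E[2X + 5] = 2 E[X] + 5
print(p @ np.sqrt(x) <= np.sqrt(E_x))            # sqrt is concave: E[f(X)] <= f(E[X])
print(p @ x**2 >= E_x**2)                        # x^2 is convex:  E[f(X)] >= f(E[X])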

Law of Total Expectations

  • Let \(\{A_1, \ldots, A_N\}\) be a partition of \(\Omega\). For any random variable \(X\),
  • Law of Total Expectations

\[ \mathbb{E}[X] = \sum_{i=1}^{N} \mathbb{E}[X \,|\, A_i] \mathbb{P}(A_i) \]

  • e.g. the expected value of income is the expected value of income given you were unemployed times the probability of being unemployed plus the expected value of income given you were employed times the probability of being employed

Conditional Expectations and Iterated Expectations

  • Same properties all hold e.g. \(\mathbb{E}[A X + B Y \,|\, Z] = A \mathbb{E}[X \,|\, Z] + B \mathbb{E}[Y \,|\, Z]\)

  • Conditional expectations are themselves random variables if the conditioning variable is, e.g. \(\mathbb{E}[X \,|\, Y]\) is a random variable that depends on \(Y\)

  • Law of Iterated Expectations

    \[ \mathbb{E}\left[\mathbb{E}[X\,|\,Y]\right] = \mathbb{E}[X] \]

    • The expected value of \(X\) is the average of the conditional expectations of \(X\) given \(Y\) over the distribution of \(Y\)
    • Similarly for conditionals: \(\mathbb{E}\left[\mathbb{E}[X\,|\,Y, Z]\, |\, Z\right] = \mathbb{E}[X\,|\,Z]\)

Calculating Conditional Expectations

Assign an RV to each state then find \(\mathbb{E}[X\,|\,Y=1]\)

P = np.array([[0.05, 0.07, 0.02, 0.01],
              [0.04, 0.1,  0.06, 0.03],
              [0.08, 0.09, 0.07, 0.04],
              [0.02, 0.03, 0.02, 0.01],
              [0.09, 0.08, 0.04, 0.05]])
margin_x = P.sum(axis=1)
margin_y = P.sum(axis=0)
cond_x_y = (P / margin_y[np.newaxis, :]).T
cond_y_x = (P.T / margin_x[np.newaxis, :]).T
# Give RV values to states
vals_x = np.arange(P.shape[0]) + 1
vals_y = np.arange(P.shape[1]) + 1

print("E(X | Y = 1) =",
  np.sum([vals_x[i]*cond_x_y[0,i]
          for i in range(0,5)]))
E(X | Y = 1) = 3.2142857142857144

Conditional Expectations and the Law of Iterated Expectations

E_x_y = np.array([
  np.sum([vals_x[i]*cond_x_y[j,i]
          for i in range(0,5)])
  for j in range(0,4)])
E_y_x = np.array([
  np.sum([vals_y[j]*cond_y_x[i,j]
          for j in range(0,4)])
  for i in range(0,5)])      
# Or use np broadcasting with *
E_x_y = np.sum(vals_x * cond_x_y, axis=1)
E_y_x = np.sum(vals_y * cond_y_x, axis=1)
print("E(X | Y = j) =", E_x_y)
print("E(Y | X = i) =", E_y_x)
print(f"E(X) = {vals_x @ margin_x:.3g},\
  E(E(X|Y)) = {E_x_y @ margin_y}")
E(X | Y = j) = [3.214 2.865 3.    3.429]
E(Y | X = i) = [1.933 2.348 2.25  2.25  2.192]
E(X) = 3.07,  E(E(X|Y)) = 3.0700000000000003

Tips for using Numpy Broadcasting

  1. Don’t
    • Loops, list comprehensions (e.g., [x[i, i+1] for i in range(5)]), or a combination are usually clearer and often fast enough
  2. Write the slow version first
  3. Ask Github Copilot or ChatGPT to do a numpy broadcasting version
  4. Test it for a few values! Easy to make mistakes
  5. For more advanced usage you may be working in an ML library. If so, then functions such as jax.vmap or torch.vmap help