Probability and Uncertainty
To formalize probability, always be careful to separate the outcomes and events, the probability measure, and the random variables defined on them:
A probability space is a pair \((\Omega, \mathcal{A})\):
A set \(\Omega\) of possible outcomes, where \(\omega \in \Omega\) is a particular outcome
Subsets \(A \subseteq \Omega\) are events
e.g. \(\Omega \setminus (A \cup B)\) is the event that neither \(A\) nor \(B\) occurs
The collection of all possible events is \(\mathcal{A}\), where \(A \in \mathcal{A}\)
A Probability Measure is a function which assigns a numerical value to the likelihood of an event
For us, \(\mathbb{P} : \mathcal{A} \rightarrow [0,1]\)
Will see it denoted as a function \(\mu(A)\) for integrals in more advanced uses
Random Variable: \(X(\omega)\) assigns a numerical value to a particular outcome
\(X : \Omega \rightarrow \mathbb{R}\), but could be vector or matrix valued
e.g. \(X(\omega = E) = \$40,000\) if employed, \(X(\omega = U) = \$15,000\) if unemployed. Useful for finding average incomes
Random variables are defined on \(\Omega\), and inherit the probability measure
A probability measure \(\mathbb{P}\) on the probability space \((\Omega, \mathcal{A})\) must satisfy the axioms:
\(\mathbb{P}(A) \geq 0\) for every event \(A \in \mathcal{A}\)
\(\mathbb{P}(\Omega) = 1\)
\(\mathbb{P}\left(\bigcup_{i} A_i\right) = \sum_{i} \mathbb{P}(A_i)\) for any countable collection of disjoint events \(A_1, A_2, \ldots\)
These imply other results such as \(\mathbb{P}(\emptyset) = 0\), \(\mathbb{P}(\Omega \setminus A) = 1 - \mathbb{P}(A)\), and \(\mathbb{P}(A) \leq \mathbb{P}(B)\) whenever \(A \subseteq B\)
Discrete probability spaces have a finite (or countable) number of outcomes
When convenient, we can number the outcomes arbitrarily as \(n = 1, \ldots, N\) (or \(\infty\)) and then work with \(\Omega = \{1, \ldots, N\}\) and \(\omega \in \Omega\)
The axioms are especially simple here: writing \(\mathbb{P}(\omega = n) = p_n\), we only need \(p_n \geq 0\) and \(\sum_{n=1}^{N} p_n = 1\)
Notation can become a little confusing because we will sometimes use the same index number for the event and for the random value, but they are separate!
Frequently we will take the random variable to be just that index, i.e. \(X(\omega = n) = n\)
Other times we may want to associate a value \(x_n\) with each outcome
Probability Mass Function (PMF) is the probability of a single outcome for random variable \(X\). Will assume \(X\) itself has discrete values \[ p_n \equiv \mathbb{P}(X = n) \]
Cumulative Distribution Function (CDF) is the probability of all outcomes less than or equal to a particular outcome. \[ \mathbb{P}(X \leq n) = \sum\limits_{i=1}^{n} p_i \]
Vectors can help with the accounting and notation of expectations. Let \(x \equiv \begin{bmatrix} x_1 & \cdots & x_N \end{bmatrix}^{\top}\) and \(p \equiv \begin{bmatrix} p_1 & \cdots & p_N \end{bmatrix}^{\top}\). Then
\[ \begin{aligned} \mathbb{E}[X] &= \sum_{n=1}^N x_n \mathbb{P}(X = x_n) = p \cdot x = p^{\top} x\\ \mathbb{E}[f(X)] &= \sum_{n=1}^N f(x_n) \mathbb{P}(X = x_n) = p \cdot f(x) = p^{\top} f(x) \end{aligned} \]
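For instance, a minimal sketch of these calculations with NumPy; the values, probabilities, and \(f(x) = \sqrt{x}\) below are hypothetical, chosen to be consistent with the printed output:
import numpy as np

x = np.array([15000, 40000, 10000])  # hypothetical values for each outcome
p = np.array([0.1, 0.8, 0.1])        # hypothetical probabilities
f = np.sqrt                          # an example function of X

print(f"E(X) = {p @ x}")             # expectation as the inner product p'x
print(f"E(f(X)) = {p @ f(x)}")       # expectation of a function of X
print(f"CDF(X) = {np.cumsum(p)}")    # CDF as cumulative sums of p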
E(X) = 34500.0
E(f(X)) = 182.2474487139159
CDF(X) = [0.1 0.9 1. ]
Note that the CDF was easy to calculate as cumulative sums. Interpretable?
scipy.stats has an rv_discrete type with built-in functions for the mean, expectations of functions, the CDF, and sampling
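A minimal sketch of that interface, reusing the same hypothetical values and probabilities (the sampled draws depend on the random seed):
import numpy as np
import scipy.stats

x = np.array([15000, 40000, 10000])  # hypothetical values, as above
p = np.array([0.1, 0.8, 0.1])        # hypothetical probabilities, as above
dist = scipy.stats.rv_discrete(values=(x, p))

print(f"E(X) = {dist.mean()}")
print(f"E(f(X)) = {dist.expect(np.sqrt)}")
print(f"CDF(X) = {dist.cdf(np.sort(x))}")    # CDF evaluated at the sorted support
print(f"Samples of X = {dist.rvs(size=5)}")  # draws depend on the seed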
E(X) = 34500.0
E(f(X)) = 182.2474487139159
CDF(X) = [0.1 0.2 1. ]
Samples of X = [15000 40000 40000 40000 40000]
For \(n = 0, \ldots, N\), the binomial distribution is defined by the PMF
\[ \displaystyle \begin{aligned} \mathbb{P}(X = n) &= \binom{N}{n} \theta^n (1-\theta)^{N-n}\\ \mathbb{E}(X) &= \sum_{n=0}^N n \binom{N}{n} \theta^n (1-\theta)^{N-n} = N \theta \end{aligned} \]
\[ \mathbb{P}(X \leq n) = \sum_{i=0}^n \binom{N}{i} \theta^i (1-\theta)^{N-i} \]
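As a quick check, scipy.stats.binom implements this PMF, CDF, and mean directly; the parameters below are purely illustrative:
import numpy as np
import scipy.stats

N, theta = 10, 0.3                   # illustrative parameters
dist = scipy.stats.binom(N, theta)
n = np.arange(N + 1)

print(dist.pmf(n).sum())             # PMF sums to 1 over n = 0,...,N
print(dist.cdf(3))                   # P(X <= 3)
print(dist.mean(), N * theta)        # E(X) equals N * theta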
A classic LLN is the Strong Law of Large Numbers
Take a sequence of independent, identically distributed random variables \(X_1, X_2, \ldots\) with \(\mathbb{E}[X_i] = \mu\) and \(\mathbb{V}[X_i] = \sigma^2 < \infty\). Then
\[ \frac{1}{N} \sum_{i=1}^{N} X_i \to \mu \text{ almost surely as } N \to \infty \]
Powerful and frequently used, but remember the assumptions!
N = 20 # Number of samples
mu, sigma = 0, 1
np.random.seed(42)
samples = np.random.normal(mu, sigma, N)
sample_means = np.cumsum(samples) / np.arange(1, N + 1)
plt.scatter(range(1, N + 1), samples, label='Individual Samples', alpha=0.6, s=10)
plt.plot(range(1, N + 1), sample_means, label='Sample Mean', linewidth=2)
plt.axhline(mu, color='r', linestyle='--', label='True Mean')
for n in range(N): # add lines to samples from sample mean
    plt.plot([n + 1, n + 1], [sample_means[n], samples[n]], color='gray', linewidth=0.5, alpha=0.6)
plt.xlabel('Sample Number $N$')
plt.legend()
plt.show()
N = 100 # Number of samples
mu, sigma = 0, 1
np.random.seed(42)
samples = np.random.normal(mu, sigma, N)
sample_means = np.cumsum(samples) / np.arange(1, N + 1)
plt.scatter(range(1, N + 1), samples, label='Individual Samples', alpha=0.6, s=10)
plt.plot(range(1, N + 1), sample_means, label='Sample Mean', linewidth=2)
plt.axhline(mu, color='r', linestyle='--', label='True Mean')
for n in range(N): # add lines to samples from sample mean
    plt.plot([n + 1, n + 1], [sample_means[n], samples[n]], color='gray', linewidth=0.5, alpha=0.6)
plt.xlabel('Sample Number $N$')
plt.legend()
plt.show()
Pareto distributions are a family of distributions with a power-law tail
Parameterized by \((x_m, \alpha)\) with the PDF \[ p(x) = \frac{\alpha x_m^{\alpha}}{x^{\alpha+1}}, \quad x \geq x_m \]
The mean is \(\mathbb{E}[X] = \frac{\alpha x_m}{\alpha - 1}\) for \(\alpha > 1\)
The variance is \(\mathbb{V}[X] = \frac{\alpha x_m^2}{(\alpha - 1)^2(\alpha - 2)}\) for \(\alpha > 2\)
N = 100 # Number of samples
alpha = 1.5
np.random.seed(42)
dist = scipy.stats.pareto(alpha)
samples = dist.rvs(size=N)
sample_means = np.cumsum(samples) / np.arange(1, N + 1)
plt.scatter(range(1, N + 1), samples, label='Individual Samples', alpha=0.6, s=10)
plt.plot(range(1, N + 1), sample_means, label='Sample Mean', linewidth=2)
plt.axhline(dist.mean(), color='r', linestyle='--', label='True Mean')
for n in range(N): # add lines to samples from sample mean
    plt.plot([n + 1, n + 1], [sample_means[n], samples[n]], color='gray', linewidth=0.5, alpha=0.6)
plt.xlabel('Sample Number $N$')
plt.legend()
plt.show()
The Central Limit Theorem (CLT) is a classic result in statistics
Again, let's assume we have IID observations with \(\mathbb{E}[X_i] = \mu\) and \(\mathbb{V}[X_i] = \sigma^2 < \infty\)
Define the sample mean \(\bar{X}_N \equiv \frac{1}{N} \sum_{i=1}^N X_i\)
Then the CLT is
\[ \sqrt{N} \left( \bar{X}_N - \mu \right) \xrightarrow{d} \mathcal{N}(0, \sigma^2) \]
# assumes an earlier setup cell with the usual imports (numpy as np, matplotlib.pyplot as plt,
# scipy.stats, matplotlib.animation.FuncAnimation, IPython) and a frozen scipy.stats
# `distribution` with mean `mu` and standard deviation `s`
fig, ax = plt.subplots()
def update(n):
    ax.clear()
    data = distribution.rvs((5000, n))
    sample_means = data.mean(axis=1)
    Y = np.sqrt(n) * (sample_means - mu)
    ax.set_xlim(-3 * s, 3 * s)
    ax.set_ylim(0, 0.5)
    ax.hist(Y, bins=60, alpha=0.5,
            density=True)
    ax.set_title(f"CLT for $N = {n}$")
ani = FuncAnimation(fig,update,
frames=range(1, 100, 5),
interval=500,blit=False, repeat=False)
plt.close()
IPython.display.HTML(ani.to_html5_video())
fig, ax = plt.subplots()
def update(n):
    ax.clear()
    data = distribution.rvs((10000, n))
    sample_means = data.mean(axis=1)
    Y = np.sqrt(n) * (sample_means - mu)
    ax.set_xlim(-3 * s, 3 * s)
    ax.set_ylim(0, 0.5)
    ax.hist(Y, bins=60, alpha=0.5, density=True)
    ax.set_title(f"CLT for $N = {n}$")
    x = np.linspace(-3 * s, 3 * s, 100)
    ax.plot(x, scipy.stats.norm.pdf(x, 0, s), 'r-', lw=2, alpha=0.7, label='N(0, 1)')
ani = FuncAnimation(fig,update,frames=range(1, 2000, 250), interval=500,blit=False, repeat=False)
plt.close()
IPython.display.HTML(ani.to_html5_video())
Let \(X,Y\) be two discrete random variables that take values:
\[ X\in\{1,\ldots,I\},\quad Y\in\{1,\ldots,J\} \]
Then their joint distribution is described by a matrix
\[ P\equiv[\mathbb{P}(X=i,Y=j)]_{i=1\ldots I, j=1,\ldots J}\in \mathbb{R}^{I\times J} \]
which satisfies the key axioms of probability
\[ \begin{aligned} p_{ij}\equiv\mathbb{P}(X=i,Y=j) &\geq 0\\ \sum_{i=1}^{I}\sum_{j=1}^{J}p_{ij} &= 1 \end{aligned} \]
The joint distribution induces marginal distributions
\[ \begin{aligned} \mathbb{P}(X=i)& = \sum_{j=1}^{J}p_{ij} = \mu_i, \quad i=1,\ldots,I\\ \mathbb{P}(Y=j)&= \sum_{i=1}^{I}p_{ij} = \nu_j, \quad j=1,\ldots,J \end{aligned} \]
The marginal distributions are also probability distributions
Conditional probabilities are defined according to
\[ \mathbb{P}(A \,|\, B)=\frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)} \]
\(A \cap B\) is the event that both \(A\) and \(B\) occur, i.e., the intersection
The conditional probability is the probability of \(A\) given \(B\) has occurred
For a pair of discrete random variables, we have the conditional distribution
\[ \mathbb{P}(X=i|Y=j)=\frac{p_{ij}}{\sum_{i=1}^{I}p_{ij}} =\frac{\mathbb{P}(X=i, Y=j)}{\mathbb{P}(Y=j)} \]
\[ \sum_{i=1}^{I}\mathbb{P}(X=i\,|\,Y=j) =\frac{\sum_{i=1}^{I}p_{ij}}{\sum_{i=1}^{I}p_{ij}}=1 \]
\(X\) and \(Y\) are independent if the joint distribution factors into the product of the marginals \(p_i \equiv \mathbb{P}(X=i)\) and \(g_j \equiv \mathbb{P}(Y=j)\):
\[ \mathbb{P}(X=i,Y=j)=p_i g_j,\text{ for all } i, j \]
\[ \begin{aligned} \mathbb{P}(X=i\,|\,Y=j) &=\frac{\mathbb{P}(X=i, Y=j)}{\mathbb{P}(Y=j)} =\frac{p_ig_j}{\sum_{i=1}^{I}p_ig_j}=\frac{p_ig_j}{g_j}=p_i \\ \mathbb{P}(Y=j\,|\,X=i) &=\frac{\mathbb{P}(X=i, Y=j)}{\mathbb{P}(X=i)} =\frac{p_ig_j}{\sum_{j=1}^{J}p_ig_j}=\frac{p_ig_j}{p_i}=g_j \end{aligned} \] - i.e., independence of \(X\) and \(Y\) implies the conditional distributions equal the marginals
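A small sketch of this factorization with hypothetical marginals: build the joint as an outer product, then check that each conditional distribution of \(X\) equals its marginal:
import numpy as np

p = np.array([0.2, 0.5, 0.3])  # hypothetical marginal of X
g = np.array([0.6, 0.4])       # hypothetical marginal of Y
P = np.outer(p, g)             # joint with entries p_i * g_j, i.e. independence

margin_y = P.sum(axis=0)
cond_x_given_y = P / margin_y  # column j is P(X=i | Y=j)
print(P.sum())                 # 1.0, so P is a valid joint distribution
print(cond_x_given_y)          # every column equals p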
Let \(X, Y, Z\) be random variables
Common notation for independence is
\[ \begin{aligned} X &\perp Y\\ \mathbb{P}(X=x, Y=y) &= \mathbb{P}(X=x) \mathbb{P}(Y=y) \end{aligned} \]
Common notation for conditional independence
\[ \begin{aligned} X &\perp\!\!\!\perp Y \,|\, Z\\ \mathbb{P}(X=x, Y=y | Z=z) &= \mathbb{P}(X=x | Z=z) \mathbb{P}(Y=y | Z=z) \end{aligned} \]
Central to causal inference and treatment effects
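A tiny numerical sketch with made-up numbers: \(X\) and \(Y\) are conditionally independent given \(Z\), yet marginally dependent because both depend on \(Z\):
import numpy as np

p_z = np.array([0.5, 0.5])    # made-up P(Z=z)
theta = np.array([0.2, 0.8])  # P(X=1|Z=z) = P(Y=1|Z=z); X and Y independent given Z

p_xy = np.sum(p_z * theta * theta)  # P(X=1, Y=1), summing over z
p_x = np.sum(p_z * theta)           # marginal P(X=1)
p_y = np.sum(p_z * theta)           # marginal P(Y=1)

print(p_xy)        # 0.34
print(p_x * p_y)   # 0.25, so X and Y are not (unconditionally) independent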
| Total Applicants | Admitted | Men Applicants | Men Admitted | Women Applicants | Women Admitted |
|---|---|---|---|---|---|
| 12,763 | 41% | 8,442 | 44% | 4,321 | 35% |
| Dept | All Applicants | Admitted | Men Applicants | Men Admitted | Women Applicants | Women Admitted |
|---|---|---|---|---|---|---|
| A | 933 | 64% | 825 | 62% | 108 | 82% |
| B | 585 | 63% | 560 | 63% | 25 | 68% |
| C | 918 | 35% | 325 | 37% | 593 | 34% |
| D | 792 | 34% | 417 | 33% | 375 | 35% |
| E | 584 | 25% | 191 | 28% | 393 | 24% |
| F | 714 | 6% | 373 | 6% | 341 | 7% |
Overall, \(\mathbb{P}(\text{Admitted}\,|\,\text{Men}) = 0.44\) and \(\mathbb{P}(\text{Admitted}\,|\,\text{Women}) = 0.35\)
But this is different when conditioning on departments!
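As a sketch of the arithmetic, we can pool just the six departments shown above (they are only a subset of the campus-wide totals in the first table):
import numpy as np

# applicants and admission rates per department, read off the table above
men_apps = np.array([825, 560, 325, 417, 191, 373])
men_rate = np.array([0.62, 0.63, 0.37, 0.33, 0.28, 0.06])
women_apps = np.array([108, 25, 593, 375, 393, 341])
women_rate = np.array([0.82, 0.68, 0.34, 0.35, 0.24, 0.07])

# pooled admission rates: rates weighted by each department's applicant share
print(men_apps @ men_rate / men_apps.sum())        # roughly 0.45
print(women_apps @ women_rate / women_apps.sum())  # roughly 0.30
Within most departments women are admitted at similar or higher rates, but they apply disproportionately to departments with low admission rates, so the pooled rate is lower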
Conditional probability is used for Bayes’ Law:
\[ \mathbb{P}(A \,|\, B)=\frac{\mathbb{P}(B \,|\, A)\mathbb{P}(A)}{\mathbb{P}(B)} \]
For example:
\(A\) is the event of being unemployed, \(B\) is the event of getting a cash transfer
For discrete bi-variate random variables, we can write Bayes’ Law as
\[ \mathbb{P}(X=i\,|\,Y=j)=\frac{\mathbb{P}(X=i,Y=j)}{\mathbb{P}(Y=j)}=\frac{\mathbb{P}(Y=j\,|\,X=i)\mathbb{P}(X=i)}{\mathbb{P}(Y=j)} \]
If \(X\) and \(Y\) are independent
\[ \mathbb{P}(X=i\,|\,Y=j) = \mathbb{P}(X=i) \]
np.set_printoptions(precision=3)
P = np.array([[0.05, 0.07, 0.02, 0.01],
[0.04, 0.1, 0.06, 0.03],
[0.08, 0.09, 0.07, 0.04],
[0.02, 0.03, 0.02, 0.01],
[0.09, 0.08, 0.04, 0.05]])
print(f"sum = 1? {np.isclose(P.sum(), 1.0)}")
print(f"p_ij >= 0? {np.all(P >= 0)}")
margin_x = P.sum(axis=1) # sum over j
margin_y = P.sum(axis=0) # sum over i
print(f"P(X=i) = {margin_x}")
print(f"Sum_i P(X=i) = {margin_x.sum()}")
print(f"P(Y=j) = {margin_y}")
print(f"Sum_j P(Y=j) = {margin_y.sum()}")
sum = 1? True
p_ij >= 0? True
P(X=i) = [0.15 0.23 0.28 0.08 0.26]
Sum_i P(X=i) = 0.9999999999999999
P(Y=j) = [0.28 0.37 0.21 0.14]
Sum_j P(Y=j) = 1.0
print(f"P(X=i|Y=1)=\n{P[:,0] / margin_y[0]}")
cond_x_y = np.vstack(
    [P[:,i] / margin_y[i] for i in range(4)])
# or (P / margin_y[np.newaxis, :]).T
cond_y_x = np.vstack(
    [P[j,:] / margin_x[j] for j in range(5)])
# or (P.T / margin_x[np.newaxis, :]).T
print(f"P(X=i|Y=2)=\n{cond_x_y[1, :]}")
print(f"sum_i P(X=i|Y=2)=\
{cond_x_y[1,:].sum():.2f}")
print(f"P(Y=j|X=1)=\n{cond_y_x[0, :]}")
P(X=i|Y=1)=
[0.179 0.143 0.286 0.071 0.321]
P(X=i|Y=2)=
[0.189 0.27  0.243 0.081 0.216]
sum_i P(X=i|Y=2)=1.00
P(Y=j|X=1)=
[0.333 0.467 0.133 0.067]
\[ \mathbb{P}(X=1\,|\,Y=2) = \frac{\mathbb{P}(Y=2\,|\,X=1)\mathbb{P}(X=1)}{\mathbb{P}(Y=2)} \]
x = 1
y = 2
p_y_x = cond_y_x[x-1, y-1]
p_x = margin_x[x-1]
p_y = margin_y[y-1]
p_x_y = cond_x_y[y-1, x-1]
p_bayes = p_y_x * p_x / p_y
print(f"P(Y={y}|X={x}) = {p_y_x:.2g}")
print(f"P(X={x}) = {p_x:.2g}")
print(f"P(Y={y}) = {p_y:.2g}")
print(f"P(X={x}|Y={y})={p_x_y:.2g}")
print(f"P(Y={y}|X={x})P(X={x})\
/P(Y={y})={p_bayes:.2g}")
P(Y=2|X=1) = 0.47
P(X=1) = 0.15
P(Y=2) = 0.37
P(X=1|Y=2)=0.19
P(Y=2|X=1)P(X=1)/P(Y=2)=0.19
Recall: \(\mathbb{P}(X=i\,|\,Y=j)\) is itself a probability distribution if we vary \(i\) while holding \(j\) fixed
A conditional expectation is an expectation using the conditional probability distribution. For discrete random variables \(X\) and \(Y\), \[ \mathbb{E}[X\,|\,Y=j] = \sum_{i=1}^{I} i\, \mathbb{P}(X=i\,|\,Y=j) \]
If \(X\) and \(Y\) are independent then \(\mathbb{E}[X\,|\,Y=j] = \mathbb{E}[X]\) for all \(j\)
Let \(A\) and \(B\) be scalar/vector/matrix constants, and let \(X\) and \(Y\) be scalar/vector/matrix random variables
Expectations are linear operators, which gives us useful properties such as \(\mathbb{E}[A X + B Y] = A \mathbb{E}[X] + B \mathbb{E}[Y]\)
\(\mathbb{E}[X Y] \neq \mathbb{E}[X] \mathbb{E}[Y]\) in general
\(\mathbb{E}[f(X)] \neq f(\mathbb{E}[X])\) in general
Jensen’s Inequality: If \(f(\cdot)\) is a convex function, then \(\mathbb{E}[f(X)] \geq f(\mathbb{E}[X])\)
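A quick numeric check with made-up numbers, using the convex function \(f(x) = x^2\):
import numpy as np

x = np.array([1.0, 2.0, 3.0])  # made-up values of X
p = np.array([0.2, 0.5, 0.3])  # made-up probabilities

E_X = p @ x            # 2.1
E_fX = p @ x**2        # 4.9
print(E_fX >= E_X**2)  # True: 4.9 >= 4.41, consistent with Jensen's inequality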
Law of Total Expectation: for events \(A_1, \ldots, A_N\) that partition \(\Omega\), \[ \mathbb{E}[X] = \sum_{i=1}^{N} \mathbb{E}[X \,|\, A_i] \mathbb{P}(A_i) \]
Same properties all hold e.g. \(\mathbb{E}[A X + B Y \,|\, Z] = A \mathbb{E}[X \,|\, Z] + B \mathbb{E}[Y \,|\, Z]\)
Conditional expectations are themselves random variables when the conditioning variable is random, e.g. \(\mathbb{E}[X \,|\, Y]\) is a random variable in \(Y\)
Law of Iterated Expectations
\[ \mathbb{E}\left[\mathbb{E}[X\,|\,Y]\right] = \mathbb{E}[X] \]
Assign RV values to each state, then find \(\mathbb{E}[X\,|\,Y=1]\)
P = np.array([[0.05, 0.07, 0.02, 0.01],
[0.04, 0.1, 0.06, 0.03],
[0.08, 0.09, 0.07, 0.04],
[0.02, 0.03, 0.02, 0.01],
[0.09, 0.08, 0.04, 0.05]])
margin_x = P.sum(axis=1)
margin_y = P.sum(axis=0)
cond_x_y = (P / margin_y[np.newaxis, :]).T
cond_y_x = (P.T / margin_x[np.newaxis, :]).T
# Give RV values to states
vals_x = np.arange(P.shape[0]) + 1
vals_y = np.arange(P.shape[1]) + 1
print("E(X | Y = 1) =",
np.sum([vals_x[i]*cond_x_y[0,i]
for i in range(0,5)]))
E(X | Y = 1) = 3.2142857142857144
E_x_y = np.array([
np.sum([vals_x[i]*cond_x_y[j,i]
for i in range(0,5)])
for j in range(0,4)])
E_y_x = np.array([
np.sum([vals_y[j]*cond_y_x[i,j]
for j in range(0,4)])
for i in range(0,5)])
# Or use np broadcasting with *
E_x_y = np.sum(vals_x * cond_x_y, axis=1)
E_y_x = np.sum(vals_y * cond_y_x, axis=1)
print("E(X | Y = j) =", E_x_y)
print("E(Y | X = i) =", E_y_x)
print(f"E(X) = {vals_x @ margin_x:.3g},\
E(E(X|Y)) = {E_x_y @ margin_y}")
E(X | Y = j) = [3.214 2.865 3. 3.429]
E(Y | X = i) = [1.933 2.348 2.25 2.25 2.192]
E(X) = 3.07, E(E(X|Y)) = 3.0700000000000003
Comprehensions (e.g. [x[i, i+1] for i in range(5)]), loops, or a combination are usually clearer and often fast enough