Double / Debiased Machine Learning

ECON526

Paul Schrimpf

University of British Columbia

Introduction

\[ \def\indep{\perp\!\!\!\perp} \def\Er{\mathrm{E}} \def\R{\mathbb{R}} \def\En{{\mathbb{E}_n}} \def\Pr{\mathrm{P}} \newcommand{\norm}[1]{\left\Vert {#1} \right\Vert} \newcommand{\abs}[1]{\left\vert {#1} \right\vert} \def\inprob{{\,{\buildrel p \over \rightarrow}\,}} \def\indist{\,{\buildrel d \over \rightarrow}\,} \DeclareMathOperator*{\plim}{plim} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \]

Introduction

  • Many estimation tasks involve:
    • Low dimensional parameter of interest
    • High dimensional nuisance parameters needed to recover parameter of interest
  • Example: in matching
    • Interested in \(ATE = \Er[Y(1) - Y(0)]\)
    • Nuisance parameters \(\Er[Y|D,X]\) and/or \(P(D=1|X)\)

Machine Learning for Nuisance Parameters

  • In matching we said that we could use flexible machine learning estimators for \(\Er[Y|D,X]\) and \(P(D=1|X)\) and plug them into the doubly robust estimator

  • We will cover:

    • Some of the technical details behind this idea
    • How this idea can be applied to other estimators

Example: partially linear model

\[ y_i = \theta d_i + f(x_i) + \epsilon_i \]

  • Interested in \(\theta\)
  • Assume \(\Er[\epsilon|d,x] = 0\)
  • Nuisance parameter \(f()\)

Example: Matching

  • Binary treatment \(d_i \in \{0,1\}\)
  • Potential outcomes \(y_i(0), y_i(1)\), observe \(y_i = y_i(d_i)\)
  • Interested in average treatment effect : \(\theta = \Er[y_i(1) - y_i(0)]\)
  • Covariates \(x_i\)
  • Assume unconfoundedness : \(d_i \indep y_i(1), y_i(0) | x_i\)

Example: Matching

  • Estimable formulae for the ATE: \[ \begin{align*} \theta = & \Er\left[\frac{y_i d_i}{\Pr(d = 1 | x_i)} - \frac{y_i (1-d_i)}{1-\Pr(d=1|x_i)} \right] \\ \theta = & \Er\left[\Er[y_i | d_i = 1, x_i] - \Er[y_i | d_i = 0 , x_i]\right] \\ \theta = & \Er\left[ \begin{array}{l} d_i \frac{y_i - \Er[y_i | d_i = 1, x_i]}{\Pr(d=1|x_i)} - (1-d_i)\frac{y_i - \Er[y_i | d_i = 0, x_i]}{1-\Pr(d=1|x_i)} \\ + \Er[y_i | d_i = 1, x_i] - \Er[y_i | d_i = 0 , x_i]\end{array}\right] \end{align*} \]
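As a concrete illustration of the third (doubly robust) formula, here is a minimal cross-fitted sketch on simulated data. The data-generating process, the learners, and the fold count are illustrative choices, not part of the lecture.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=(n, 3))
p = 1 / (1 + np.exp(-x[:, 0]))                      # true P(d=1|x)
d = rng.binomial(1, p)
y = 1.0 * d + x @ np.array([0.5, -0.25, 0.0]) + rng.normal(size=n)  # true ATE = 1

score = np.zeros(n)
for train, test in KFold(5, shuffle=True, random_state=0).split(x):
    # nuisance estimates fit only on the complement fold
    ps = LogisticRegression().fit(x[train], d[train]).predict_proba(x[test])[:, 1]
    mu1 = LinearRegression().fit(x[train][d[train] == 1], y[train][d[train] == 1]).predict(x[test])
    mu0 = LinearRegression().fit(x[train][d[train] == 0], y[train][d[train] == 0]).predict(x[test])
    dt, yt = d[test], y[test]
    # doubly robust score from the third formula above
    score[test] = dt * (yt - mu1) / ps - (1 - dt) * (yt - mu0) / (1 - ps) + mu1 - mu0

print(f"ATE estimate: {score.mean():.3f}, std. error: {score.std()/np.sqrt(n):.3f}")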

Example: IV

\[ \begin{align*} y_i = & \theta d_i + f(x_i) + \epsilon_i \\ d_i = & g(x_i, z_i) + u_i \end{align*} \]

  • Interested in \(\theta\)
  • Assume \(\Er[\epsilon|x,z] = 0\), \(\Er[u|x,z]=0\)
  • Nuisance parameters \(f()\), \(g()\)

Example: LATE

  • Binary instrument \(z_i \in \{0,1\}\)
  • Potential treatments \(d_i(0), d_i(1) \in \{0,1\}\), \(d_i = d_i(z_i)\)
  • Potential outcomes \(y_i(0), y_i(1)\), observe \(y_i = y_i(d_i)\)
  • Covariates \(x_i\)
  • \((y_i(1), y_i(0), d_i(1), d_i(0)) \indep z_i | x_i\)
  • Local average treatment effect: \[ \begin{align*} \theta = & \Er\left[\Er[y_i(1) - y_i(0) | x, d_i(1) > d_i(0)]\right] \\ = & \Er\left[\frac{\Er[y|z=1,x] - \Er[y|z=0,x]} {\Er[d|z=1,x]-\Er[d|z=0,x]} \right] \end{align*} \]

Theory

General setup

  • Parameter of interest \(\theta \in \R^{d_\theta}\)

  • Nuisance parameter \(\eta \in T\)

  • Moment conditions \[ \Er[\psi(W;\theta_0,\eta_0) ] = 0 \in \R^{d_\theta} \] with \(\psi\) known

  • Estimate \(\hat{\eta}\) using some machine learning method

  • Estimate \(\hat{\theta}\) from \[ \En[\psi(w_i;\hat{\theta},\hat{\eta}) ] = 0 \]

Example: partially linear model

\[ y_i = \theta_0 d_i + f_0(x_i) + \epsilon_i \]

  • Compare the estimates from

    1. \(\En[d_i(y_i - \tilde{\theta} d_i - \hat{f}(x_i)) ] = 0\)

    and

    2. \(\En[(d_i - \hat{m}(x_i))(y_i - \hat{\mu}(x_i) - \theta (d_i - \hat{m}(x_i)))] = 0\)

    where \(m(x) = \Er[d|x]\) and \(\mu(x) = \Er[y|x]\)
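The following simulation sketch compares the two moment conditions. The sparse high-dimensional design and the use of lasso for the nuisance estimates are illustrative assumptions; the point is that regularization error in \(\hat{f}\) contaminates the first estimator but, because of orthogonality, affects the second much less.

import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n, p, theta0 = 500, 200, 1.0
x = rng.normal(size=(n, p))
gamma = np.zeros(p)
gamma[:5] = 1.0                                     # sparse confounding coefficients
d = x @ gamma + rng.normal(size=n)
y = theta0 * d + x @ gamma + rng.normal(size=n)

# 1. plug-in: lasso of y on (d, x), use its x-part as fhat, then solve the first moment
lasso_joint = LassoCV(cv=5).fit(np.column_stack([d, x]), y)
fhat = x @ lasso_joint.coef_[1:]
theta_plugin = d @ (y - fhat) / (d @ d)

# 2. orthogonalized: cross-fitted estimates of E[y|x] and E[d|x], then partial out
muhat = cross_val_predict(LassoCV(cv=5), x, y, cv=5)
mhat = cross_val_predict(LassoCV(cv=5), x, d, cv=5)
theta_dml = ((d - mhat) @ (y - muhat)) / ((d - mhat) @ (d - mhat))

print(f"plug-in: {theta_plugin:.3f}, orthogonalized: {theta_dml:.3f}, truth: {theta0}")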

Lessons from the example

  • Need an extra condition on the moments, Neyman orthogonality: \[ \partial_\eta \Er[\psi(W;\theta_0,\eta_0)][\eta-\eta_0] = 0 \]

  • Want the nuisance estimator to converge at rate \(n^{-1/4}\) or faster in the prediction norm, \[ \sqrt{\En[(\hat{\eta}(x_i) - \eta(x_i))^2]} \lesssim_P n^{-1/4} \]

  • Also want estimators that satisfy something like \[ \sqrt{n} \En[(\eta(x_i)-\hat{\eta}(x_i))\epsilon_i] = o_p(1) \]

    • Sample splitting / cross-fitting will make this easier

Cross-fitting

  • Randomly partition into \(K\) subsets \((I_k)_{k=1}^K\)
  • \(I^c_k = \{1, ..., n\} \setminus I_k\)
  • \(\hat{\eta}_k =\) estimate of \(\eta\) using \(I^c_k\)
  • Estimator: \[ \begin{align*} 0 = & \frac{1}{K} \sum_{k=1}^K \frac{K}{n} \sum_{i \in I_k} \psi(w_i;\hat{\theta}^{DML},\hat{\eta}_k) \\ 0 = & \frac{1}{K} \sum_{k=1}^K \En_k[ \psi(w_i;\hat{\theta}^{DML},\hat{\eta}_k)] \end{align*} \]
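A schematic implementation of this cross-fitting estimator for the case where the moment is linear in \(\theta\) (assumption 4 below). The names fit_eta, psi_a, and psi_b are placeholders that the user supplies; nothing here is tied to a particular model.

import numpy as np
from sklearn.model_selection import KFold

def dml_crossfit(data, fit_eta, psi_a, psi_b, K=5, random_state=0):
    """Cross-fitted estimate of theta solving
    0 = (1/K) sum_k E_{n,k}[ psi_a(w; eta_k) theta + psi_b(w; eta_k) ].
    fit_eta(data[train]) returns a nuisance estimate from the complement I_k^c;
    psi_a(data[test], eta) and psi_b(data[test], eta) return per-observation arrays."""
    a_bar, b_bar = 0.0, 0.0
    for train, test in KFold(n_splits=K, shuffle=True, random_state=random_state).split(data):
        eta_k = fit_eta(data[train])            # nuisance fit on I_k^c only
        a_bar += psi_a(data[test], eta_k).mean() / K
        b_bar += psi_b(data[test], eta_k).mean() / K
    return -b_bar / a_bar                       # solves the stacked, averaged moment

For the partially linear model, \(\psi^a(w;\eta) = -(d - m(x))^2\) and \(\psi^b(w;\eta) = (d - m(x))(y - \mu(x))\), so the returned value is the cross-fitted partialling-out estimate.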

Assumptions

  1. Neyman Orthogonality \[ \partial_\eta \Er[\psi(W;\theta_0,\eta_0)][\eta-\eta_0] \approx 0 \]
  2. Fast enough convergence of \(\hat{\eta}\) \[ \sqrt{\Er[(\hat{\eta}_k(x_i) - \eta(x_i))^2|I_k^c]} \lesssim_P n^{-1/4} \]
  3. Various moments exist and other regularity conditions
  4. Moment condition linear in \(\theta\) (to simplify notation only) \[ \psi(w;\theta,\eta) = \psi^a(w;\eta) \theta + \psi^b(w;\eta) \]


Asymptotic Normality

\[ \sqrt{n} \sigma^{-1} (\hat{\theta} - \theta_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \bar{\psi}(w_i) + o_p(1) \indist N(0,1) \]

  • Variance: \[\sigma^2 = \Er[\psi^a(w;\eta_0)]^{-1} \Er\left[ \psi(w;\theta_0,\eta_0) \psi(w;\theta_0,\eta_0)'\right] \Er[\psi^a(w;\eta_0)]^{-1} \]
  • Influence function: \[\bar{\psi}(w) = -\sigma^{-1} \Er[\psi^a(w;\eta_0)]^{-1} \psi(w;\theta_0,\eta_0)\]
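A quick sketch of the implied plug-in standard error for a scalar \(\theta\). It assumes arrays of cross-fitted, per-observation evaluations of \(\psi^a\) and \(\psi\) are already available; the array names are hypothetical.

import numpy as np

def dml_se(psi_a_vals, psi_vals):
    """Plug-in standard error for scalar theta:
    sigma^2 = J^{-1} E[psi^2] J^{-1} with J = E[psi^a]."""
    n = len(psi_vals)
    J = psi_a_vals.mean()
    sigma2 = np.mean(psi_vals**2) / J**2
    return np.sqrt(sigma2 / n)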

Creating Orthogonal Moments

Creating orthogonal moments

  • Need \[ \partial_\eta \Er\left[\psi(W;\theta_0,\eta_0)\right][\eta-\eta_0] \approx 0 \]

  • Given some model, how do we find a suitable \(\psi\)?

  • Related to finding the efficient influence function, see Hines et al. (2022)

Orthogonal scores via concentrating-out

  • Original model: \[ (\theta_0, \beta_0) = \argmax_{\theta, \beta} \Er[\ell(W;\theta,\beta)] \]
  • Define \[ \eta(\theta) = \beta(\theta) = \argmax_\beta \Er[\ell(W;\theta,\beta)] \]
  • First order condition from \(\max_\theta \Er[\ell(W;\theta,\beta(\theta))]\) is \[ 0 = \Er\left[ \underbrace{\frac{\partial \ell}{\partial \theta} + \frac{\partial \ell}{\partial \beta} \frac{d \beta}{d \theta}}_{\psi(W;\theta,\beta(\theta))} \right] \]
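  • For example (a quick sketch), applying this to the partially linear model with \(\ell(W;\theta,f) = -\tfrac{1}{2}(y - \theta d - f(x))^2\): concentrating out \(f\) gives \(f_\theta(x) = \Er[y|x] - \theta \Er[d|x] = \mu(x) - \theta m(x)\), so \(\frac{d f_\theta}{d\theta}(x) = -m(x)\) and \[ \psi(W;\theta,f_\theta) = (y - \theta d - f_\theta(x))(d - m(x)) = \left((y - \mu(x)) - \theta(d - m(x))\right)(d - m(x)), \] which is exactly the orthogonal moment condition from the partially linear model example above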

Example: average derivative

  • \(x,y \in \R^1\), \(\Er[y|x] = f_0(x)\), \(p(x) =\) density of \(x\)

  • \(\theta_0 = \Er[f_0'(x)]\)

  • Joint objective \[ \min_{\theta,f} \Er\left[ (y - f(x))^2 + (\theta - f'(x))^2 \right] \]

  • Solving for the minimizing \(f\) given \(\theta\): \[ f_\theta(x) = \Er[y|x] - \theta \partial_x \log p(x) + f_\theta''(x) + f_\theta'(x) \partial_x \log p(x) \]

Example: average derivative

  • Concentrated objective: \[ \min_\theta \Er\left[ (y - f_\theta(x))^2 + (\theta - f_\theta'(x))^2 \right] \]

  • First order condition at \(f_\theta = f_0\) gives \[ 0 = \Er\left[ (y - f_0(x))\partial_x \log p(x) + (\theta - f_0'(x)) \right] \]
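A minimal numerical check of this orthogonal score on simulated data. To keep the sketch short, it assumes \(x \sim N(0,1)\) so that \(\partial_x \log p(x) = -x\) is known, and it estimates \(f\) with a cross-fitted polynomial regression; these are illustrative choices, not part of the lecture.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)                        # standard normal, so d/dx log p(x) = -x
y = np.sin(x) + 0.5 * rng.normal(size=n)      # f0(x) = sin(x), theta0 = E[cos(x)]

score = np.zeros(n)
for train, test in KFold(5, shuffle=True, random_state=0).split(x):
    fhat = make_pipeline(PolynomialFeatures(5), LinearRegression())
    fhat.fit(x[train, None], y[train])
    f = fhat.predict(x[test, None])
    eps = 1e-4                                # numerical derivative of the fitted f
    fprime = (fhat.predict(x[test, None] + eps) - fhat.predict(x[test, None] - eps)) / (2 * eps)
    # orthogonal score: theta = E[ f'(x) - (y - f(x)) d/dx log p(x) ]
    score[test] = fprime - (y[test] - f) * (-x[test])

print(f"estimate: {score.mean():.3f}, std. error: {score.std()/np.sqrt(n):.3f}, "
      f"sample E[cos(x)]: {np.mean(np.cos(x)):.3f}")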

Orthogonal scores via Two Other Methods

  • “Orthogonal” suggests that ideas from linear algebra will be useful, and they are

  • Projection: take the projection of the moments that is orthogonal to \(\eta_0\)

  • Riesz representer

Example: Gender Wage Gap

Data

imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.model_selection import cross_val_predict
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
import statsmodels as sm
import statsmodels.formula.api as smf
from statsmodels.iolib.summary2 import summary_col
cpsall = pd.read_stata("https://www.nber.org/morg/annual/morg20.dta")
# take subset of data just to reduce computation time
cps = cpsall.sample(30000, replace=False, random_state=0)
cps.describe()
(Output of cps.describe(): summary statistics for the 30,000-observation sample, 8 rows × 55 columns)

Partial Linear Model

def partial_linear(y, d, X, yestimator, destimator, folds=3):
    """Estimate the partially linear model y = d*C + f(x) + e

    Parameters
    ----------
    y : array_like
        vector of outcomes
    d : array_like
        vector or matrix of regressors of interest
    X : array_like
        matrix of controls
    yestimator : Estimator object for partialling X out of y. Must have ‘fit’
        and ‘predict’ (or ‘predict_proba’) methods.
    destimator : Estimator object for partialling X out of d. Must have ‘fit’
        and ‘predict’ (or ‘predict_proba’) methods.
    folds : int
        Number of folds for cross-fitting

    Returns
    -------
    ols : statsmodels regression results containing estimate of coefficient on d.
    yhat : cross-fitted predictions of y
    dhat : cross-fitted predictions of d
    """

    # we want predicted probabilities if y or d is discrete
    ymethod = "predict_proba" if hasattr(yestimator, "predict_proba") else "predict"
    dmethod = "predict_proba" if hasattr(destimator, "predict_proba") else "predict"
    # get the predictions
    yhat = cross_val_predict(yestimator,X,y,cv=folds,method=ymethod)
    dhat = cross_val_predict(destimator,X,d,cv=folds,method=dmethod)
    ey = np.array(y - yhat)
    ed = np.array(d - dhat)
    ols = sm.regression.linear_model.OLS(ey,ed).fit(cov_type='HC0')

    return(ols, yhat, dhat)

Unconditional Gap

Code
cps["female"] = (cps.sex==2)
cps["log_earn"] = np.log(cps.earnwke)
cps.loc[np.isinf(cps.log_earn), "log_earn"] = np.nan
cps["log_uhours"] = np.log(cps.uhourse)
cps.loc[np.isinf(cps.log_uhours), "log_uhours"] = np.nan
cps["log_hourslw"] = np.log(cps.hourslw)
cps.loc[np.isinf(cps.log_hourslw),"log_hourslw"] = np.nan
cps["log_wageu"] = cps.log_earn - cps.log_uhours
cps["log_wagelw"] = cps.log_earn - cps.log_hourslw

lm = list()
lm.append(smf.ols(formula="log_earn ~ female", data=cps,
                  missing="drop").fit(cov_type='HC0'))
lm.append( smf.ols(formula="log_wageu ~ female", data=cps,
                   missing="drop").fit(cov_type='HC0'))
lm.append(smf.ols(formula="log_wagelw ~ female", data=cps,
                  missing="drop").fit(cov_type='HC0'))
lm.append(smf.ols(formula="log_earn ~ female + log_hourslw + log_uhours", data=cps,
                  missing="drop").fit(cov_type='HC0'))
summary_col(lm, stars=True)
                 log_earn I  log_wageu I  log_wagelw I  log_earn II
Intercept        6.8607***   3.2022***    3.2366***     2.0104***
                 (0.0091)    (0.0080)     (0.0088)      (0.0945)
female[T.True]   -0.2965***  -0.1729***   -0.1746***    -0.1363***
                 (0.0133)    (0.0112)     (0.0124)      (0.0115)
log_hourslw                                             0.0227
                                                        (0.0223)
log_uhours                                              1.3025***
                                                        (0.0372)
R-squared        0.0323      0.0166       0.0136        0.3367
R-squared Adj.   0.0323      0.0165       0.0135        0.3366

Standard errors in parentheses.
* p<.1, ** p<.05, *** p<.01

Adding Controls

from patsy import dmatrices
fmla = "log_earn + female ~ log_uhours + log_hourslw + I(age**2) + age + C(grade92) + C(race) + C(smsastat) + C(unionmme) + C(unioncov)"  # C(ind02) + C(occ2012)
yd, X = dmatrices(fmla,cps)
female = yd[:,1]
logearn = yd[:,2];

Estimating \(\eta\)

alphas = np.exp(np.linspace(-2, -12, 20))
lassoy = linear_model.LassoCV(cv=4, alphas=alphas, max_iter=5000).fit(X,logearn)
lassod = linear_model.LassoCV(cv=4, alphas=alphas, max_iter=5000).fit(X,female);

Estimating \(\eta\)

Code
fig, ax = plt.subplots(1,2)

def plotlassocv(l, ax) :
    alphas = l.alphas_
    mse = l.mse_path_.mean(axis=1)
    std_error = l.mse_path_.std(axis=1)
    ax.plot(alphas,mse)
    ax.fill_between(alphas, mse + std_error, mse - std_error, alpha=0.2)

    ax.set_ylabel('MSE +/- std error')
    ax.set_xlabel('alpha')
    ax.set_xlim([alphas[0], alphas[-1]])
    ax.set_xscale("log")
    return(ax)

ax[0] = plotlassocv(lassoy,ax[0])
ax[0].set_title("MSE for log(earn)")
ax[1] = plotlassocv(lassod,ax[1])
ax[1].set_title("MSE for female")
fig.tight_layout()
fig.show()

Estimating \(\eta\)

def pickalpha(lassocv) :
    #imin = np.argmin(lassocv.mse_path_.mean(axis=1))
    #msemin = lassocv.mse_path_.mean(axis=1)[imin]
    #se = lassocv.mse_path_.std(axis=1)[imin]
    #alpha= min([alpha for (alpha, mse) in zip(lassocv.alphas_, lassocv.mse_path_.mean(axis=1)) if mse<msemin+se])
    alpha = lassocv.alpha_
    return(alpha)

alphay = pickalpha(lassoy)
alphad = pickalpha(lassod)

Estimate \(\theta\)

pl_lasso = partial_linear(logearn, female, X,
                          linear_model.Lasso(alpha=alphay),
                          linear_model.Lasso(alpha=alphad))
pl_lasso[0].summary(slim=True)
OLS Regression Results
Dep. Variable:    y       No. Observations:   12038
Model:            OLS     Covariance Type:    HC0

        coef    std err        z      P>|z|    [0.025    0.975]
x1   -0.1868      0.011  -16.643      0.000    -0.209    -0.165


Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors are heteroscedasticity robust (HC0)

Examining Predictions

Code
import seaborn as sns
# Visualize predictions
def preddf(pl):
    df = pd.DataFrame({"logearn":logearn,
                       "predicted":pl[1],
                       "female":female,
                       "P(female|x)":pl[2]})
    return(df)

def plotpredictions(df) :
    fig, ax = plt.subplots(2,1)
    sns.scatterplot(x = df.predicted, y = df.logearn-df.predicted, hue=df.female, ax=ax[0])
    ax[0].set_title("Prediction Errors for Log Earnings")

    sns.histplot(df["P(female|x)"][df.female==0], kde = False,
                 label = "Male", ax=ax[1])
    sns.histplot(df["P(female|x)"][df.female==1], kde = False,
                 label = "Female", ax=ax[1])
    ax[1].set_title('P(female|x)')
    return(fig)

fig=plotpredictions(preddf(pl_lasso))
fig.show()


Software

Using doubleml

Code
import doubleml
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

dml_data = doubleml.DoubleMLData.from_arrays(X,logearn,female)
# Initialize learners
Cs = 0.0001*np.logspace(0, 4, 10)
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=10000))
lasso_class = make_pipeline(StandardScaler(),
                            LogisticRegressionCV(cv=5, penalty='l1', solver='liblinear',
                                                 Cs = Cs, max_iter=1000))

np.random.seed(123)
dml_plr_lasso = doubleml.DoubleMLPLR(dml_data,
                                     ml_l = lasso,
                                     ml_m = lasso_class,
                                     n_folds = 3)
dml_plr_lasso.fit(store_predictions=True)
print(dml_plr_lasso.summary)
       coef   std err          t         P>|t|     2.5 %   97.5 %
d -0.181669  0.011178 -16.252139  2.157063e-59 -0.203578 -0.15976

Visualizing Predictions: Lasso & Logistic Lasso

Code
def dmlpreddf(dml_model):
    df=pd.DataFrame({"logearn":logearn,
                     "predicted":dml_model.predictions['ml_l'].flatten(),
                     "female":female,
                     "P(female|x)":dml_model.predictions['ml_m'].flatten()})
    return(df)

plotpredictions(dmlpreddf(dml_plr_lasso)).show()


With Gradient Boosted Trees

Code
from lightgbm import LGBMClassifier, LGBMRegressor
ylearner = LGBMRegressor(force_col_wise=True)
dlearner = LGBMClassifier(force_col_wise=True)
dml_plr_gbt = doubleml.DoubleMLPLR(dml_data, ylearner, dlearner)
dml_plr_gbt.fit(store_predictions=True)
print(dml_plr_gbt.summary)
(LightGBM training log messages omitted)
       coef   std err          t         P>|t|     2.5 %    97.5 %
d -0.163842  0.011051 -14.826095  9.934713e-50 -0.185501 -0.142182

Visualizing Predictions: Trees

Code
plotpredictions(dmlpreddf(dml_plr_gbt)).show()


Interactive Regression Model

  • Similar to matching, the partially linear regression model can suffer from misspecification bias if the effect of \(D\) varies with \(X\)
  • Interactive regression model: \[ \begin{align*} Y & = g_0(D,X) + U \\ D & = m_0(X) + V \end{align*} \]
  • The mechanics are the same as for matching with heterogeneous effects
  • The orthogonal moment condition is the same as for doubly robust matching (shown below)
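  • Concretely, writing \(g(d,x)\) for \(\Er[Y|D=d,X=x]\) and \(m(x)\) for \(\Pr(D=1|X=x)\), the orthogonal score is the doubly robust one from the matching slides: \[ \psi(W;\theta,\eta) = g(1,X) - g(0,X) + \frac{D(Y - g(1,X))}{m(X)} - \frac{(1-D)(Y - g(0,X))}{1-m(X)} - \theta \]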

Interactive Regression Model - Lasso

Code
np.random.seed(123)
dml_irm_lasso = doubleml.DoubleMLIRM(dml_data,
                                     ml_g = lasso,
                                     ml_m = lasso_class,
                                     trimming_threshold = 0.01,
                                     n_folds = 3)
dml_irm_lasso.fit()
dml_irm_lasso.summary
coef std err t P>|t| 2.5 % 97.5 %
d -0.229206 0.03706 -6.184647 6.224143e-10 -0.301843 -0.156569

Interactive Regression Model - Trees

Code
np.random.seed(123)
dml_irm_gbt = doubleml.DoubleMLIRM(dml_data,
                                     ml_g = ylearner,
                                     ml_m = dlearner,
                                     trimming_threshold = 0.01,
                                     n_folds = 3)
dml_irm_gbt.fit()
dml_irm_gbt.summary
(LightGBM training log messages omitted)
coef std err t P>|t| 2.5 % 97.5 %
d -0.159562 0.013846 -11.524304 9.951303e-31 -0.1867 -0.132425

Sources and Further Reading

References

Belloni, Alexandre, Victor Chernozhukov, and Christian Hansen. 2014. “Inference on Treatment Effects After Selection Among High-Dimensional Controls.” The Review of Economic Studies 81 (2): 608–50. https://doi.org/10.1093/restud/rdt044.
Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, and Whitney Newey. 2017. “Double/Debiased/Neyman Machine Learning of Treatment Effects.” American Economic Review 107 (5): 261–65. https://doi.org/10.1257/aer.p20171038.
Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. “Double/Debiased Machine Learning for Treatment and Structural Parameters.” The Econometrics Journal 21 (1): C1–68. https://doi.org/10.1111/ectj.12097.
Chernozhukov, Victor, Christian Hansen, and Martin Spindler. 2015. “Valid Post-Selection and Post-Regularization Inference: An Elementary, General Approach.” Annual Review of Economics 7 (1): 649–88. https://doi.org/10.1146/annurev-economics-012315-015826.
Knaus, Michael C. 2022. “Double Machine Learning-Based Programme Evaluation Under Unconfoundedness.” The Econometrics Journal 25 (3): 602–27. https://doi.org/10.1093/ectj/utac015.
Hines, Oliver, Karla Diaz-Ordaz, Oliver Dukes, and Stijn Vansteelandt. 2022. “Demystifying Statistical Learning Based on Efficient Influence Functions.” The American Statistician 76 (3): 292–304. https://doi.org/10.1080/00031305.2021.2021984.