Double / Debiased Machine Learning

ECON526

Paul Schrimpf

University of British Columbia

Introduction

\[ \def\indep{\perp\!\!\!\perp} \def\Er{\mathrm{E}} \def\R{\mathbb{R}} \def\En{{\mathbb{E}_n}} \def\Pr{\mathrm{P}} \newcommand{\norm}[1]{\left\Vert {#1} \right\Vert} \newcommand{\abs}[1]{\left\vert {#1} \right\vert} \def\inprob{{\,{\buildrel p \over \rightarrow}\,}} \def\indist{\,{\buildrel d \over \rightarrow}\,} \DeclareMathOperator*{\plim}{plim} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \]

Introduction

  • Many estimation tasks involve:
    • Low dimensional parameter of interest
    • High dimensional nuisance parameters needed to recover parameter of interest
  • Example: in matching
    • Interested in \(ATE = \Er[Y(1) - Y(0)]\)
    • Nuisance parameters \(\Er[Y|D,X]\) and/or \(P(D=1|X)\)

Machine Learning for Nuisance Parameters

  • In matching we said that we could use flexible machine learning estimators for \(\Er[Y|D,X]\) and \(P(D=1|X)\) and plug them into the doubly robust estimator

  • We will cover:

    • Some of the technical details behind this idea
    • How this idea can be applied to other estimators

Example: partially linear model

\[ y_i = \theta d_i + f(x_i) + \epsilon_i \]

  • Interested in \(\theta\)
  • Assume \(\Er[\epsilon|d,x] = 0\)
  • Nuisance parameter \(f()\)

Example: Matching

  • Binary treatment \(d_i \in \{0,1\}\)
  • Potential outcomes \(y_i(0), y_i(1)\), observe \(y_i = y_i(d_i)\)
  • Interested in average treatment effect : \(\theta = \Er[y_i(1) - y_i(0)]\)
  • Covariates \(x_i\)
  • Assume unconfoundedness : \(d_i \indep y_i(1), y_i(0) | x_i\)

Example: Matching

  • Estimable formulae for the ATE: \[ \begin{align*} \theta = & \Er\left[\frac{y_i d_i}{\Pr(d = 1 | x_i)} - \frac{y_i (1-d_i)}{1-\Pr(d=1|x_i)} \right] \\ \theta = & \Er\left[\Er[y_i | d_i = 1, x_i] - \Er[y_i | d_i = 0 , x_i]\right] \\ \theta = & \Er\left[ \begin{array}{l} d_i \frac{y_i - \Er[y_i | d_i = 1, x_i]}{\Pr(d=1|x_i)} - (1-d_i)\frac{y_i - \Er[y_i | d_i = 0, x_i]}{1-\Pr(d=1|x_i)} \\ + \Er[y_i | d_i = 1, x_i] - \Er[y_i | d_i = 0 , x_i]\end{array}\right] \end{align*} \]
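As a concrete illustration of the third (doubly robust) formula, here is a minimal cross-fitted sketch on simulated data. The data-generating process, the learners, and the fold count are illustrative choices, not part of the lecture.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=(n, 3))
p = 1 / (1 + np.exp(-x[:, 0]))                      # true P(d=1|x)
d = rng.binomial(1, p)
y = 1.0 * d + x @ np.array([0.5, -0.25, 0.0]) + rng.normal(size=n)  # true ATE = 1

score = np.zeros(n)
for train, test in KFold(5, shuffle=True, random_state=0).split(x):
    # nuisance estimates fit only on the complement fold
    ps = LogisticRegression().fit(x[train], d[train]).predict_proba(x[test])[:, 1]
    mu1 = LinearRegression().fit(x[train][d[train] == 1], y[train][d[train] == 1]).predict(x[test])
    mu0 = LinearRegression().fit(x[train][d[train] == 0], y[train][d[train] == 0]).predict(x[test])
    dt, yt = d[test], y[test]
    # doubly robust score from the third formula above
    score[test] = dt * (yt - mu1) / ps - (1 - dt) * (yt - mu0) / (1 - ps) + mu1 - mu0

print(f"ATE estimate: {score.mean():.3f}, std. error: {score.std()/np.sqrt(n):.3f}")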

Example: IV

\[ \begin{align*} y_i = & \theta d_i + f(x_i) + \epsilon_i \\ d_i = & g(x_i, z_i) + u_i \end{align*} \]

  • Interested in \(\theta\)
  • Assume \(\Er[\epsilon|x,z] = 0\), \(\Er[u|x,z]=0\)
  • Nuisance parameters \(f()\), \(g()\)

Example: LATE

  • Binary instrument \(z_i \in \{0,1\}\)
  • Potential treatments \(d_i(0), d_i(1) \in \{0,1\}\), \(d_i = d_i(z_i)\)
  • Potential outcomes \(y_i(0), y_i(1)\), observe \(y_i = y_i(d_i)\)
  • Covariates \(x_i\)
  • \((y_i(1), y_i(0), d_i(1), d_i(0)) \indep z_i | x_i\)
  • Local average treatment effect: \[ \begin{align*} \theta = & \Er\left[\Er[y_i(1) - y_i(0) | x, d_i(1) > d_i(0)]\right] \\ = & \Er\left[\frac{\Er[y|z=1,x] - \Er[y|z=0,x]} {\Er[d|z=1,x]-\Er[d|z=0,x]} \right] \end{align*} \]

Theory

General setup

  • Parameter of interest \(\theta \in \R^{d_\theta}\)

  • Nuisance parameter \(\eta \in T\)

  • Moment conditions \[ \Er[\psi(W;\theta_0,\eta_0) ] = 0 \in \R^{d_\theta} \] with \(\psi\) known

  • Estimate \(\hat{\eta}\) using some machine learning method

  • Estimate \(\hat{\theta}\) from \[ \En[\psi(w_i;\hat{\theta},\hat{\eta}) ] = 0 \]

Example: partially linear model

\[ y_i = \theta_0 d_i + f_0(x_i) + \epsilon_i \]

  • Compare the estimates from

    1. \(\En[d_i(y_i - \tilde{\theta} d_i - \hat{f}(x_i)) ] = 0\)

    and

    2. \(\En[(d_i - \hat{m}(x_i))(y_i - \hat{\mu}(x_i) - \theta (d_i - \hat{m}(x_i)))] = 0\)

    where \(m(x) = \Er[d|x]\) and \(\mu(x) = \Er[y|x]\)
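The following simulation sketch compares the two moment conditions. The sparse high-dimensional design and the use of lasso for the nuisance estimates are illustrative assumptions; the point is that regularization error in \(\hat{f}\) contaminates the first estimator but, because of orthogonality, affects the second much less.

import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n, p, theta0 = 500, 200, 1.0
x = rng.normal(size=(n, p))
gamma = np.zeros(p)
gamma[:5] = 1.0                                     # sparse confounding coefficients
d = x @ gamma + rng.normal(size=n)
y = theta0 * d + x @ gamma + rng.normal(size=n)

# 1. plug-in: lasso of y on (d, x), use its x-part as fhat, then solve the first moment
lasso_joint = LassoCV(cv=5).fit(np.column_stack([d, x]), y)
fhat = x @ lasso_joint.coef_[1:]
theta_plugin = d @ (y - fhat) / (d @ d)

# 2. orthogonalized: cross-fitted estimates of E[y|x] and E[d|x], then partial out
muhat = cross_val_predict(LassoCV(cv=5), x, y, cv=5)
mhat = cross_val_predict(LassoCV(cv=5), x, d, cv=5)
theta_dml = ((d - mhat) @ (y - muhat)) / ((d - mhat) @ (d - mhat))

print(f"plug-in: {theta_plugin:.3f}, orthogonalized: {theta_dml:.3f}, truth: {theta0}")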

Lessons from the example

  • Need an extra condition on the moments, Neyman orthogonality: \[ \partial_\eta \Er[\psi(W;\theta_0,\eta_0)][\eta-\eta_0] = 0 \]

  • Want the nuisance estimator to converge at rate \(n^{-1/4}\) or faster in the prediction norm, \[ \sqrt{\En[(\hat{\eta}(x_i) - \eta(x_i))^2]} \lesssim_P n^{-1/4} \]

  • Also want estimators that satisfy something like \[ \sqrt{n} \En[(\eta(x_i)-\hat{\eta}(x_i))\epsilon_i] = o_p(1) \]

    • Sample splitting / cross-fitting will make this easier

Cross-fitting

  • Randomly partition into \(K\) subsets \((I_k)_{k=1}^K\)
  • \(I^c_k = \{1, ..., n\} \setminus I_k\)
  • \(\hat{\eta}_k =\) estimate of \(\eta\) using \(I^c_k\)
  • Estimator: \[ \begin{align*} 0 = & \frac{1}{K} \sum_{k=1}^K \frac{K}{n} \sum_{i \in I_k} \psi(w_i;\hat{\theta}^{DML},\hat{\eta}_k) \\ 0 = & \frac{1}{K} \sum_{k=1}^K \En_k[ \psi(w_i;\hat{\theta}^{DML},\hat{\eta}_k)] \end{align*} \]
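A schematic implementation of this cross-fitting estimator for the case where the moment is linear in \(\theta\) (assumption 4 below). The names fit_eta, psi_a, and psi_b are placeholders that the user supplies; nothing here is tied to a particular model.

import numpy as np
from sklearn.model_selection import KFold

def dml_crossfit(data, fit_eta, psi_a, psi_b, K=5, random_state=0):
    """Cross-fitted estimate of theta solving
    0 = (1/K) sum_k E_{n,k}[ psi_a(w; eta_k) theta + psi_b(w; eta_k) ].
    fit_eta(data[train]) returns a nuisance estimate from the complement I_k^c;
    psi_a(data[test], eta) and psi_b(data[test], eta) return per-observation arrays."""
    a_bar, b_bar = 0.0, 0.0
    for train, test in KFold(n_splits=K, shuffle=True, random_state=random_state).split(data):
        eta_k = fit_eta(data[train])            # nuisance fit on I_k^c only
        a_bar += psi_a(data[test], eta_k).mean() / K
        b_bar += psi_b(data[test], eta_k).mean() / K
    return -b_bar / a_bar                       # solves the stacked, averaged moment

For the partially linear model, \(\psi^a(w;\eta) = -(d - m(x))^2\) and \(\psi^b(w;\eta) = (d - m(x))(y - \mu(x))\), so the returned value is the cross-fitted partialling-out estimate.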

Assumptions

  1. Neyman Orthogonality \[ \partial_\eta \Er[\psi(W;\theta_0,\eta_0)][\eta-\eta_0] \approx 0 \]
  2. Fast enough convergence of \(\hat{\eta}\) \[ \sqrt{\Er[(\hat{\eta}_k(x_i) - \eta(x_i))^2|I_k^c]} \lesssim_P n^{-1/4} \]
  3. Various moments exist and other regularity conditions
  4. Moment condition linear in \(\theta\) (to simplify notation only) \[ \psi(w;\theta,\eta) = \psi^a(w;\eta) \theta + \psi^b(w;\eta) \]


Asymptotic Normality

\[ \sqrt{n} \sigma^{-1} (\hat{\theta} - \theta_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \bar{\psi}(w_i) + o_p(1) \indist N(0,1) \]

  • Variance: \[\sigma^2 = \Er[\psi^a(w;\eta_0)]^{-1} \Er\left[ \psi(w;\theta_0,\eta_0) \psi(w;\theta_0,\eta_0)'\right] \Er[\psi^a(w;\eta_0)]^{-1} \]
  • Influence function: \[\bar{\psi}(w) = -\sigma^{-1} \Er[\psi^a(w;\eta_0)]^{-1} \psi(w;\theta_0,\eta_0)\]
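A quick sketch of the implied plug-in standard error for a scalar \(\theta\). It assumes arrays of cross-fitted, per-observation evaluations of \(\psi^a\) and \(\psi\) are already available; the array names are hypothetical.

import numpy as np

def dml_se(psi_a_vals, psi_vals):
    """Plug-in standard error for scalar theta:
    sigma^2 = J^{-1} E[psi^2] J^{-1} with J = E[psi^a]."""
    n = len(psi_vals)
    J = psi_a_vals.mean()
    sigma2 = np.mean(psi_vals**2) / J**2
    return np.sqrt(sigma2 / n)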

Creating Orthogonal Moments

Creating orthogonal moments

  • Need \[ \partial_\eta \Er\left[\psi(W;\theta_0,\eta_0)\right][\eta-\eta_0] \approx 0 \]

  • Given some model, how do we find a suitable \(\psi\)?

  • Related to finding the efficient influence function, see Hines et al. (2022)

Orthogonal scores via concentrating-out

  • Original model: \[ (\theta_0, \beta_0) = \argmax_{\theta, \beta} \Er[\ell(W;\theta,\beta)] \]
  • Define \[ \eta(\theta) = \beta(\theta) = \argmax_\beta \Er[\ell(W;\theta,\beta)] \]
  • First order condition from \(\max_\theta \Er[\ell(W;\theta,\beta(\theta))]\) is \[ 0 = \Er\left[ \underbrace{\frac{\partial \ell}{\partial \theta} + \frac{\partial \ell}{\partial \beta} \frac{d \beta}{d \theta}}_{\psi(W;\theta,\beta(\theta))} \right] \]
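  • For example (a quick sketch), applying this to the partially linear model with \(\ell(W;\theta,f) = -\tfrac{1}{2}(y - \theta d - f(x))^2\): concentrating out \(f\) gives \(f_\theta(x) = \Er[y|x] - \theta \Er[d|x] = \mu(x) - \theta m(x)\), so \(\frac{d f_\theta}{d\theta}(x) = -m(x)\) and \[ \psi(W;\theta,f_\theta) = (y - \theta d - f_\theta(x))(d - m(x)) = \left((y - \mu(x)) - \theta(d - m(x))\right)(d - m(x)), \] which is exactly the orthogonal moment condition from the partially linear model example above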

Example: average derivative

  • \(x,y \in \R^1\), \(\Er[y|x] = f_0(x)\), \(p(x) =\) density of \(x\)

  • \(\theta_0 = \Er[f_0'(x)]\)

  • Joint objective \[ \min_{\theta,f} \Er\left[ (y - f(x))^2 + (\theta - f'(x))^2 \right] \]

  • Solving for the minimizing \(f\) given \(\theta\): \[ f_\theta(x) = \Er[y|x] - \theta \partial_x \log p(x) + f_\theta''(x) + f_\theta'(x) \partial_x \log p(x) \]

Example: average derivative

  • Concentrated objective: \[ \min_\theta \Er\left[ (y - f_\theta(x))^2 + (\theta - f_\theta'(x))^2 \right] \]

  • First order condition at \(f_\theta = f_0\) gives \[ 0 = \Er\left[ (y - f_0(x))\partial_x \log p(x) + (\theta - f_0'(x)) \right] \]
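A minimal numerical check of this orthogonal score on simulated data. To keep the sketch short, it assumes \(x \sim N(0,1)\) so that \(\partial_x \log p(x) = -x\) is known, and it estimates \(f\) with a cross-fitted polynomial regression; these are illustrative choices, not part of the lecture.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)                        # standard normal, so d/dx log p(x) = -x
y = np.sin(x) + 0.5 * rng.normal(size=n)      # f0(x) = sin(x), theta0 = E[cos(x)]

score = np.zeros(n)
for train, test in KFold(5, shuffle=True, random_state=0).split(x):
    fhat = make_pipeline(PolynomialFeatures(5), LinearRegression())
    fhat.fit(x[train, None], y[train])
    f = fhat.predict(x[test, None])
    eps = 1e-4                                # numerical derivative of the fitted f
    fprime = (fhat.predict(x[test, None] + eps) - fhat.predict(x[test, None] - eps)) / (2 * eps)
    # orthogonal score: theta = E[ f'(x) - (y - f(x)) d/dx log p(x) ]
    score[test] = fprime - (y[test] - f) * (-x[test])

print(f"estimate: {score.mean():.3f}, std. error: {score.std()/np.sqrt(n):.3f}, "
      f"sample E[cos(x)]: {np.mean(np.cos(x)):.3f}")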

Orthogonal scores via Two Other Methods

  • “Orthogonal” suggests that ideas from linear algebra will be useful, and they are

  • Projection: take the projection of the moments that is orthogonal to \(\eta_0\)

  • Riesz representer

Example: Gender Wage Gap

Data

imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.model_selection import cross_val_predict
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
import statsmodels as sm
import statsmodels.formula.api as smf
from statsmodels.iolib.summary2 import summary_col
cpsall = pd.read_stata("https://www.nber.org/morg/annual/morg20.dta")
# take subset of data just to reduce computation time
cps = cpsall.sample(30000, replace=False, random_state=0)
cps.describe()
(Output of cps.describe(): summary statistics for the 30,000-observation sample, 8 rows × 55 columns)

Partial Linear Model

def partial_linear(y, d, X, yestimator, destimator, folds=3):
    """Estimate the partially linear model y = d*C + f(x) + e

    Parameters
    ----------
    y : array_like
        vector of outcomes
    d : array_like
        vector or matrix of regressors of interest
    X : array_like
        matrix of controls
    yestimator : Estimator object for partialling X out of y. Must have ‘fit’
        and ‘predict’ (or ‘predict_proba’) methods.
    destimator : Estimator object for partialling X out of d. Must have ‘fit’
        and ‘predict’ (or ‘predict_proba’) methods.
    folds : int
        Number of folds for cross-fitting

    Returns
    -------
    ols : statsmodels regression results containing estimate of coefficient on d.
    yhat : cross-fitted predictions of y
    dhat : cross-fitted predictions of d
    """

    # we want predicted probabilities if y or d is discrete
    ymethod = "predict_proba" if hasattr(yestimator, "predict_proba") else "predict"
    dmethod = "predict_proba" if hasattr(destimator, "predict_proba") else "predict"
    # get the predictions
    yhat = cross_val_predict(yestimator,X,y,cv=folds,method=ymethod)
    dhat = cross_val_predict(destimator,X,d,cv=folds,method=dmethod)
    ey = np.array(y - yhat)
    ed = np.array(d - dhat)
    ols = sm.regression.linear_model.OLS(ey,ed).fit(cov_type='HC0')

    return(ols, yhat, dhat)

Unconditional Gap

Code
cps["female"] = (cps.sex==2)
cps["log_earn"] = np.log(cps.earnwke)
cps.loc[np.isinf(cps.log_earn), "log_earn"] = np.nan
cps["log_uhours"] = np.log(cps.uhourse)
cps.loc[np.isinf(cps.log_uhours), "log_uhours"] = np.nan
cps["log_hourslw"] = np.log(cps.hourslw)
cps.loc[np.isinf(cps.log_hourslw),"log_hourslw"] = np.nan
cps["log_wageu"] = cps.log_earn - cps.log_uhours
cps["log_wagelw"] = cps.log_earn - cps.log_hourslw

lm = list()
lm.append(smf.ols(formula="log_earn ~ female", data=cps,
                  missing="drop").fit(cov_type='HC0'))
lm.append( smf.ols(formula="log_wageu ~ female", data=cps,
                   missing="drop").fit(cov_type='HC0'))
lm.append(smf.ols(formula="log_wagelw ~ female", data=cps,
                  missing="drop").fit(cov_type='HC0'))
lm.append(smf.ols(formula="log_earn ~ female + log_hourslw + log_uhours", data=cps,
                  missing="drop").fit(cov_type='HC0'))
summary_col(lm, stars=True)
                 log_earn I  log_wageu I  log_wagelw I  log_earn II
Intercept        6.8607***   3.2022***    3.2366***     2.0104***
                 (0.0091)    (0.0080)     (0.0088)      (0.0945)
female[T.True]   -0.2965***  -0.1729***   -0.1746***    -0.1363***
                 (0.0133)    (0.0112)     (0.0124)      (0.0115)
log_hourslw                                             0.0227
                                                        (0.0223)
log_uhours                                              1.3025***
                                                        (0.0372)
R-squared        0.0323      0.0166       0.0136        0.3367
R-squared Adj.   0.0323      0.0165       0.0135        0.3366

Standard errors in parentheses.
* p<.1, ** p<.05, *** p<.01

Adding Controls

from patsy import dmatrices
fmla = "log_earn + female ~ log_uhours + log_hourslw + I(age**2) + age + C(grade92) + C(race) + C(smsastat) + C(unionmme) + C(unioncov)"  # C(ind02) + C(occ2012)
yd, X = dmatrices(fmla,cps)
female = yd[:,1]
logearn = yd[:,2];

Estimating \(\eta\)

alphas = np.exp(np.linspace(-2, -12, 20))
lassoy = linear_model.LassoCV(cv=4, alphas=alphas, max_iter=5000).fit(X,logearn)
lassod = linear_model.LassoCV(cv=4, alphas=alphas, max_iter=5000).fit(X,female);

Estimating \(\eta\)

Code
fig, ax = plt.subplots(1,2)

def plotlassocv(l, ax) :
    alphas = l.alphas_
    mse = l.mse_path_.mean(axis=1)
    std_error = l.mse_path_.std(axis=1)
    ax.plot(alphas,mse)
    ax.fill_between(alphas, mse + std_error, mse - std_error, alpha=0.2)

    ax.set_ylabel('MSE +/- std error')
    ax.set_xlabel('alpha')
    ax.set_xlim([alphas[0], alphas[-1]])
    ax.set_xscale("log")
    return(ax)

ax[0] = plotlassocv(lassoy,ax[0])
ax[0].set_title("MSE for log(earn)")
ax[1] = plotlassocv(lassod,ax[1])
ax[1].set_title("MSE for female")
fig.tight_layout()
fig.show()

Estimating \(\eta\)

def pickalpha(lassocv) :
    #imin = np.argmin(lassocv.mse_path_.mean(axis=1))
    #msemin = lassocv.mse_path_.mean(axis=1)[imin]
    #se = lassocv.mse_path_.std(axis=1)[imin]
    #alpha= min([alpha for (alpha, mse) in zip(lassocv.alphas_, lassocv.mse_path_.mean(axis=1)) if mse<msemin+se])
    alpha = lassocv.alpha_
    return(alpha)

alphay = pickalpha(lassoy)
alphad = pickalpha(lassod)

Estimate \(\theta\)

pl_lasso = partial_linear(logearn, female, X,
                          linear_model.Lasso(alpha=alphay),
                          linear_model.Lasso(alpha=alphad))
pl_lasso[0].summary(slim=True)
OLS Regression Results
Dep. Variable:    y       No. Observations:   12038
Model:            OLS     Covariance Type:    HC0

        coef    std err        z      P>|z|    [0.025    0.975]
x1   -0.1868      0.011  -16.643      0.000    -0.209    -0.165


Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors are heteroscedasticity robust (HC0)

Examining Predictions

Code
import seaborn as sns
# Visualize predictions
def preddf(pl):
    df = pd.DataFrame({"logearn":logearn,
                       "predicted":pl[1],
                       "female":female,
                       "P(female|x)":pl[2]})
    return(df)

def plotpredictions(df) :
    fig, ax = plt.subplots(2,1)
    sns.scatterplot(x = df.predicted, y = df.logearn-df.predicted, hue=df.female, ax=ax[0])
    ax[0].set_title("Prediction Errors for Log Earnings")

    sns.histplot(df["P(female|x)"][df.female==0], kde = False,
                 label = "Male", ax=ax[1])
    sns.histplot(df["P(female|x)"][df.female==1], kde = False,
                 label = "Female", ax=ax[1])
    ax[1].set_title('P(female|x)')
    return(fig)

fig=plotpredictions(preddf(pl_lasso))
fig.show()


Software

Using doubleml

Code
import doubleml
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

dml_data = doubleml.DoubleMLData.from_arrays(X,logearn,female)
# Initialize learners
Cs = 0.0001*np.logspace(0, 4, 10)
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=10000))
lasso_class = make_pipeline(StandardScaler(),
                            LogisticRegressionCV(cv=5, penalty='l1', solver='liblinear',
                                                 Cs = Cs, max_iter=1000))

np.random.seed(123)
dml_plr_lasso = doubleml.DoubleMLPLR(dml_data,
                                     ml_l = lasso,
                                     ml_m = lasso_class,
                                     n_folds = 3)
dml_plr_lasso.fit(store_predictions=True)
print(dml_plr_lasso.summary)
       coef   std err          t         P>|t|     2.5 %   97.5 %
d -0.181669  0.011178 -16.252139  2.157063e-59 -0.203578 -0.15976

Visualizing Predictions: Lasso & Logistic Lasso

Code
def dmlpreddf(dml_model):
    df=pd.DataFrame({"logearn":logearn,
                     "predicted":dml_model.predictions['ml_l'].flatten(),
                     "female":female,
                     "P(female|x)":dml_model.predictions['ml_m'].flatten()})
    return(df)

plotpredictions(dmlpreddf(dml_plr_lasso)).show()


With Gradient Boosted Trees

Code
from lightgbm import LGBMClassifier, LGBMRegressor
ylearner = LGBMRegressor(force_col_wise=True)
dlearner = LGBMClassifier(force_col_wise=True)
dml_plr_gbt = doubleml.DoubleMLPLR(dml_data, ylearner, dlearner)
dml_plr_gbt.fit(store_predictions=True)
print(dml_plr_gbt.summary)
(LightGBM training log messages omitted)
       coef   std err          t         P>|t|     2.5 %    97.5 %
d -0.163842  0.011051 -14.826095  9.934713e-50 -0.185501 -0.142182

Visualizing Predictions: Trees

Code
plotpredictions(dmlpreddf(dml_plr_gbt)).show()


Interactive Regression Model

  • Similar to matching, the partially linear regression model can suffer from misspecification bias if the effect of \(D\) varies with \(X\)
  • Interactive regression model: \[ \begin{align*} Y & = g_0(D,X) + U \\ D & = m_0(X) + V \end{align*} \]
  • The mechanics are the same as for matching with heterogeneous effects
  • The orthogonal moment condition is the same as for doubly robust matching (shown below)
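  • Concretely, writing \(g(d,x)\) for \(\Er[Y|D=d,X=x]\) and \(m(x)\) for \(\Pr(D=1|X=x)\), the orthogonal score is the doubly robust one from the matching slides: \[ \psi(W;\theta,\eta) = g(1,X) - g(0,X) + \frac{D(Y - g(1,X))}{m(X)} - \frac{(1-D)(Y - g(0,X))}{1-m(X)} - \theta \]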

Interactive Regression Model - Lasso

Code
np.random.seed(123)
dml_irm_lasso = doubleml.DoubleMLIRM(dml_data,
                                     ml_g = lasso,
                                     ml_m = lasso_class,
                                     trimming_threshold = 0.01,
                                     n_folds = 3)
dml_irm_lasso.fit()
dml_irm_lasso.summary
coef std err t P>|t| 2.5 % 97.5 %
d -0.229206 0.03706 -6.184647 6.224143e-10 -0.301843 -0.156569

Interactive Regression Model - Trees

Code
np.random.seed(123)
dml_irm_gbt = doubleml.DoubleMLIRM(dml_data,
                                     ml_g = ylearner,
                                     ml_m = dlearner,
                                     trimming_threshold = 0.01,
                                     n_folds = 3)
dml_irm_gbt.fit()
dml_irm_gbt.summary
(LightGBM training log messages omitted)
coef std err t P>|t| 2.5 % 97.5 %
d -0.159562 0.013846 -11.524304 9.951303e-31 -0.1867 -0.132425

Sources and Further Reading

References

Belloni, Alexandre, Victor Chernozhukov, and Christian Hansen. 2014. “Inference on Treatment Effects After Selection Among High-Dimensional Controls.” The Review of Economic Studies 81 (2): 608–50. https://doi.org/10.1093/restud/rdt044.
Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, and Whitney Newey. 2017. “Double/Debiased/Neyman Machine Learning of Treatment Effects.” American Economic Review 107 (5): 261–65. https://doi.org/10.1257/aer.p20171038.
Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. “Double/Debiased Machine Learning for Treatment and Structural Parameters.” The Econometrics Journal 21 (1): C1–68. https://doi.org/10.1111/ectj.12097.
Chernozhukov, Victor, Christian Hansen, and Martin Spindler. 2015. “Valid Post-Selection and Post-Regularization Inference: An Elementary, General Approach.” Annual Review of Economics 7 (1): 649–88. https://doi.org/10.1146/annurev-economics-012315-015826.
Knaus, Michael C. 2022. “Double Machine Learning-Based Programme Evaluation Under Unconfoundedness.” The Econometrics Journal 25 (3): 602–27. https://doi.org/10.1093/ectj/utac015.
Hines, Oliver, Karla Diaz-Ordaz, Oliver Dukes, and Stijn Vansteelandt. 2022. “Demystifying Statistical Learning Based on Efficient Influence Functions.” The American Statistician 76 (3): 292–304. https://doi.org/10.1080/00031305.2021.2021984.