ECON 626: Midterm

Published October 16, 2024

Problem 1

\[ \def\R{{\mathbb{R}}} \def\Er{{\mathrm{E}}} \def\var{{\mathrm{Var}}} \newcommand\norm[1]{\left\lVert#1\right\rVert} \def\cov{{\mathrm{Cov}}} \def\En{{\mathbb{En}}} \def\rank{{\mathrm{rank}}} \newcommand{\inpr}{ \overset{p^*_{\scriptscriptstyle n}}{\longrightarrow}} \def\inprob{{\,{\buildrel p \over \rightarrow}\,}} \def\indist{\,{\buildrel d \over \rightarrow}\,} \DeclareMathOperator*{\plim}{plim} \]

Suppose \(y_i^* = x_i' \beta + \epsilon_i\) for \(i=1, ..., n\) with \(x_i \in \R^k\). Also suppose \(y_i^*\) is not always observed. Instead, you only observe \(y_i = y_i^*\) if \(o_i = 1\). The observed data is \(\left\lbrace\left(x_i, o_i, y_i = \begin{cases} y_i^* & \text{ if } o_i = 1 \\ \text{missing} & \text{ otherwise} \end{cases}\right)\right\rbrace_{i=1}^n\). Throughout, assume that observations for different \(i\) are independent and identically distributed, \(\Er[\epsilon_i | x_i] = 0\), and \(\Er[x_ix_i' o_i]\) is nonsingular.

Identification

  1. Show that if \(\epsilon_i\) is independent of \(o_i\) conditional on \(x_i\), then \(\beta\) is identified.

  2. No longer assuming \(\epsilon_i\) is independent of \(o_i\), show that \(\beta\) is not identified by finding an observationally equivalent \(\tilde{\beta}\). [Hint: suppose \(\Er[o | x] < 1\) and consider \(\tilde{\epsilon} = \begin{cases} x_i' (\beta - \tilde{\beta}) + \epsilon_i & \text{ if } o_i = 1 \\ x_i'(\tilde{\beta} - \beta) \frac{\Er[o|x_i]}{\Er[1-o|x_i]} + \epsilon_i & \text{ if } o_i = 0 \end{cases}\). You should verify that this \(\tilde{\epsilon}\) still meets the assumption that \(\Er[\tilde{\epsilon} | x_i] = 0\).]

  3. Now suppose you observe \(z_i \in \R^k\) such that \(\Er[o_i z_i x_i']\) is nonsingular and \(\Er[\epsilon z_i | o_i = 1] = 0\). Show that \(\beta\) is identified.

Solution.

  1. Since \(\epsilon_i\) is independent of \(o_i\) given \(x_i\), we have \[ \Er[x_i o_i \epsilon_i ] = \Er[x_i o_i \Er[\epsilon_i | x_i, o_i]] = \Er[x_i o_i \Er[\epsilon_i | x_i]] = 0. \] We can use this for identification: \[ \begin{align*} \Er[x_i o_i (y_i - x_i'\beta)] & = 0 \\ \beta & = \Er[x_ix_i'o_i]^{-1} \Er[x_i y_i o_i], \end{align*} \] which expresses \(\beta\) as a known function of the distribution of the observed data (see the simulation sketch after this list).

  2. As suggested by the hint, given any observed \(\{y_i, x_i, o_i\}\) generated by \(\beta\) and \(\epsilon_i\), note that \[ \begin{align*} y_i & = x_i'\beta + \epsilon_i \text{ if } o_i = 1 \\ & = x_i'\tilde{\beta} + \underbrace{x_i'(\beta - \tilde{\beta}) + \epsilon_i}_{\tilde{\epsilon}_i} \text{ if } o_i = 1, \end{align*} \] so changing \(\beta\) to \(\tilde{\beta}\) and \(\epsilon\) to \(\tilde{\epsilon}\) leaves the distribution of the observed data unchanged. Also, writing \(\tilde{\epsilon} = o x'(\beta - \tilde{\beta}) + (1-o) x'(\tilde{\beta} - \beta)\frac{\Er[o|x]}{\Er[1-o|x]} + \epsilon\) and using \(\Er[\epsilon|x] = 0\), \[ \begin{align*} \Er[\tilde{\epsilon} | x] & = \Er\left[o x'(\beta-\tilde{\beta}) + (1-o)x'(\tilde{\beta} - \beta) \frac{\Er[o|x]}{\Er[1-o|x]} + \epsilon \,\Big|\, x\right] \\ & = \Er[o|x]x'(\beta-\tilde{\beta}) + \Er[1-o|x]x'(\tilde{\beta} - \beta) \frac{\Er[o|x]}{\Er[1-o|x]} \\ & = 0, \end{align*} \] so \(\tilde{\epsilon}\) still meets the restriction that \(\Er[\tilde{\epsilon}|x]=0\).

  3. Similar to part 1, \[ \Er[o_i z_i \epsilon_i ] = \Er[o_i \Er[z_i \epsilon_i | o_i]] = 0, \] so \[ \begin{align*} \Er[z_i o_i (y_i - x_i'\beta)] & = 0 \\ \beta & = \Er[z_ix_i'o_i]^{-1} \Er[z_i y_i o_i] \end{align*} \] identifies \(\beta\).
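
To make part 1 concrete, here is a minimal Monte Carlo sketch (not part of the exam). It assumes an arbitrary data-generating process purely for illustration: \(x_i\) containing an intercept and one standard normal regressor, logistic selection depending only on \(x_i\), and \(\epsilon_i\) drawn independently of \(o_i\). The sample analogue of \(\Er[x_ix_i'o_i]^{-1} \Er[x_i y_i o_i]\) should then be close to \(\beta\).

```python
# Illustrative check of the moment condition from part 1: with selection that
# depends only on x, the sample analogue of E[x x' o]^{-1} E[x y o] recovers beta.
# The DGP below (logistic selection, normal errors) is an assumption of this sketch.
import numpy as np

rng = np.random.default_rng(0)
n, beta = 1_000_000, np.array([1.0, -2.0])

x = np.column_stack([np.ones(n), rng.normal(size=n)])  # x_i = (1, x_{i2})'
eps = rng.normal(size=n)                               # epsilon_i independent of o_i given x_i
o = rng.binomial(1, 1 / (1 + np.exp(-x[:, 1])))        # selection depends only on x_i
y = x @ beta + eps                                     # y_i is used only where o_i = 1 below

# sample analogue of E[x x' o]^{-1} E[x y o]; rows with o_i = 0 get zero weight
beta_hat = np.linalg.solve((o[:, None] * x).T @ x, x.T @ (o * y))
print(beta_hat)  # close to (1.0, -2.0)
```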

Estimation

  1. Construct a sample estimator for \(\beta\) based on your answer to part 1 of the Identification section and show that it is unbiased.

  2. Suppose that \(\epsilon_i \sim N(0, 1)\), independent of \(o_i\) and \(x_i\), and that \(P(o_i=1|x_i) = g(x_i'\alpha)\) for some known function \(g\) and unknown parameter \(\alpha\). Write the loglikelihood for \((\alpha,\beta)\), and show that \(\hat{\beta}^{MLE}\) does not depend on \(\alpha\). Remember that the normal pdf is \(\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\).

Solution.

  1. We have \[ \begin{align*} \Er[\hat{\beta}] & = \Er\left[ (\sum o_i x_i x_i')^{-1} (\sum o_i x_i y_i) \right] \\ & = \Er\left[ (\sum o_i x_i x_i')^{-1} (\sum o_i x_i (x_i'\beta + \epsilon_i)) \right] \\ & = \beta + \Er\left[ (\sum o_i x_i x_i')^{-1} (\sum o_i x_i \epsilon_i) \right] \\ & = \beta + \Er\left[ (\sum o_i x_i x_i')^{-1} (\sum o_i x_i \Er[\epsilon_i|x_1,...,x_n, o_1,...,o_n]) \right] \\ & = \beta \end{align*} \]

  2. The log likelihood is \[ \log \ell(\alpha,\beta) = \sum_{i=1}^n \left( o_i\left(\frac{-\log(2\pi)}{2} - \frac{(y_i - x_i'\beta)^2}{2} + \log(g(x_i'\alpha))\right) + (1-o_i)\log(1-g(x_i'\alpha)) \right) \] The first order condition for \(\hat{\beta}^{MLE}\) is \[ 0 = \sum o_i(y_i -x_i'\hat{\beta}^{MLE}) x_i \] which does not depend on \(\alpha\).
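
As a numerical cross-check (again illustrative, not part of the exam), the sketch below takes \(g\) to be the logistic cdf for concreteness, maximizes the log likelihood above over \((\alpha, \beta)\) jointly, and confirms that \(\hat{\beta}^{MLE}\) coincides with the complete-case least squares estimator from part 1, regardless of \(\alpha\).

```python
# Sketch: with an assumed logistic g, the beta that maximizes the log likelihood
# equals least squares on the o_i = 1 subsample, as the first order condition shows.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, alpha0, beta0 = 5_000, np.array([0.2, 0.5]), np.array([1.0, -2.0])

x = np.column_stack([np.ones(n), rng.normal(size=n)])
o = rng.binomial(1, 1 / (1 + np.exp(-x @ alpha0)))   # P(o=1|x) = logistic(x'alpha), assumed
y = x @ beta0 + rng.normal(size=n)                   # epsilon_i ~ N(0, 1)

def negloglik(theta):
    a, b = theta[:2], theta[2:]
    g = np.clip(1 / (1 + np.exp(-x @ a)), 1e-10, 1 - 1e-10)
    resid2 = np.where(o == 1, (y - x @ b) ** 2, 0.0)  # residuals enter only when o_i = 1
    ll = o * (-0.5 * np.log(2 * np.pi) - 0.5 * resid2 + np.log(g)) + (1 - o) * np.log(1 - g)
    return -ll.sum()

theta_hat = minimize(negloglik, np.zeros(4), method="BFGS").x
beta_ols = np.linalg.solve((o[:, None] * x).T @ x, x.T @ (o * y))
print(theta_hat[2:], beta_ols)                       # the two estimates of beta agree
```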

Efficiency

As in the previous part, suppose that \(\epsilon_i \sim N(0, 1)\), independent of \(o_i\) and \(x_i\), and that \(P(o_i=1|x_i) = g(x_i'\alpha)\) for some known function \(g\) and unknown parameter \(\alpha\).

  1. Derive the Cramér-Rao lower bound for \(\theta = (\alpha,\beta)\)

  2. Now, treat \(\alpha\) as known and derive the Cramér-Rao lower bound for just \(\beta\). Does knowing \(\alpha\) help with estimating \(\beta\)?

Solution.

  1. The score for \(\alpha\) is \(\frac{\partial}{\partial \alpha}\ell(\alpha,\beta) = \sum_{i=1}^n \left(\frac{o_i}{g(x_i'\alpha)} - \frac{1-o_i}{1-g(x_i'\alpha)}\right) g'(x_i'\alpha) x_i\) and the score for \(\beta\) is \(\frac{\partial}{\partial \beta}\ell(\alpha,\beta) = \sum_{i=1}^n o_i(y_i - x_i'\beta) x_i\).
The Hessian is \[ H = \begin{pmatrix} -\sum_{i=1}^n o_i x_i x_i' & 0 \\ 0 & A_n(\alpha) \end{pmatrix} \] where \(A_n(\alpha)\) is the derivative of the score for \(\alpha\) with respect to \(\alpha\); the off-diagonal blocks are zero because the score for \(\alpha\) does not depend on \(\beta\) and the score for \(\beta\) does not depend on \(\alpha\). The Cramér-Rao lower bound is \[ \Er[-H]^{-1} = \begin{pmatrix} \left(n \Er[o_i x_i x_i']\right)^{-1} & 0 \\ 0 & \Er[-A_n(\alpha)]^{-1} \end{pmatrix}. \]
  2. Now the Cramér-Rao lower bound is just \(\left(n \Er[o_i x_i x_i']\right)^{-1}\), which is the same as the corresponding block of the bound when \(\alpha\) was unknown. Because the information matrix is block diagonal, knowing \(\alpha\) does not help estimate \(\beta\) in this situation.
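
The block-diagonal structure behind this answer can be seen numerically. The sketch below (not part of the exam, and again assuming a logistic \(g\) for concreteness) estimates the per-observation information matrix by averaging outer products of the score at the true parameters; the \((\alpha, \beta)\) cross blocks come out approximately zero, and the \(\beta\) block matches \(\Er[g(x_i'\alpha) x_i x_i'] = \Er[o_i x_i x_i']\).

```python
# Estimate E[s s'] per observation by simulation and inspect its block structure.
# Logistic g is an assumption of this sketch; the exam leaves g generic.
import numpy as np

rng = np.random.default_rng(2)
n, alpha0, beta0 = 200_000, np.array([0.2, 0.5]), np.array([1.0, -2.0])

x = np.column_stack([np.ones(n), rng.normal(size=n)])
g = 1 / (1 + np.exp(-x @ alpha0))
o = rng.binomial(1, g)
y = x @ beta0 + rng.normal(size=n)

# per-observation scores at the true (alpha, beta)
s_alpha = (o - g)[:, None] * x                   # (o_i - g(x_i'alpha)) x_i for logistic g
s_beta = (o * (y - x @ beta0))[:, None] * x      # o_i (y_i - x_i'beta) x_i
score = np.hstack([s_alpha, s_beta])

info = score.T @ score / n                       # estimate of E[s s'] per observation
print(np.round(info, 3))                         # (alpha, beta) cross blocks are ~ 0
print(np.round((g[:, None] * x).T @ x / n, 3))   # E[g(x'alpha) x x'] = E[o x x'], the beta block
```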

Problem 2

Consider the probability space with sample space \(\Omega = \{a, b, c, d\}\), sigma field \(\mathscr{F} = 2^\Omega\), the power set of \(\Omega\), and probability measure \(P\).

Generated \(\sigma\)-fields

Let \(X(\omega) = \begin{cases} 1 & \text{ if } \omega = a \text{ or } b \\ 0 & \text{ otherwise} \end{cases}\). List the elements of \(\sigma(X)\).

Solution. \(\sigma(X) = \{ \{a,b\}, \{c,d\}, \Omega, \emptyset \}\).
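
Because \(X\) has a finite range, \(\sigma(X)\) can also be enumerated mechanically as the preimages \(X^{-1}(B)\) over all subsets \(B\) of the range, matching the list above. A small illustrative sketch:

```python
# Enumerate sigma(X) on the four-point sample space as preimages X^{-1}(B)
# over all subsets B of the range of X (enough here because X takes two values).
from itertools import combinations

omega = ["a", "b", "c", "d"]
X = {"a": 1, "b": 1, "c": 0, "d": 0}

values = sorted(set(X.values()))
subsets = [set(c) for r in range(len(values) + 1) for c in combinations(values, r)]
sigma_X = {frozenset(w for w in omega if X[w] in B) for B in subsets}
print([set(s) for s in sigma_X])   # {'a','b'}, {'c','d'}, Omega, and the empty set
```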

Independence

Let \(Y(\omega) = \begin{cases} 1 & \text{ if } \omega = b \text{ or } c \\ 0 & \text{ otherwise} \end{cases}\). Can \(X\) and \(Y\) be independent?

Solution. Yes, for example if \(P(a) = P(b) = P(c) = P(d) = 1/4\), then \(P(X=x,Y=y) = 1/4 = P(X=x)P(Y=y)\) for all \((x, y) \in \{0,1\}^2\).
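
A quick enumeration confirms the factorization claimed above under the uniform measure (the \(1/4\) probabilities below are the example's assumption):

```python
# Verify P(X=x, Y=y) = P(X=x) P(Y=y) for every (x, y) under the uniform P.
P = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}   # uniform measure from the solution
X = {"a": 1, "b": 1, "c": 0, "d": 0}
Y = {"a": 0, "b": 1, "c": 1, "d": 0}

for xv in (0, 1):
    for yv in (0, 1):
        joint = sum(p for w, p in P.items() if X[w] == xv and Y[w] == yv)
        product = (sum(p for w, p in P.items() if X[w] == xv)
                   * sum(p for w, p in P.items() if Y[w] == yv))
        assert abs(joint - product) < 1e-12        # factorization holds
print("X and Y are independent under the uniform P")
```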

Testing

(For this and subsequent parts, ignore any restrictions on \(\theta_x\) and \(\theta_y\) implied by the sample space and form of \(X\) and \(Y\) in the first two parts).

Suppose you observe an independent and identically distributed sample of \(\{(x_i, y_i)\}_{i=1}^n\). For each \(i\), \(x_i = 1\) with probability \(\theta_x\) and 0 otherwise, and \(y_i = 1\) with probability \(\theta_y\) and 0 otherwise. Assume \(x_i\) is independent of \(y_i\).

  1. Find the most powerful test for testing \(H_0: (\theta_x, \theta_y) = (\theta_x^0, \theta_y^0)\) against \(H_1: (\theta_x, \theta_y) = (\theta_x^1, \theta_y^1)\).

  2. Show that there is a most powerful test for testing \(H_0: \theta_x = \theta_x^0\) against \(H_1: \theta_x = \theta_x^1\), where under the null and alternative, \(\theta_y\) is unrestricted.

Solution.

  1. By the Neyman-Pearson lemma, the likelihood ratio test is most powerful. The log likelihood ratio is \[ \begin{align*} \tau = \sum_{i=1}^n & x_i(\log(\theta_x^1) - \log(\theta_x^0)) + (1-x_i)(\log(1-\theta_x^1)-\log(1-\theta_x^0)) + \\ & + y_i(\log(\theta_y^1) - \log(\theta_y^0)) + (1-y_i)(\log(1-\theta_y^1)-\log(1-\theta_y^0)) \\ = & n_x(\log(\theta_x^1) - \log(\theta_x^0)) + (n-n_x)(\log(1-\theta_x^1)-\log(1-\theta_x^0)) + \\ & + n_y(\log(\theta_y^1) - \log(\theta_y^0)) + (n-n_y)(\log(1-\theta_y^1)-\log(1-\theta_y^0)) \\ \end{align*} \]

where \(n_x = \sum_{i=1}^n x_i\) and \(n_y = \sum_{i=1}^n y_i\). For a test of size \(\alpha\), we would find \(c\) such that \(P(\tau>c|H_0) = \alpha\) and reject if \(\tau > c\).

  2. We can interpret this as testing the point null and alternative \(H_0: \theta_x = \theta_x^0, \theta_y = \theta_y^0\), against \(H_1: \theta_x = \theta_x^1, \theta_y = \theta_y^1\) with \(\theta_y^1 = \theta_y^0\). In that case, the test statistic becomes \[ \tau = n_x(\log(\theta_x^1) - \log(\theta_x^0)) + (n-n_x)(\log(1-\theta_x^1)-\log(1-\theta_x^0)) \] and importantly does not depend on \(\theta_y\) nor \(n_y\). Thus, the same likelihood ratio test will be most powerful for any \(\theta_y\).
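
For concreteness, here is a sketch of this test with hypothetical values \(\theta_x^0 = 0.5\), \(\theta_x^1 = 0.6\), and size \(0.05\) (none of these numbers come from the exam). Since \(\tau\) is increasing in \(n_x\) when \(\theta_x^1 > \theta_x^0\), rejecting for large \(\tau\) is equivalent to rejecting for large \(n_x\), and the cutoff can be read from the Binomial\((n, \theta_x^0)\) distribution; randomization at the boundary is ignored, so the test is slightly conservative.

```python
# Most powerful test for H0: theta_x = 0.5 vs H1: theta_x = 0.6, via the
# equivalent rejection rule "n_x large" (hypothetical parameter values).
import numpy as np
from scipy.stats import binom

n, theta0, theta1, size = 200, 0.5, 0.6, 0.05

c = binom.ppf(1 - size, n, theta0)            # smallest c with P(n_x > c | H0) <= size
print("reject H0 when n_x >", c)
print("size:", 1 - binom.cdf(c, n, theta0))
print("power at theta_x^1:", 1 - binom.cdf(c, n, theta1))

x = np.random.default_rng(3).binomial(1, theta1, size=n)   # one simulated sample
print("n_x =", x.sum(), "-> reject:", bool(x.sum() > c))
```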

Behavior of Averages

  1. Note that \(\Er[x_i] = \theta_x\) and \(\Er[(x_i - \theta_x)^2] = \theta_x(1-\theta_x)\). Use Markov’s inequality to show that \[ P\left(\left\vert \frac{1}{n} \sum_{i=1}^n x_i - \theta_x \right\vert > \epsilon \right) \leq \frac{\theta_x(1-\theta_x)}{n \epsilon^2}. \]

  2. Show that \[ \lim_{n \to \infty} P\left(\left\vert \frac{1}{n} \sum_{i=1}^n x_i - \theta_x \right\vert > \epsilon \right) = 0. \]

Solution.

  1. This is Markov’s inequality with \(k=2\), because \(\Er[\left(\frac{1}{n} \sum_{i=1}^n x_i - \theta_x \right)^2] = \frac{\theta_x (1-\theta_x)}{n}\).

  2. Taking the limit as \(n \to \infty\) of the bound from the previous part gives the conclusion, since \(\frac{\theta_x(1-\theta_x)}{n \epsilon^2} \to 0\).
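
Both the bound and the limit are easy to see in a short simulation (the parameter values below are arbitrary): the empirical tail frequency stays below \(\theta_x(1-\theta_x)/(n\epsilon^2)\), and both shrink toward zero as \(n\) grows.

```python
# Monte Carlo check of P(|mean(x) - theta_x| > eps) <= theta_x(1-theta_x)/(n eps^2).
import numpy as np

rng = np.random.default_rng(4)
theta_x, eps, reps = 0.3, 0.05, 20_000             # arbitrary illustration values

for n in (100, 400, 1600):
    means = rng.binomial(1, theta_x, size=(reps, n)).mean(axis=1)
    freq = np.mean(np.abs(means - theta_x) > eps)  # empirical tail probability
    bound = theta_x * (1 - theta_x) / (n * eps ** 2)
    print(f"n={n:5d}  empirical={freq:.4f}  bound={min(bound, 1.0):.4f}")
```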

Definitions and Results

  • Measure and Probability:

    • A collection of subsets, \(\mathscr{F}\), of \(\Omega\) is a \(\sigma\)-field if

      1. \(\Omega \in \mathscr{F}\)
      2. If \(A \in \mathscr{F}\), then \(A^c \in \mathscr{F}\)
      3. If \(A_1, A_2, ... \in \mathscr{F}\), then \(\cup_{j=1}^\infty A_j \in \mathscr{F}\)
    • A measure is a function \(\mu: \mathscr{F} \to [0, \infty]\) s.t.

      1. \(\mu(\emptyset) = 0\)
      2. If \(A_1, A_2, ... \in \mathscr{F}\) are pairwise disjoint, then \(\mu\left(\cup_{j=1}^\infty A_j \right) = \sum_{j=1}^\infty \mu(A_j)\)
    • The Lebesgue integral is

      1. Positive: if \(f \geq 0\) a.e., then \(\int f d\mu \geq 0\)
      2. Linear: \(\int (af + bg) d\mu = a\int f d\mu + b \int g d \mu\)
    • Radon-Nikodym derivative: if \(\nu \ll \mu\), then \(\exists\) a nonnegative measurable function, \(\frac{d\nu}{d\mu}\), s.t. \[ \nu(A) = \int_A \frac{d\nu}{d\mu} d\mu \]

    • Monotone convergence: If \(f_n:\Omega \to \mathbf{R}\) are measurable, \(f_{n}\geq 0\), and for each \(\omega \in \Omega\), \(f_{n}(\omega )\uparrow f(\omega )\), then \(\int f_{n}d\mu \uparrow \int fd\mu\) as \(n\rightarrow \infty\)

    • Dominated convergence: If \(f_n:\Omega \to \mathbf{R}\) are measurable, \(f_{n}(\omega )\rightarrow f(\omega )\) for each \(\omega \in \Omega\), and \(|f_{n}|\leq g\) for each \(n\geq 1\) for some \(g\geq 0\) with \(\int gd\mu <\infty\), then \(\int f_{n}d\mu \rightarrow \int fd\mu\)

    • Markov’s inequality: \(P(|X|>\epsilon) \leq \frac{\Er[|X|^k]}{\epsilon^k}\) \(\forall \epsilon > 0, k > 0\)

    • Jensen’s inequality: if \(g\) is convex, then \(g(\Er[X]) \leq \Er[g(X)]\)

    • Cauchy-Schwarz inequality: \(\left(\Er[XY]\right)^2 \leq \Er[X^2] \Er[Y^2]\)

    • \(\sigma(X)\) is \(\sigma\)-field generated by \(X\), it is

      • smallest \(\sigma\)-field w.r.t. which \(X\) is measurable
      • \(\sigma(X) = \{X^{-1}(B): B \in \mathscr{B}(\R)\}\)
    • \(\sigma(W) \subset \sigma(X)\) iff \(\exists\) \(g\) s.t. \(W = g(X)\)

    • Events \(A_1, ..., A_m\) are independent if for any sub-collection \(A_{i_1}, ..., A_{i_s}\) \[ P\left(\cap_{j=1}^s A_{i_j}\right) = \prod_{j=1}^s P(A_{i_j}) \] \(\sigma\)-fields are independent if this is true for any events from them. Random variables are independent if their \(\sigma\)-fields are.

    • Conditional expectation of \(Y\) given \(\sigma\)-field \(\mathscr{G}\) satisfies \(\int_A \Er[Y|\mathscr{G}] dP = \int_A Y dP\) \(\forall A \in \mathscr{G}\)

  • Identification \(X\) observed, distribution \(P_X\), probability model \(\mathcal{P}\)

    • \(\theta_0 \in \R^k\) is identified in \(\mathcal{P}\) if there exists a known \(\psi: \mathcal{P} \to \R^k\) s.t. \(\theta_0 = \psi(P_X)\)
    • \(\mathcal{P} = \{ P(\cdot; s) : s \in S \}\), two structures \(s\) and \(\tilde{s}\) in \(S\) are observationally equivalent if they imply the same distribution for the observed data, i.e. \[ P(B;s) = P(B; \tilde{s}) \] for all \(B \in \sigma(X)\).
    • Let \(\lambda: S \to \R^k\), \(\theta\) is observationally equivalent to \(\tilde{\theta}\) if \(\exists s, \tilde{s} \in S\) that are observationally equivalent and \(\theta = \lambda(s)\) and \(\tilde{\theta} = \lambda(\tilde{s})\)
    • \(s_0 \in S\) is identified if there is no \(s \neq s_0\) that is observationally equivalent to \(s_0\)
    • \(\theta_0\) is identified (in \(S\)) if there is no observationally equivalent \(\theta \neq \theta_0\)
  • Cramér-Rao Bound: in the parametric model \(P_X \in \{P_\theta: \theta \in \R^d\}\) with likelihood \(\ell(\theta;x)\), if appropriate derivatives and integrals can be interchanged, then for any unbiased estimator \(\tau(X)\), \[ \var_\theta(\tau(X)) \geq I(\theta)^{-1} \] where \(I(\theta) = \int s(x,\theta) s(x,\theta)' dP_\theta(x) = -\Er\left[\frac{\partial^2 \log \ell(\theta;x)}{\partial \theta \partial \theta'}\right]\) and \(s(x,\theta) = \frac{\partial \log \ell(\theta;x)}{\partial \theta}\)

  • Hypothesis testing:

    • \(P(\text{reject } H_0 | P_x \in \mathcal{P}_0)\)=Type I error rate \(=P_x(C)\)
    • \(P(\text{fail to reject } H_0 | P_x \in \mathcal{P}_1)\)=Type II error rate
    • \(P(\text{reject } H_0 | P_x \in \mathcal{P}_1)\) = power
    • \(\sup_{P_x \in \mathcal{P}_0} P_x(C)\) = size of test
    • Neyman-Pearson Lemma: Let \(\Theta = \{0, 1\}\), \(f_0\) and \(f_1\) be densities of \(P_0\) and \(P_1\), \(\tau(x) =f_1(x)/f_0(x)\) and \(C^* =\{x \in X: \tau(x) > c\}\). Then among all tests \(C\) s.t. \(P_0(C) = P_0(C^*)\), \(C^*\) is most powerful.
  • Projection: \(P_L y \in L\) is the projection of \(y\) on \(L\) if \[ \norm{y - P_L y } = \inf_{w \in L} \norm{y - w} \]

    1. \(P_L y\) exists, is unique, and is a linear function of \(y\)
    2. For any \(y_1^* \in L\), \(y_1^* = P_L y\) iff \(y- y_1^* \perp L\)
    3. \(G = P_L\) iff \(Gy = y\) \(\forall y \in L\) and \(Gy = 0\) \(\forall y \in L^\perp\)
    4. Linear \(G: V \to V\) is a projection map onto its range, \(\mathcal{R}(G)\), iff \(G\) is idempotent and symmetric.
  • Gauss-Markov: \(Y = \theta + u\) with \(\theta \in L \subset \R^n\), a known subspace. If \(\Er[u] = 0\) and \(\Er[uu'] = \sigma^2 I_n\), then the best linear unbiased estimator (BLUE) of \(a'\theta\) is \(a'\hat{\theta}\), where \(\hat{\theta} = P_L Y\)