14 - Panel Data Regressions

econ 490

panel data

regression

fixed-effects

random-effects

heteroskedasticity

serial correlation

causality

In this notebook, we go over panel data. We look into what it is, how to run regressions with panel data, as well as fixed and random-effects models. We finish by looking at some common mistakes when using panel data.

Author

Marina Adshade, Paul Corcuera, Giulia Lo Forte, Jane Platt

Published

29 May 2024

Prerequisites

Run OLS Regressions.

Learning Outcomes

Prepare data for time-series analysis.
Run panel data regressions.
Create lagged variables.
Understand and work with fixed-effects.
Correct for heteroskedasticity and serial correlation.

15.0 Intro

This module uses the Penn World Tables which measure income, input, output, and productivity, covering 183 countries between 1950 and 2019. Before beginning this module, download this data in the .dta format.

14.1 What is Panel Data?

In economics, we typically have data consisting of many units observed at a particular point in time. This is called cross-sectional data. There may be several different versions of the data set that are collected over time (monthly, annually, etc.), but each version includes an entirely different set of individuals.

For example, let’s consider a Canadian cross-sectional data set: General Social Survey Cycle 31: Family, 2017. In this data set, the first observation is a 55 year old married woman who lives in Alberta with two children. When the General Social Survey Cycle 25: Family, 2011 was collected six years earlier, there were probably similar women surveyed, but it is extremely unlikely that this exact same woman was included in that data set as well. Even if she was included, we would have no way to match her data over the two years of the survey.

Cross-sectional data allows us to explore variation between individuals at one point in time but does not allow us to explore variation over time for those same individuals.

Time-series data sets contain observations over several years for only one unit, such as country, state, province, etc. For example, measures of income, output, unemployment, and fertility for Canada from 1960 to 2020 would be considered time-series data. Time-series data allows us to explore variation over time for one individual unit (e.g. Canada), but does not allow us to explore variation between individual units (i.e. multiple countries) at any one point in time.

Panel data allows us to observe the same unit across multiple time periods. For example, the Penn World Tables is a panel data set that measures income, output, input, and productivity, covering 183 countries from 1950 to the near present. There are also microdata panel data sets that follow the same people over time. One example is the Canadian National Longitudinal Survey of Children and Youth (NLSCY), which followed the same children from 1994 to 2010, surveying them every two years as they progressed from childhood to adulthood.

Panel data sets allow us to answer questions that we cannot answer with time-series and cross-sectional data. They allow us to simultaneously explore variation over time for individual countries (for example) and variation between individuals at one point in time. This approach is extremely productive for two reasons:

Panel data sets are large, much larger than if we were to use data collected at one point in time.
Panel data regressions control for variables that do not change over time and are difficult to measure, such as geography and culture.

In this sense, panel data sets allow us to answer empirical questions that cannot be answered with other types of data such as cross-sectional or time-series data.

Before we move forward exploring panel data sets in this module, we should understand the two main types of panel data:

A Balanced Panel is a panel data set in which we observe all units over all included time periods. Suppose we have a data set following the school outcomes of a select group of \(N\) children over \(T\) years. This is common in studies which investigate the effects of early childhood interventions on relevant outcomes over time. If the panel data set is balanced, we will see \(T\) observations for each child corresponding to the \(T\) years they have been tracked. As a result, our data set in total will have \(n = N*T\) observations.
An Unbalanced Panel is a panel data set in which we do not observe all units over all included time periods. Suppose in our data set tracking select children’s education outcomes over time, and that some children drop out of the study. This panel data set would be an unbalanced panel because it would necessarily have \(n < N*T\) observations, since the children who dropped out would not have observations for the years they were no longer in the study.

We learned the techniques to create a balanced panel in Module 6. Essentially, all that is needed is to create a new data set that includes only the years for which there are no missing values.

14.2 Preparing Our Data for Panel Analysis

The first step in any panel data analysis is to identify which variable is the panel variable and which variable is the time variable. The panel variable is the identifier of the units that are observed over time. The second step is indicating that information to R.

We are going to use the Penn World Data (discussed above) in this example. In that data set, the panel variable is either country or countrycode, and the time variable is year.

# Clear the memory from any pre-existing objects
rm(list=ls())

# Load packages
library(dplyr)

Warning: package 'dplyr' was built under R version 4.4.3


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(tidyr)

Warning: package 'tidyr' was built under R version 4.4.2

library(haven)

Warning: package 'haven' was built under R version 4.4.2

# Import data (remember to change directory to the location of this data file)
#setwd()
pwt100 <- read_dta("../econ490-r/pwt100.dta")  #change me!

# Get summary of the data
summary(pwt100)

 countrycode          country          currency_unit           year     
 Length:12810       Length:12810       Length:12810       Min.   :1950  
 Class :character   Class :character   Class :character   1st Qu.:1967  
 Mode  :character   Mode  :character   Mode  :character   Median :1984  
                                                          Mean   :1984  
                                                          3rd Qu.:2002  
                                                          Max.   :2019  
                                                                        
     rgdpe              rgdpo               pop                 emp         
 Min.   :      20   Min.   :      28   Min.   :   0.0044   Min.   :  0.001  
 1st Qu.:    6802   1st Qu.:    7192   1st Qu.:   1.5797   1st Qu.:  0.775  
 Median :   30319   Median :   30844   Median :   6.1507   Median :  2.856  
 Mean   :  304852   Mean   :  307080   Mean   :  30.9630   Mean   : 14.171  
 3rd Qu.:  155974   3rd Qu.:  158739   3rd Qu.:  19.9342   3rd Qu.:  8.266  
 Max.   :20860506   Max.   :20595844   Max.   :1433.7837   Max.   :799.307  
 NA's   :2411       NA's   :2411       NA's   :2411        NA's   :3281     
      avh             hc             ccon               cda          
 Min.   :1381   Min.   :1.007   Min.   :      16   Min.   :      22  
 1st Qu.:1788   1st Qu.:1.450   1st Qu.:    5893   1st Qu.:    7194  
 Median :1972   Median :1.988   Median :   24654   Median :   31349  
 Mean   :1987   Mean   :2.087   Mean   :  224947   Mean   :  304946  
 3rd Qu.:2168   3rd Qu.:2.674   3rd Qu.:  113397   3rd Qu.:  153230  
 Max.   :3040   Max.   :4.352   Max.   :16826236   Max.   :21383552  
 NA's   :9318   NA's   :4173    NA's   :2411       NA's   :2411      
     cgdpe              cgdpo                cn                  ck       
 Min.   :      20   Min.   :      17   Min.   :       21   Min.   :0.000  
 1st Qu.:    6690   1st Qu.:    6812   1st Qu.:    14870   1st Qu.:0.001  
 Median :   30318   Median :   30497   Median :    84026   Median :0.004  
 Mean   :  306140   Mean   :  306162   Mean   :  1205994   Mean   :0.036  
 3rd Qu.:  155924   3rd Qu.:  155925   3rd Qu.:   544489   3rd Qu.:0.023  
 Max.   :20791364   Max.   :20566034   Max.   :101544168   Max.   :1.167  
 NA's   :2411       NA's   :2415       NA's   :2496        NA's   :5715   
      ctfp           cwtfp           rgdpna             rconna        
 Min.   :0.031   Min.   :0.073   Min.   :      14   Min.   :      19  
 1st Qu.:0.495   1st Qu.:0.504   1st Qu.:    7463   1st Qu.:    6425  
 Median :0.711   Median :0.703   Median :   33576   Median :   27671  
 Mean   :0.711   Mean   :0.702   Mean   :  327199   Mean   :  238597  
 3rd Qu.:0.886   3rd Qu.:0.860   3rd Qu.:  184404   3rd Qu.:  132622  
 Max.   :3.394   Max.   :2.016   Max.   :20572606   Max.   :16803152  
 NA's   :6398    NA's   :6398    NA's   :2411       NA's   :2411      
     rdana               rnna                rkna           rtfpna     
 Min.   :      24   Min.   :       28   Min.   :0.005   Min.   :0.200  
 1st Qu.:    7529   1st Qu.:    24335   1st Qu.:0.206   1st Qu.:0.794  
 Median :   33475   Median :   124997   Median :0.454   Median :0.951  
 Mean   :  317100   Mean   :  1447563   Mean   :0.488   Mean   :0.962  
 3rd Qu.:  178375   3rd Qu.:   747234   3rd Qu.:0.748   3rd Qu.:1.043  
 Max.   :21638008   Max.   :101703024   Max.   :1.975   Max.   :8.121  
 NA's   :2411       NA's   :2496        NA's   :5715    NA's   :6398   
    rwtfpna          labsh            irr            delta       
 Min.   :0.093   Min.   :0.090   Min.   :0.010   Min.   :0.0125  
 1st Qu.:0.795   1st Qu.:0.455   1st Qu.:0.063   1st Qu.:0.0336  
 Median :0.943   Median :0.537   Median :0.102   Median :0.0394  
 Mean   :0.945   Mean   :0.533   Mean   :0.124   Mean   :0.0421  
 3rd Qu.:1.036   3rd Qu.:0.623   3rd Qu.:0.159   3rd Qu.:0.0484  
 Max.   :3.571   Max.   :0.903   Max.   :1.096   Max.   :0.1000  
 NA's   :6398    NA's   :4840    NA's   :5270    NA's   :2496    
       xr               pl_con            pl_da            pl_gdpo        
 Min.   :       0   Min.   : 0.0181   Min.   : 0.0150   Min.   :-24.7077  
 1st Qu.:       1   1st Qu.: 0.1571   1st Qu.: 0.1601   1st Qu.:  0.1574  
 Median :       3   Median : 0.2855   Median : 0.2999   Median :  0.2921  
 Mean   :    7647   Mean   : 0.3568   Mean   : 0.3589   Mean   :  0.3708  
 3rd Qu.:      43   3rd Qu.: 0.4635   3rd Qu.: 0.4706   3rd Qu.:  0.4704  
 Max.   :76369942   Max.   :22.3648   Max.   :22.1145   Max.   : 24.1948  
 NA's   :2411       NA's   :2411      NA's   :2411      NA's   :2411      
     i_cig             i_xm             i_xr          i_outlier    
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000  
 Median :0.0000   Median :1.0000   Median :0.0000   Median :0.000  
 Mean   :0.9099   Mean   :0.5803   Mean   :0.0567   Mean   :0.028  
 3rd Qu.:2.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.000  
 Max.   :4.0000   Max.   :2.0000   Max.   :1.0000   Max.   :1.000  
 NA's   :2411     NA's   :2411     NA's   :2411     NA's   :2411   
     i_irr          cor_exp          statcap          csh_c         
 Min.   :0.000   Min.   :-0.058   Min.   :14.44   Min.   :-11.8000  
 1st Qu.:0.000   1st Qu.: 0.425   1st Qu.:56.67   1st Qu.:  0.5358  
 Median :0.000   Median : 0.572   Median :67.78   Median :  0.6372  
 Mean   :0.143   Mean   : 0.558   Mean   :67.37   Mean   :  0.6433  
 3rd Qu.:0.000   3rd Qu.: 0.707   3rd Qu.:78.89   3rd Qu.:  0.7527  
 Max.   :3.000   Max.   : 1.000   Max.   :98.89   Max.   : 11.0561  
 NA's   :5270    NA's   :11157    NA's   :10693   NA's   :2411      
     csh_i             csh_g             csh_x             csh_m         
 Min.   :-2.9530   Min.   :-2.6147   Min.   :-1.9374   Min.   :-23.2376  
 1st Qu.: 0.1361   1st Qu.: 0.1233   1st Qu.: 0.0670   1st Qu.: -0.3796  
 Median : 0.2068   Median : 0.1695   Median : 0.1401   Median : -0.2003  
 Mean   : 0.2192   Mean   : 0.1912   Mean   : 0.2293   Mean   : -0.3008  
 3rd Qu.: 0.2766   3rd Qu.: 0.2343   3rd Qu.: 0.3003   3rd Qu.: -0.1027  
 Max.   : 8.1228   Max.   : 4.2894   Max.   : 3.5235   Max.   : 32.8740  
 NA's   :2411      NA's   :2411      NA's   :2411      NA's   :2411      
     csh_r               pl_c              pl_i              pl_g        
 Min.   :-12.5690   Min.   : 0.0156   Min.   : 0.0060   Min.   : 0.0093  
 1st Qu.: -0.0252   1st Qu.: 0.1712   1st Qu.: 0.1920   1st Qu.: 0.1161  
 Median :  0.0003   Median : 0.3063   Median : 0.3779   Median : 0.2446  
 Mean   :  0.0178   Mean   : 0.3709   Mean   : 0.4240   Mean   : 0.3456  
 3rd Qu.:  0.0445   3rd Qu.: 0.4845   3rd Qu.: 0.5571   3rd Qu.: 0.4530  
 Max.   :  7.5983   Max.   :23.1228   Max.   :34.4450   Max.   :18.4208  
 NA's   :2411       NA's   :2411      NA's   :2411      NA's   :2411     
      pl_x             pl_m             pl_n              pl_k       
 Min.   :0.0074   Min.   :0.0208   Min.   : 0.0130   Min.   : 0.064  
 1st Qu.:0.2377   1st Qu.:0.2408   1st Qu.: 0.1646   1st Qu.: 0.651  
 Median :0.4439   Median :0.4529   Median : 0.2959   Median : 0.955  
 Mean   :0.4098   Mean   :0.4034   Mean   : 0.3591   Mean   : 1.334  
 3rd Qu.:0.5570   3rd Qu.:0.5411   3rd Qu.: 0.4478   3rd Qu.: 1.415  
 Max.   :2.0561   Max.   :4.9904   Max.   :20.6492   Max.   :31.933  
 NA's   :2411     NA's   :2411     NA's   :2496      NA's   :5715

You may have noticed that the variable year is an integer (i.e. a number like 2010) and that country and countrycode are character variables (i.e. they are words like “Canada”). Specifying the panel and time variables requires that both of the variables we are using are coded as numeric variables. Moireover, we need to sort our data by the unique identifier (country or countrycode in our case) and tme variable (year).

# Order data according to countrycode and year, and call it df
df <- pwt100 %>% arrange(countrycode, year)

Now that we have sorted our data, we need to tell R that the data frame df contains panel data. We do so by relying on the package plm, a package containing various tools for Linear Models for Panel data. We load the package plm and use the pdata.frame() function to create a panel data frame. In the argument index of the function pdata.frame() we have to specify the name of the cross-sectional unit identifier (countrycode) and the time variable (year).

# Install and load plm package
#uncomment to install the package! install.packages("plm")
library(plm)


Attaching package: 'plm'

The following objects are masked from 'package:dplyr':

    between, lag, lead

# Convert dataframe to panel data format
panel_data <- pdata.frame(df, index=c("countrycode", "year"))

To check that we have correctly converted our data in a panel data frame, we can use the class or the pdim functions. Note that pdim tells us if our data frame is balanced or not, as well as the number of cross-sectional unit identifiers and time periods.

class(panel_data)

[1] "pdata.frame" "data.frame"

pdim(panel_data)

Balanced Panel: n = 183, T = 70, N = 12810

14.3 Basic Regressions with Panel Data

For now, we are going to focus on the skills we need to run our own panel data regressions. In section 14.6, there are more details about the econometrics of panel data regressions that may help with the understanding of these approaches. Please make sure you understand that theory before beginning your own research.

Now that we have specified the panel and time variables we are working with, we can begin to run regressions using our panel data. For panel data regressions, we simply replace lm with the command plm. The command plm takes another input, model. We can specify model to be fixed effect, random effect, or a pooled OLS. For now, let’s use a pooled OLS with model="pooling". More details on the other models will be addressed below.

Let’s try this out by regressing the natural log of GDP per capita on the natural log of human capital.

# Create the two new variables
panel_data <- panel_data %>% mutate(lngdp = log(rgdpo/pop), lnhc = log(hc))

# Estimate specification
model <- plm(lngdp ~ lnhc, data = panel_data, model = "pooling")
summary(model)

Pooling Model

Call:
plm(formula = lngdp ~ lnhc, data = panel_data, model = "pooling")

Unbalanced Panel: n = 145, T = 30-70, N = 8637

Residuals:
      Min.    1st Qu.     Median    3rd Qu.       Max. 
-4.2715700 -0.4990806  0.0033994  0.4243113  4.7866659 

Coefficients:
            Estimate Std. Error t-value  Pr(>|t|)    
(Intercept) 6.979568   0.017750  393.22 < 2.2e-16 ***
lnhc        2.652185   0.023274  113.95 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    13012
Residual Sum of Squares: 5196.9
R-Squared:      0.60061
Adj. R-Squared: 0.60057
F-statistic: 12985.7 on 1 and 8635 DF, p-value: < 2.22e-16

The coefficients in a panel regression are interpreted similarly to those in a basic OLS regression. Because we have taken the natural log of our variables, we can interpret the coefficient on each explanatory variable as being a \(\beta\) % increase in the dependent variable associated with a 1% increase in the explanatory variable.

Thus, in the regression results above, a 1% increase in human capital leads to a roughly 2% increase in real GDP per capita. That’s a huge effect, but then again this model is almost certainly misspecified due to omitted variable bias. Namely, we are likely missing a number of explanatory variables that explain variation in both GDP per capita and human capital, such as savings and population growth rates.

One thing we know is that GDP per capita can be impacted by the individual characteristics of a country that do not change much over time. For example, it is known that distance from the equator has an impact on the standard of living of a country; countries that are closer to the equator are generally poorer than those farther from it. This is a time-invariant characteristic that we might want to control for in our regression. Similarly, we know that GDP per capita could be similarly impacted in many countries by a shock at one point in time. For example, a worldwide global recession would affect the GDP per capita of all countries at a given time such that values of GDP per capita in this time period are uniformly different in all countries from values in other periods. That seems like a time-variant characteristic (time trend) that we might want to control for in our regression. Fortunately, with panel data regressions, we can account for these sources of endogeneity. Let’s look at how panel data helps us do this.

14.3.1 Fixed-Effects Models

We refer to shocks that are invariant based on some variable (e.g. household level shocks that don’t vary with year or time-specific shocks that don’t vary with household) as fixed-effects. For instance, we can define household fixed-effects, time fixed-effects, and so on. Notice that this is an assumption on the error terms, and as such, when we include fixed-effects to our specification they become part of the model we assume to be true.

When we ran our regression of log real GDP per capita on log human capital from earlier, we were concerned about omitted variable bias and endogeneity. Specifically, we were concerned about distance from the equator positively impacting both human capital and real GDP per capita, in which case our measure of human capital would be correlated with our error term, preventing us from interpreting our regression result as causal. We are now able to add country fixed-effects to our regression to account for this and come closer to determining the pure effect of human capital on GDP growth. There are two ways to do this. Let’s look at the more obvious one first.

Approach 1: create a series of country dummy variables and include them in the regression. For example, we would have one dummy variable called “Canada” that would be equal to 1 if the country is Canada and 0 if not. We would have dummy variables for all but one of the countries in this data set to avoid perfect collinearity. Rather than define all of these dummies manually and include them in our regression command, we can simply factorize them and R will include them automatically.

# Factorize countrycode
panel_data <- panel_data %>% mutate(countrycode = factor(countrycode))

Now we can add the factorized version of country codes to our panel linear model.

model <- plm(lngdp ~ lnhc + countrycode, data = panel_data, model = "pooling")
summary(model)

Pooling Model

Call:
plm(formula = lngdp ~ lnhc + countrycode, data = panel_data, 
    model = "pooling")

Unbalanced Panel: n = 145, T = 30-70, N = 8637

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max. 
-4.484482 -0.223988  0.011389  0.235216  1.771284 

Coefficients:
                 Estimate Std. Error  t-value  Pr(>|t|)    
(Intercept)     7.9138700  0.0557836 141.8672 < 2.2e-16 ***
lnhc            2.0725367  0.0228576  90.6718 < 2.2e-16 ***
countrycodeALB -1.1675850  0.0801743 -14.5631 < 2.2e-16 ***
countrycodeARE  2.2663659  0.0796921  28.4390 < 2.2e-16 ***
countrycodeARG -0.8561933  0.0743902 -11.5095 < 2.2e-16 ***
countrycodeARM -1.4855325  0.0932710 -15.9271 < 2.2e-16 ***
countrycodeAUS -0.0942301  0.0760690  -1.2387 0.2154737    
countrycodeAUT -0.1109967  0.0753938  -1.4722 0.1409969    
countrycodeBDI -1.5205658  0.0752978 -20.1940 < 2.2e-16 ***
countrycodeBEL  0.0601811  0.0749744   0.8027 0.4221768    
countrycodeBEN -0.8578009  0.0750248 -11.4336 < 2.2e-16 ***
countrycodeBFA -0.8977980  0.0753508 -11.9149 < 2.2e-16 ***
countrycodeBGD -1.2420033  0.0751348 -16.5303 < 2.2e-16 ***
countrycodeBGR -0.7604451  0.0809396  -9.3952 < 2.2e-16 ***
countrycodeBHR  0.8210965  0.0794591  10.3336 < 2.2e-16 ***
countrycodeBLZ -1.3435306  0.0809110 -16.6050 < 2.2e-16 ***
countrycodeBOL -1.3474223  0.0736960 -18.2835 < 2.2e-16 ***
countrycodeBRA -0.4142285  0.0733045  -5.6508 1.649e-08 ***
countrycodeBRB -0.1897636  0.0770783  -2.4620 0.0138378 *  
countrycodeBRN  1.5033023  0.0800975  18.7684 < 2.2e-16 ***
countrycodeBWA -0.8169212  0.0759202 -10.7603 < 2.2e-16 ***
countrycodeCAF -1.3071977  0.0752984 -17.3602 < 2.2e-16 ***
countrycodeCAN -0.0235711  0.0759494  -0.3104 0.7563007    
countrycodeCHE  0.1159737  0.0763893   1.5182 0.1290031    
countrycodeCHL -0.6640074  0.0747736  -8.8802 < 2.2e-16 ***
countrycodeCHN -1.2519404  0.0738216 -16.9590 < 2.2e-16 ***
countrycodeCIV -0.5281155  0.0753013  -7.0134 2.505e-12 ***
countrycodeCMR -0.9435661  0.0754434 -12.5069 < 2.2e-16 ***
countrycodeCOD -1.1363224  0.0728257 -15.6033 < 2.2e-16 ***
countrycodeCOG -0.7672163  0.0755483 -10.1553 < 2.2e-16 ***
countrycodeCOL -0.4126523  0.0735289  -5.6121 2.062e-08 ***
countrycodeCRI -0.3754629  0.0737845  -5.0886 3.684e-07 ***
countrycodeCYP -0.0297053  0.0740581  -0.4011 0.6883508    
countrycodeCZE -0.3349684  0.0940311  -3.5623 0.0003696 ***
countrycodeDEU -0.3038157  0.0760558  -3.9946 6.534e-05 ***
countrycodeDNK -0.0638861  0.0757018  -0.8439 0.3987393    
countrycodeDOM -0.5496053  0.0736276  -7.4647 9.177e-14 ***
countrycodeDZA  0.4004324  0.0755001   5.3037 1.163e-07 ***
countrycodeECU -0.6859266  0.0738495  -9.2882 < 2.2e-16 ***
countrycodeEGY -1.0284459  0.0730225 -14.0840 < 2.2e-16 ***
countrycodeESP  0.0026080  0.0743050   0.0351 0.9720016    
countrycodeEST -0.5866866  0.0936584  -6.2641 3.932e-10 ***
countrycodeETH -1.4629725  0.0753127 -19.4253 < 2.2e-16 ***
countrycodeFIN -0.0552750  0.0751749  -0.7353 0.4621861    
countrycodeFJI -0.7626917  0.0763733  -9.9864 < 2.2e-16 ***
countrycodeFRA  0.1069984  0.0748937   1.4287 0.1531357    
countrycodeGAB  0.2872629  0.0756826   3.7956 0.0001483 ***
countrycodeGBR -0.2046963  0.0758355  -2.6992 0.0069641 ** 
countrycodeGHA -0.7666578  0.0743406 -10.3128 < 2.2e-16 ***
countrycodeGMB -0.3783083  0.0752947  -5.0244 5.154e-07 ***
countrycodeGRC -0.1195681  0.0746092  -1.6026 0.1090619    
countrycodeGTM -0.3600826  0.0729039  -4.9391 7.996e-07 ***
countrycodeGUY -1.0711044  0.0800011 -13.3886 < 2.2e-16 ***
countrycodeHKG  0.3536698  0.0770520   4.5900 4.496e-06 ***
countrycodeHND -0.9253552  0.0731616 -12.6481 < 2.2e-16 ***
countrycodeHRV -0.5280944  0.0932547  -5.6629 1.537e-08 ***
countrycodeHTI -1.1533324  0.0753401 -15.3084 < 2.2e-16 ***
countrycodeHUN -0.5205110  0.0811291  -6.4158 1.475e-10 ***
countrycodeIDN -1.0262548  0.0758132 -13.5366 < 2.2e-16 ***
countrycodeIND -1.1806809  0.0729240 -16.1906 < 2.2e-16 ***
countrycodeIRL -0.1043487  0.0749147  -1.3929 0.1636864    
countrycodeIRN  0.1575228  0.0740786   2.1264 0.0334964 *  
countrycodeIRQ -0.0331738  0.0789483  -0.4202 0.6743529    
countrycodeISL  0.3615297  0.0746757   4.8413 1.313e-06 ***
countrycodeISR -0.2988861  0.0756130  -3.9528 7.786e-05 ***
countrycodeITA  0.1210816  0.0744401   1.6266 0.1038666    
countrycodeJAM -0.8681193  0.0747782 -11.6093 < 2.2e-16 ***
countrycodeJOR -0.7779743  0.0743328 -10.4661 < 2.2e-16 ***
countrycodeJPN -0.4223238  0.0757315  -5.5766 2.528e-08 ***
countrycodeKAZ -0.7565803  0.0932048  -8.1174 5.430e-16 ***
countrycodeKEN -1.1797023  0.0730714 -16.1445 < 2.2e-16 ***
countrycodeKGZ -1.9247221  0.0931645 -20.6594 < 2.2e-16 ***
countrycodeKHM -1.3891325  0.0787891 -17.6310 < 2.2e-16 ***
countrycodeKOR -0.8586359  0.0754249 -11.3840 < 2.2e-16 ***
countrycodeKWT  1.6825944  0.0793737  21.1984 < 2.2e-16 ***
countrycodeLAO -1.2505977  0.0788503 -15.8604 < 2.2e-16 ***
countrycodeLBR -1.5649421  0.0765941 -20.4316 < 2.2e-16 ***
countrycodeLKA -1.2292810  0.0741029 -16.5888 < 2.2e-16 ***
countrycodeLSO -1.4068444  0.0757763 -18.5658 < 2.2e-16 ***
countrycodeLTU -0.4579487  0.0931202  -4.9178 8.915e-07 ***
countrycodeLUX  0.6663585  0.0745599   8.9372 < 2.2e-16 ***
countrycodeLVA -0.4205204  0.0929532  -4.5240 6.151e-06 ***
countrycodeMAC  0.8761202  0.0796854  10.9947 < 2.2e-16 ***
countrycodeMAR -0.3318477  0.0728292  -4.5565 5.273e-06 ***
countrycodeMDA -1.7135121  0.0930228 -18.4203 < 2.2e-16 ***
countrycodeMDG -1.2894520  0.0753642 -17.1096 < 2.2e-16 ***
countrycodeMDV -0.1380270  0.0790971  -1.7450 0.0810154 .  
countrycodeMEX -0.0596503  0.0737608  -0.8087 0.4187113    
countrycodeMLI -1.2177866  0.0753184 -16.1685 < 2.2e-16 ***
countrycodeMLT -0.6047980  0.0752351  -8.0388 1.029e-15 ***
countrycodeMMR -1.4704498  0.0759577 -19.3588 < 2.2e-16 ***
countrycodeMNG -1.5502639  0.0801544 -19.3410 < 2.2e-16 ***
countrycodeMOZ -1.4025134  0.0753100 -18.6232 < 2.2e-16 ***
countrycodeMRT -0.4266716  0.0753664  -5.6613 1.551e-08 ***
countrycodeMUS -0.0791481  0.0734580  -1.0775 0.2813047    
countrycodeMWI -1.6812900  0.0738461 -22.7675 < 2.2e-16 ***
countrycodeMYS -0.3608377  0.0749055  -4.8172 1.481e-06 ***
countrycodeNAM -0.4384732  0.0759446  -5.7736 8.033e-09 ***
countrycodeNER -0.8025277  0.0753428 -10.6517 < 2.2e-16 ***
countrycodeNGA -0.8191766  0.0753410 -10.8729 < 2.2e-16 ***
countrycodeNIC -0.4132996  0.0731293  -5.6516 1.641e-08 ***
countrycodeNLD  0.0442161  0.0754439   0.5861 0.5578383    
countrycodeNOR  0.0543472  0.0758195   0.7168 0.4735190    
countrycodeNPL -1.2820333  0.0753016 -17.0253 < 2.2e-16 ***
countrycodeNZL -0.3006864  0.0759628  -3.9583 7.609e-05 ***
countrycodePAK -0.7906197  0.0728637 -10.8507 < 2.2e-16 ***
countrycodePAN -0.6960320  0.0740998  -9.3932 < 2.2e-16 ***
countrycodePER -0.9140057  0.0737523 -12.3929 < 2.2e-16 ***
countrycodePHL -1.2087071  0.0737143 -16.3972 < 2.2e-16 ***
countrycodePOL -0.6659559  0.0810286  -8.2188 2.359e-16 ***
countrycodePRT  0.3215890  0.0732968   4.3875 1.160e-05 ***
countrycodePRY -0.8082199  0.0737547 -10.9582 < 2.2e-16 ***
countrycodeQAT  1.7166253  0.0796820  21.5435 < 2.2e-16 ***
countrycodeROU -0.9929519  0.0772920 -12.8468 < 2.2e-16 ***
countrycodeRUS -0.5689842  0.0933972  -6.0921 1.163e-09 ***
countrycodeRWA -1.3701320  0.0753120 -18.1928 < 2.2e-16 ***
countrycodeSAU  0.9389692  0.0795731  11.8001 < 2.2e-16 ***
countrycodeSDN -0.6274889  0.0786598  -7.9773 1.691e-15 ***
countrycodeSEN -0.4137599  0.0752945  -5.4952 4.015e-08 ***
countrycodeSGP  0.4528529  0.0765040   5.9193 3.358e-09 ***
countrycodeSLE -1.1958999  0.0755900 -15.8209 < 2.2e-16 ***
countrycodeSLV -1.4832569  0.0730552 -20.3032 < 2.2e-16 ***
countrycodeSRB -0.9042735  0.0930796  -9.7151 < 2.2e-16 ***
countrycodeSVK -0.5414364  0.0938912  -5.7666 8.370e-09 ***
countrycodeSVN -0.2446126  0.0936807  -2.6111 0.0090402 ** 
countrycodeSWE  0.0014721  0.0755478   0.0195 0.9844540    
countrycodeSWZ -0.2669640  0.0789890  -3.3798 0.0007288 ***
countrycodeSYR -1.1126709  0.0757991 -14.6792 < 2.2e-16 ***
countrycodeTGO -1.1820787  0.0753780 -15.6820 < 2.2e-16 ***
countrycodeTHA -0.7933917  0.0733773 -10.8125 < 2.2e-16 ***
countrycodeTJK -2.3864821  0.0932714 -25.5864 < 2.2e-16 ***
countrycodeTTO -0.0485306  0.0744453  -0.6519 0.5144856    
countrycodeTUN -0.2149053  0.0755464  -2.8447 0.0044563 ** 
countrycodeTUR  0.2710639  0.0731255   3.7068 0.0002112 ***
countrycodeTWN -0.0768686  0.0742117  -1.0358 0.3003239    
countrycodeTZA -1.2984080  0.0753746 -17.2261 < 2.2e-16 ***
countrycodeUGA -1.6358949  0.0729136 -22.4361 < 2.2e-16 ***
countrycodeUKR -1.2238056  0.0933087 -13.1157 < 2.2e-16 ***
countrycodeURY -0.2971627  0.0740501  -4.0130 6.047e-05 ***
countrycodeUSA  0.0665249  0.0762303   0.8727 0.3828604    
countrycodeVEN -0.1055409  0.0733484  -1.4389 0.1502159    
countrycodeVNM -1.6612024  0.0794020 -20.9214 < 2.2e-16 ***
countrycodeYEM -0.9614946  0.0899015 -10.6950 < 2.2e-16 ***
countrycodeZAF -0.2013623  0.0736927  -2.7325 0.0062993 ** 
countrycodeZMB -1.5310631  0.0744627 -20.5615 < 2.2e-16 ***
countrycodeZWE -1.0478764  0.0741844 -14.1253 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    13012
Residual Sum of Squares: 1312.8
R-Squared:      0.89911
Adj. R-Squared: 0.89739
F-statistic: 521.85 on 145 and 8491 DF, p-value: < 2.22e-16

The problem with this approach is that we end up with a huge table containing the coefficients of every country dummy, none of which we care about. We are interested in the relationship between GDP and human capital, not the mean values of GDP for each country relative to the omitted one. Luckily for us, a well-known result is that controlling for fixed-effects is equivalent to adding multiple dummy variables. This leads us into the second approach to including fixed-effects in a regression.

Approach 2: We can alternatively apply fixed affects to the regression by adding model="within" as an option on the regression.

model <- plm(lngdp ~ lnhc, data = panel_data, model = "within")
summary(model)

Oneway (individual) effect Within Model

Call:
plm(formula = lngdp ~ lnhc, data = panel_data, model = "within")

Unbalanced Panel: n = 145, T = 30-70, N = 8637

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max. 
-4.484482 -0.223988  0.011389  0.235216  1.771284 

Coefficients:
     Estimate Std. Error t-value  Pr(>|t|)    
lnhc 2.072537   0.022858  90.672 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    2584
Residual Sum of Squares: 1312.8
R-Squared:      0.49193
Adj. R-Squared: 0.48326
F-statistic: 8221.37 on 1 and 8491 DF, p-value: < 2.22e-16

We obtained the same coefficient and standard errors on our explanatory variable using both approaches!

14.3.2 Random-Effects Models

One type of model we can also run is a random-effects model. The main difference between a random and fixed-effects model is that, with the random-effects model, differences across countries are assumed to be random. This allows us to treat time-invariant variables such as latitude as control variables. To run a random-effects model, just add model="random" as argument of plm.

model <- plm(lngdp ~ lnhc, data = panel_data, model = "random")
summary(model)

Oneway (individual) effect Random Effect Model 
   (Swamy-Arora's transformation)

Call:
plm(formula = lngdp ~ lnhc, data = panel_data, model = "random")

Unbalanced Panel: n = 145, T = 30-70, N = 8637

Effects:
                 var std.dev share
idiosyncratic 0.1546  0.3932 0.261
individual    0.4376  0.6615 0.739
theta:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.8921  0.9235  0.9270  0.9237  0.9291  0.9291 

Residuals:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-4.4569 -0.2345  0.0111  0.0009  0.2386  2.0128 

Coefficients:
            Estimate Std. Error z-value  Pr(>|z|)    
(Intercept) 7.343140   0.057390 127.951 < 2.2e-16 ***
lnhc        2.082794   0.022704  91.739 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    2706.4
Residual Sum of Squares: 1340.1
R-Squared:      0.50484
Adj. R-Squared: 0.50479
Chisq: 8416.01 on 1 DF, p-value: < 2.22e-16

As we can see, with this data and choice of variables, there is little difference in results between all of these models.

This, however, will not always be the case. The test to determine if you should use the fixed-effects model or the random-effects model is called the Hausman test.

To run this test in R, we first have to store the fixed-effect and the random-effect models in two different objects, one called fixed and the other called random.

fixed <- plm(lngdp ~ lnhc, data = panel_data, model = "within")
random <- plm(lngdp ~ lnhc, data = panel_data, model = "random")

Then, we perform the Hausman test by comparing the two objects fixed and random using the function phtest. Remember, the null hypothesis is that the preferred model is random-effects.

phtest(fixed, random)


    Hausman Test

data:  lngdp ~ lnhc
chisq = 14.991, df = 1, p-value = 0.000108
alternative hypothesis: one model is inconsistent

As you can see, the p-values associated with this test suggest that we would reject the null hypothesis (random effect) and that we should adopt a fixed-effects model.

14.3.3 What if We Want to Control for Multiple Fixed-Effects?

Let’s say we have run a panel data regression with fixed-effects, and we think that no more needs to be done to control for factors that are constant across our cross-sectional variables (i.e. countries) at any one point in time (i.e. years). However, for very long series (for example those over 20 years), we will want to check that time dummy variables are not also needed.

In R, we can easily do it using two functions: the pFtest() and the plmtest().

First, let’s save our models with and without time fixed-effects in two objects.

# No time fixed-effects
fixed <- plm(lngdp ~ lnhc, data = panel_data, model = "within")

# Time fixed-effects
fixed_yearfe <- plm(lngdp ~ lnhc + factor(year), data = panel_data, model = "within")

Now that we have saved both models, we can use the test. pFtest() requires us to use both models as inputs. plmtest() only needs the model without time fixed-effects as input.

# Option 1: pFtest
pFtest(fixed_yearfe, fixed)


    F test for individual effects

data:  lngdp ~ lnhc + factor(year)
F = 21.423, df1 = 69, df2 = 8422, p-value < 2.2e-16
alternative hypothesis: significant effects

# Option 2: plmtest
plmtest(fixed, c("time"), type=("bp"))


    Lagrange Multiplier Test - time effects (Breusch-Pagan)

data:  lngdp ~ lnhc
chisq = 573.17, df = 1, p-value < 2.2e-16
alternative hypothesis: significant effects

Both tests report a p-value smaller than 0.05, which suggests that we can reject the null hypothesis and need time-fixed-effects in our model.

15.4 Creating New Panel Variables

Panel data also provides us with a new source of variation: variation over time. This means that we have access to a wide variety of variables we can include. For instance, we can create lags (variables in previous periods) and leads (variables in future periods). Once we have defined our panel data set using the pdata.frame function (which we did earlier), we can create the lags using the dplyr::lag() function and the leads using the dplyr::lead() function.

Warning: Many other packages have a lag() and a lead() function. To make sure that R knows which function you want to use, specify that the source library is dplyr by writing the functions in their full names: dplyr::lag() and dplyr::lead(). Failing to do so may result in lag() and lead() not to behave as expected.

For example, let’s create a new variable that lags the natural log of GDP per capita by one period.

panel_data <- panel_data %>% mutate(lag1_lngdp = dplyr::lag(lngdp,1))

If we wanted to lag this same variable ten periods, we would write it as such:

panel_data <- panel_data %>% mutate(lag10_lngdp = dplyr::lag(lngdp,10))

Let’s inspect the first 50 rows of our data frame to check that we have created lagged variables as expected.

head(panel_data[, c("lngdp", "lag1_lngdp", "lag10_lngdp")],50)

             lngdp lag1_lngdp lag10_lngdp
ABW-1950        NA         NA          NA
ABW-1951        NA         NA          NA
ABW-1952        NA         NA          NA
ABW-1953        NA         NA          NA
ABW-1954        NA         NA          NA
ABW-1955        NA         NA          NA
ABW-1956        NA         NA          NA
ABW-1957        NA         NA          NA
ABW-1958        NA         NA          NA
ABW-1959        NA         NA          NA
ABW-1960        NA         NA          NA
ABW-1961        NA         NA          NA
ABW-1962        NA         NA          NA
ABW-1963        NA         NA          NA
ABW-1964        NA         NA          NA
ABW-1965        NA         NA          NA
ABW-1966        NA         NA          NA
ABW-1967        NA         NA          NA
ABW-1968        NA         NA          NA
ABW-1969        NA         NA          NA
ABW-1970  8.967534         NA          NA
ABW-1971  9.046721   8.967534          NA
ABW-1972  9.126887   9.046721          NA
ABW-1973  9.205403   9.126887          NA
ABW-1974  9.285424   9.205403          NA
ABW-1975  9.370663   9.285424          NA
ABW-1976  9.458719   9.370663          NA
ABW-1977  9.551412   9.458719          NA
ABW-1978  9.640923   9.551412          NA
ABW-1979  9.727948   9.640923          NA
ABW-1980  9.812438   9.727948    8.967534
ABW-1981  9.892118   9.812438    9.046721
ABW-1982  9.966508   9.892118    9.126887
ABW-1983 10.039509   9.966508    9.205403
ABW-1984 10.116726  10.039509    9.285424
ABW-1985 10.197985  10.116726    9.370663
ABW-1986 10.292268  10.197985    9.458719
ABW-1987 10.453691  10.292268    9.551412
ABW-1988 10.636786  10.453691    9.640923
ABW-1989 10.681402  10.636786    9.727948
ABW-1990 10.788750  10.681402    9.812438
ABW-1991 10.803037  10.788750    9.892118
ABW-1992 10.807625  10.803037    9.966508
ABW-1993 10.839805  10.807625   10.039509
ABW-1994 10.904202  10.839805   10.116726
ABW-1995 10.860747  10.904202   10.197985
ABW-1996 10.774814  10.860747   10.292268
ABW-1997 10.821990  10.774814   10.453691
ABW-1998 10.885563  10.821990   10.636786
ABW-1999 10.906652  10.885563   10.681402

We can include lagged variables directly in our regression if we believe that past values of real GDP per capita influence current levels of real GDP per capita.

model <- plm(lngdp ~ lnhc + lag10_lngdp, data = panel_data, model = "within")
summary(model)

Oneway (individual) effect Within Model

Call:
plm(formula = lngdp ~ lnhc + lag10_lngdp, data = panel_data, 
    model = "within")

Unbalanced Panel: n = 145, T = 20-70, N = 7851

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max. 
-4.530241 -0.176933  0.012941  0.205479  1.589685 

Coefficients:
             Estimate Std. Error t-value  Pr(>|t|)    
lnhc        1.9619949  0.0233987  83.850 < 2.2e-16 ***
lag10_lngdp 0.1932958  0.0067807  28.507 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    2205.7
Residual Sum of Squares: 964.66
R-Squared:      0.56266
Adj. R-Squared: 0.55437
F-statistic: 4955.71 on 2 and 7704 DF, p-value: < 2.22e-16

While we included lags from the previous period and 10 periods back as examples, we can use any period for our lags. In fact, including lag variables as controls for recent periods such as one lag back and two lags back is the most common choice for inclusion of past values of independent variables as controls.

Finally, these variables are useful if we are trying to measure the growth rate of a variable. Recall that the growth rate of a variable X is just equal to \(ln(X_{t}) - ln(X_{t-1})\) where the subscripts indicate time.

For example, if we want to now include the natural log of the population growth rate in our regression, we can create that new variable by taking the natural log of the population growth rate \(ln(pop_{t}) - ln(pop_{t-1})\)

# Create log of population
panel_data$lnpop <- log(panel_data$pop)

# Create the population growth rate
panel_data <- panel_data %>% mutate(lnn = lnpop - dplyr::lag(lnpop,1))

Another variable that might also be useful is the natural log of the growth rate of GDP per capita.

panel_data <- panel_data %>% mutate(dlngdp = lngdp - dplyr::lag(lngdp,1))

Let’s put this all together in a regression to see the effect of the growth rate of population on growth rate of GDP per capita, controlling for human capital and the level of GDP per capita in the previous year:

model <- plm(dlngdp ~ lag1_lngdp + lnn + lnhc, data = panel_data, model = "within")
summary(model)

Oneway (individual) effect Within Model

Call:
plm(formula = dlngdp ~ lag1_lngdp + lnn + lnhc, data = panel_data, 
    model = "within")

Unbalanced Panel: n = 145, T = 29-70, N = 8548

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max. 
-3.423229 -0.034369  0.008816  0.052629  2.279155 

Coefficients:
             Estimate Std. Error t-value  Pr(>|t|)    
lag1_lngdp -0.1265769  0.0041663 -30.381 < 2.2e-16 ***
lnn        -0.0998352  0.0079216 -12.603 < 2.2e-16 ***
lnhc        0.3186294  0.0124534  25.586 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    234.83
Residual Sum of Squares: 205.69
R-Squared:      0.1241
Adj. R-Squared: 0.10877
F-statistic: 396.704 on 3 and 8400 DF, p-value: < 2.22e-16

14.5 Is Our Panel Data Regression Properly Specified?

While there are the typical concerns with interpreting the coefficients of regressions (i.e. multicollinearity, inferring causality), there are some topics which require special treatment when working with panel data.

14.5.1 Heteroskedasticity

As always, when running regressions, we must consider whether our residuals are heteroskedastic (not constant for all values of \(X\)). To test our panel data regression for heteroskedasticity in the residuals, we need to calculate a modified Wald statistic. We use the Breusch-Pagan test that can be found in the lmtest package.

library(lmtest)

Loading required package: zoo


Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric

Once we have loaded the lmtest package, we can call the Breusch-Pagan test in the bptest() function. The first argument of bptest() is the model we want to test; in our case, it is the specification for log GDP and log human capital. The second argument is the data frame.

bptest(lngdp ~ lnhc + countrycode, data = panel_data)


    studentized Breusch-Pagan test

data:  lngdp ~ lnhc + countrycode
BP = 1475.6, df = 145, p-value < 2.2e-16

The null hypothesis is homoskedasticity (or constant variance of the error term). From the output above, we can see that we reject the null hypothesis and conclude that the residuals in this regression are heteroskedastic.

We can control for heteroskedasticity in different ways when we use a fixed-effects model. The coeftest() function allows us to estimate several heteroskedasticity-consistent covariance estimators.

# Estimate model
fixed <- plm(lngdp ~ lnhc, data = panel_data, model="within")

# Show original coefficients
coeftest(fixed)


t test of coefficients:

     Estimate Std. Error t value  Pr(>|t|)    
lnhc 2.072537   0.022858  90.672 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Show heteroskedasticity consistent coefficients
coeftest(fixed, vcovHC)


t test of coefficients:

     Estimate Std. Error t value  Pr(>|t|)    
lnhc  2.07254    0.15478   13.39 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

14.5.2 Serial Correlation

In time-series setups where we only observe a single unit over time (no cross-sectional dimension) we might be worried that a linear regression model like

\[ Y_t = \alpha + \beta X_t + \varepsilon_t \]

can have errors that not only are heteroskedastic (i.e. that depend on observables \(X_t\)) but can also be correlated across time. For instance, if \(Y_t\) was income, then \(\varepsilon_t\) may represent income shocks (including transitory and permanent components). The permanent income shocks are, by definition, very persistent over time. This would mean that \(\varepsilon_{t-1}\) affects (and thus is correlated with) shocks in the next period \(\varepsilon_t\). This problem is called serial correlation or autocorrelation, and if it exists, the assumptions of the regression model (i.e. unbiasedness, consistency, etc.) are violated. This can take the form of regressions where a variable is correlated with lagged versions of the same variable.

To test our panel data regression for serial correlation, we need to run a Breusch-Godfrey/Woolridge test. In R, we can do it easily with pbgtest().

# Estimate model
fixed <- plm(lngdp ~ lnhc, data = panel_data, model="within")

# Run test
pbgtest(fixed)


    Breusch-Godfrey/Wooldridge test for serial correlation in panel models

data:  lngdp ~ lnhc
chisq = 7628.1, df = 30, p-value < 2.2e-16
alternative hypothesis: serial correlation in idiosyncratic errors

The null hypothesis is that there is no serial correlation between residuals. From the output, we see that we cannot reject the null hypothesis and conclude the variables are correlated with lagged versions of themselves. One method for dealing with this serial correlation in panel data regression is by using again the coeftest() function, this time with the Arellano method of computing the covariance matrix. Note that the Arellano method allows a fully general structure with respect to both heteroskedasticity and serial correlation, so that our standard errors would effectively be robust to both threats.

# Estimate model
fixed <- plm(lngdp ~ lnhc, data = panel_data, model="within")

# Show original coefficients
coeftest(fixed)


t test of coefficients:

     Estimate Std. Error t value  Pr(>|t|)    
lnhc 2.072537   0.022858  90.672 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Show heteroskedasticity and serial correlation consistent coefficients
coeftest(fixed, vcovHC(fixed, method="arellano"))


t test of coefficients:

     Estimate Std. Error t value  Pr(>|t|)    
lnhc  2.07254    0.15478   13.39 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

14.5.3 Granger Causality

In the regressions that we have been running in this example, we have found that the level of human capital is correlated with the level of GDP per capita. But have we proven that having high human capital causes countries to be wealthier? Or is is possible that wealthier countries can afford to invest in human capital? This is known as the issue of reverse causality, and arises when our independent variable determines our dependent variable.

The Granger Causality test allows use to unpack some of the causality in these regressions. While understanding how this test works is beyond the scope of this notebook, we can look at an example using this data.

The first thing we need to do is ensure that our panel is balanced. In the Penn World Tables, there are no missing values for real GDP and for population, but there are missing values for human capital. We can balance our panel by simply dropping all of the observations that do not include that measure.

panel_data <- panel_data %>%
            drop_na(lnhc)

Next, we can run the test that is provided by R for Granger Causality: grangertest(). The first input is the model we want to use, the second input is the data, and the optional third input is the number of lags we want to use (by default, R uses only 1 lag).

granger_test <- grangertest(lngdp ~ lnhc, data = panel_data, order=3)
print(granger_test)

Granger causality test

Model 1: lngdp ~ Lags(lngdp, 1:3) + Lags(lnhc, 1:3)
Model 2: lngdp ~ Lags(lngdp, 1:3)
  Res.Df Df      F  Pr(>F)  
1   8627                    
2   8630 -3 2.7117 0.04337 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that R gives us two models. In model 1, both previous values of GDP and human capital are included: this is an unrestricted model that includes all Granger-causal terms. In model 2, the Granger-causal terms are omitted and only previous values of GDP are included.

From our results, we can reject the null hypothesis of lack of Granger causality. The evidence seems to suggest that high levels of human capital cause countries to be wealthier.

Please speak to your instructor, supervisor, or TA if you need help with this test.

14.6 How is Panel Data Helpful?

In typical cross-sectional settings, it is hard to defend the selection on observables assumption (otherwise known as conditional independence). However, panel data allows us to control for unobserved time-invariant heterogeneity.

Consider the following example. Household income \(y_{jt}\) at time \(t\) can be split into two components:

\[ y_{jt} = e_{jt} + \Psi_{j} \]

where \(\Psi_{j}\) is a measure of unobserved household-level determinants of income, such as social programs targeted towards certain households.

Consider what happens when we compute each \(j\) household’s average income, average value of \(e\), and average value of \(\Psi\) across time \(t\) in the data:

\[ \bar{y}_{J}= \frac{1}{\sum_{j,t} \mathbf{1}\{ j = J \} } \sum_{j,t} y_{jt} \mathbf{1}\{ j = J \} \] \[ \bar{e}_{J}= \frac{1}{\sum_{j,t} \mathbf{1}\{ j = J \} } \sum_{j,t} e_{jt} \mathbf{1}\{ j = J \} \] \[ \bar{\Psi}_{J} = \Psi_{J} \]

Notice that the mean of \(\Psi_{j}\) does not change over time for a fixed household \(j\). Hence, we can subtract the two household level means from the original equation to get:

\[ y_{jt} - \bar{y}_{j} = e_{jt} - \bar{e}_{j} + \underbrace{ \Psi_{j} - \bar{\Psi}_{j} }_\text{equals zero!} \]

Therefore, we are able to get rid of the unobserved heterogeneity in household determinants of income via “de-meaning”! This is called a within-group or fixed-effects transformation. If we believe these types of unobserved errors/shocks are creating endogeneity, we can get rid of them using this powerful trick. In some cases, we may alternatively choose to do a first-difference transformation of our regression specification. This entails subtracting the regression in one period not from it’s expectation across time, but from the regression in the previous period. In this case, time-invariant characteristics are similarly removed from the regression since they are constant across all periods \(t\).

14.7 Common Mistakes

One common mistake is not to respect the order set by R in defining the ordering variables. By default, R orders panel data based on a cross-sectional ID first and a time variable second. If we change the order of the indices, then the estimates produced by R will change.

If we invert the order of the cross-sectional ID (country) and the time variable (year) we may get different results.

# Default order
plm(lngdp ~ lnhc, data = panel_data, model="within")

Warning in is.pdata.frame(data, feedback = "warn"): input data claims to be a pdata.frame but does not seem to have compliant properties, results can be unreliable. This can happen due to data manipulation by non-pdata.frame-aware functions (e.g., 'dplyr' on pdata.frame). 
 Maybe re-create data input as fresh pdata.frame after last data manipulation with other tools.


Model Formula: lngdp ~ lnhc

Coefficients:
  lnhc 
2.4987

# Inverted order
plm(lngdp ~ lnhc, data = panel_data, model="within", index=c("year","countrycode"))

Warning in is.pdata.frame(data, feedback = "warn"): input data claims to be a pdata.frame but does not seem to have compliant properties, results can be unreliable. This can happen due to data manipulation by non-pdata.frame-aware functions (e.g., 'dplyr' on pdata.frame). 
 Maybe re-create data input as fresh pdata.frame after last data manipulation with other tools.


Model Formula: lngdp ~ lnhc

Coefficients:
  lnhc 
2.4987

Another common mistake happens with the lag() and lead() functions. Since there are several functions with this name, it’s always best to specify to R that we want to use the lag() and lead() functions from the package dplyr.

See what happens when we forget to specify it: do you see any difference between lag1_lngdp and new_lag1_lngdp?

# Create lag using dplyr::lag
panel_data <- panel_data %>% mutate(lag1_lngdp = dplyr::lag(lngdp,1))

# Create lag using lag
panel_data <- panel_data %>% mutate(new_lag1_lngdp = lag(lngdp,1))

# Check the difference
head(panel_data[, c("lngdp", "lag1_lngdp", "new_lag1_lngdp")],50)

            lngdp lag1_lngdp new_lag1_lngdp
AGO-1970 8.485130         NA       8.485130
AGO-1971 8.524844   8.485130       8.524844
AGO-1972 8.480839   8.524844       8.480839
AGO-1973 8.518095   8.480839       8.518095
AGO-1974 8.483309   8.518095       8.483309
AGO-1975 8.377240   8.483309       8.377240
AGO-1976 8.313675   8.377240       8.313675
AGO-1977 8.334398   8.313675       8.334398
AGO-1978 8.189842   8.334398       8.189842
AGO-1979 8.177878   8.189842       8.177878
AGO-1980 8.151327   8.177878       8.151327
AGO-1981 8.105898   8.151327       8.105898
AGO-1982 8.083022   8.105898       8.083022
AGO-1983 8.050258   8.083022       8.050258
AGO-1984 8.105516   8.050258       8.105516
AGO-1985 8.104471   8.105516       8.104471
AGO-1986 7.953249   8.104471       7.953249
AGO-1987 8.034101   7.953249       8.034101
AGO-1988 7.939507   8.034101       7.939507
AGO-1989 7.969447   7.939507       7.969447
AGO-1990 8.018731   7.969447       8.018731
AGO-1991 7.908242   8.018731       7.908242
AGO-1992 8.032713   7.908242       8.032713
AGO-1993 7.850252   8.032713       7.850252
AGO-1994 7.882472   7.850252       7.882472
AGO-1995 7.737827   7.882472       7.737827
AGO-1996 7.944758   7.737827       7.944758
AGO-1997 7.940777   7.944758       7.940777
AGO-1998 7.958970   7.940777       7.958970
AGO-1999 7.899834   7.958970       7.899834
AGO-2000 7.709125   7.899834       7.709125
AGO-2001 7.755491   7.709125       7.755491
AGO-2002 7.837389   7.755491       7.837389
AGO-2003 7.894540   7.837389       7.894540
AGO-2004 8.049262   7.894540       8.049262
AGO-2005 8.322772   8.049262       8.322772
AGO-2006 8.573490   8.322772       8.573490
AGO-2007 8.667330   8.573490       8.667330
AGO-2008 8.842660   8.667330       8.842660
AGO-2009 8.542922   8.842660       8.542922
AGO-2010 8.829720   8.542922       8.829720
AGO-2011 9.028112   8.829720       9.028112
AGO-2012 9.120750   9.028112       9.120750
AGO-2013 9.140014   9.120750       9.140014
AGO-2014 9.154849   9.140014       9.154849
AGO-2015 8.994511   9.154849       8.994511
AGO-2016 8.933472   8.994511       8.933472
AGO-2017 8.951422   8.933472       8.951422
AGO-2018 8.934411   8.951422       8.934411
AGO-2019 8.876206   8.934411       8.876206

14.8 Wrap Up

In this module, we’ve learned how to address linear regression in the case where we have access to two dimensions: cross-sectional variation and time variation. The usefulness of time variation is that it allows us to control for time-invariant components of the error term which may be causing endogeneity. We also investigated different ways for addressing problems such as heteroskedasticity and autocorrelation in our standard errors when working specifically with panel data. In the next module, we will cover a popular research design method: difference-in-differences.

14.9 Wrap-up Table

Command	Function
`pdata.frame`	It transforms a data frame in panel data format.
`plm`	It estimates a linear model with panel data. Use option “within” for Fixed-Effects and “random” for Random-Effects.
`phtest`	It performs a test to choose between Fixed-Effects and Random-Effects model.
`pFtest`	It performs a test to choose whether time fixed-effects are needed.
`dplyr::lag`	It creates lag variables.
`dplyr::lead`	It creates lead variables.
`bptest`	It tests for heteroskedasticity.
`pbgtest`	It tests for serial correlation.
`grangertest`	It tests for Granger causality.

References

Formatting and managing dates
Time-series operators (lags)