COMET
  • Get Started
    • Quickstart Guide
    • Install and Use COMET
    • Get Started
  • Learn By Skill Level
    • Getting Started
    • Beginner
    • Intermediate - Econometrics
    • Intermediate - Geospatial
    • Advanced

    • Browse All
  • Learn By Class
    • Making Sense of Economic Data (ECON 226/227)
    • Econometrics I (ECON 325)
    • Econometrics II (ECON 326)
    • Statistics in Geography (GEOG 374)
  • Learn to Research
    • Learn How to Do a Project
  • Teach With COMET
    • Learn how to teach with Jupyter and COMET
    • Using COMET in the Classroom
    • See COMET presentations
  • Contribute
    • Install for Development
    • Write Self Tests
  • Launch COMET
    • Launch on JupyterOpen (with Data)
    • Launch on JupyterOpen (lite)
    • Launch on Syzygy
    • Launch on Colab
    • Launch Locally

    • Project Datasets
    • Github Repository
  • |
  • About
    • COMET Team
    • Copyright Information

On this page

  • Prerequisites
  • Introduction:
  • Note:
  • Development and the Planet
  • Part 1: Preparing our Data
    • Importing Data into R
    • Viewing the Data
    • Cleaning the Data
  • Part 2: Building our Multiple Regression Model
  • Part 3: Addressing Issues and Improving the Model
    • Underlying Assumptions - Homoskedasticity
    • Multicollinearity
    • Adding Other Variables
  • Summary:
  • Exercises:
    • Exercise 1:
    • Exercise 2:
    • Exercise 3:
    • Exercise 4:
  • Report an issue

Other Formats

  • Jupyter

Projects - Example Project for ECON 326

econ 326
regression
projects
R
visualization
data wrangling
data cleaning
multiple regression
homoskedasticity
multicollinearity
Breusch-Pagan test
heteroskedasticity
vif
dummy variable
interaction terms
Let’s put it all together! This notebook is an example of what a “final project” might look like in ECON 326. It summarizes and uses all many of the empirical skills and R commands present in the other notebooks.
Author

COMET Team
Sarthak Kwatra, Charlotte White

Published

1 July 2023

Prerequisites

  • Introduction to Data in R
  • Introduction to Data Visualization - I and II
  • Simple Regression
  • Multiple Regression
  • Issues in Regression using R
  • Interactions and Non-Linear Terms in Regressions

Introduction:

Now that you are well armored with a statistical toolkit and experience with R, you are well on your way to embark on your own economic research adventure! This project serves as a sample to give you some intuition into the broad steps to a successful research project. It synthesizes the knowledge you have gained in your study of the ECON 325 and ECON 326 modules, and allows you to apply it to your own research project. It explains the steps involved in cleaning your data and preparing it for analysis, the actual analysis itself, and the careful interpretation and visualization of that analysis.

It is important to note that while the more minute tasks in each of these big steps may vary according to the needs of the project, these steps remain mostly the same. Let’s get started by importing all of the packages that we will use through out this module!

# If any of the packages happened to not be installed for you, use the command install.packages() with the name of the packages, like 'stargazer'

library(ggplot2) 
library(haven)
Warning: package 'haven' was built under R version 4.4.2
library(stargazer)

Please cite as: 
 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.4.3
Warning: package 'tibble' was built under R version 4.4.2
Warning: package 'tidyr' was built under R version 4.4.2
Warning: package 'readr' was built under R version 4.4.2
Warning: package 'purrr' was built under R version 4.4.3
Warning: package 'dplyr' was built under R version 4.4.3
Warning: package 'stringr' was built under R version 4.4.2
Warning: package 'forcats' was built under R version 4.4.2
Warning: package 'lubridate' was built under R version 4.4.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ lubridate 1.9.4     ✔ tibble    3.2.1
✔ purrr     1.0.4     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(car)
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some
library(vtable)
Loading required package: kableExtra
Warning: package 'kableExtra' was built under R version 4.4.3

Attaching package: 'kableExtra'

The following object is masked from 'package:dplyr':

    group_rows
library(sandwich)
library(corrplot)
corrplot 0.95 loaded
library(lmtest)
Loading required package: zoo

Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
source("Projects_Example_Project_ECON326_tests.r")

Attaching package: 'testthat'

The following object is masked from 'package:dplyr':

    matches

The following object is masked from 'package:purrr':

    is_null

The following objects are masked from 'package:readr':

    edition_get, local_edition

The following object is masked from 'package:tidyr':

    matches
Warning: package 'digest' was built under R version 4.4.2

Note:

Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables. R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

Development and the Planet

Following a 2019 scientific report which revealed an alarming rate of climate change in the country, the [Government of Canada declared a national climate emergency.] (https://globalnews.ca/news/5401586/canada-national-climate-emergency/) Canada has been far from the only one to take notice and when it comes to the environment; threat of climate change continues to take priority at an international level. All over the world, people are seeking to better understand the causes and impacts of climate change, and looking for ways to mitigate and adapt to how these changes will affect our lives. In particular, the greenhouse gas carbon dioxide [\(CO_2\)] gets a lot of attention. While there are many other gasses that contribute to to the atmospheric greenhouse effect, [\(CO_2\)] is one of the most immediate concerns because of its role in industrialization and energy use.

Gross domestic product (GDP), is a measure that you’re likely very familiar with at this point. As a measure production, GDP is often used to infer the health of an economy or to some degree, the prosperity of the people operating within it. In general, a rising GDP is a desirable outcome. However, we might wonder whether all other outcomes associated with a higher GDP are desirable. In this project, we will be examining the connection between the production of [\(CO_2\)] and GDP in Canada.

****🔎 Let’s think critically**** > 🟠 GDP is commonly considered to not be a zero-sum measure, meaning that a rising GDP in one country does not mean another country’s GDP has to fall. What are the limitations on GDP growth, then? > 🟠 What are the implications of assuming that there can infinite GDP growth when it’s connected to finite measures such as the amount of [\(CO_2\)] that can be sustainably produced and recaptured? Is this just a reflection of what our current energy sources and technology allow, or is there more to the story in how we think about economic growth in general?

Part 1: Preparing our Data

For the sake of our analysis today, we hope to observe whether factors like Electricity Generation, GDP, and Population, have had any impact on CO2 Emissions across all the Canadian Provinces.

Importing Data into R

Once you have gathered data, R has great dependability and dexterity in the viewing and manipulation of that data. To do this, you will want to import your datasets into R, like you have observed in multiple other modules so far. The data that you have gathered could be in a host of different formats like,

  • .csv (Comma-Separated Values file),
  • .dta (STATA data file),
  • .xlsx (Excel file),
  • .sav (SPSS file) or,
  • .sas (SAS file)

All of these files correspond to different softwares, like Microsoft Excel, STATA, or SPSS, but can nonetheless be conveniently imported onto R. Fortunately, we will not be needing separate packages to import these files; haven is our jack-of-all-trades. We used the command library(haven) to load it at beginning of this module. In this case, since all of our data is in the .csv format, we use the function read_csv. The corresponding functions for the other formats are, read_dta, read_spss, and so on.

# Loading the Data into R

gdp_data <- read_csv("../datasets_projects/gdp_data.csv")
Rows: 156 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): GEO, DGUID, Value, North American Industry Classification System (N...
dbl (5): REF_DATE, UOM_ID, SCALAR_ID, VALUE, DECIMALS
lgl (3): STATUS, SYMBOL, TERMINATED

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pollution_data <- read_csv("../datasets_projects/pollution_data.csv")
Rows: 840 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): GEO, DGUID, Sector, UOM, SCALAR_FACTOR, VECTOR
dbl (6): REF_DATE, UOM_ID, SCALAR_ID, COORDINATE, VALUE, DECIMALS
lgl (3): STATUS, SYMBOL, TERMINATED

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
elec_data <- read_csv("../datasets_projects/elec_data.csv")
Rows: 13744 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): REF_DATE, GEO, DGUID, Class of electricity producer, Type of elect...
dbl  (4): UOM_ID, SCALAR_ID, VALUE, DECIMALS
lgl  (1): TERMINATED

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pop_data <- read_csv("../datasets_projects/pop_data.csv")
Rows: 672 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): REF_DATE, GEO, DGUID, UOM, SCALAR_FACTOR, VECTOR
dbl (5): UOM_ID, SCALAR_ID, COORDINATE, VALUE, DECIMALS
lgl (3): STATUS, SYMBOL, TERMINATED

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

NOTE: By default, some functions in the Haven package, like read_csv(), assume that the CSV file has a header row with variable names. If your file does not have a header, or you would like different headers for your columns, you can use the argument col_names to adjust the column names manually.

Viewing the Data

Once you have imported your datasets in R, it is worthwhile to get an overview of the data. There are two main reasons for this:

  • Not every dataset will come formatted in a way that is suitable for your analysis, and therefore it is important to understand the structure of your dataset and its variables
  • An overview allows you to recognize any potential obvious issues that the data may have, like missing values, duplicates, or unnecessary variables, that would pose issues in your analysis at a later stage

Commands that can be used to view and understand the structure of your data include: head(), str(), summary(), and view(). These four functions can be used roughly interchangeably understand the structure of your data

# Make sure to run these commands individually!

head(gdp_data)
# A tibble: 6 × 16
  REF_DATE GEO     DGUID Value North American Indus…¹ UOM   UOM_ID SCALAR_FACTOR
     <dbl> <chr>   <chr> <chr> <chr>                  <chr>  <dbl> <chr>        
1     2009 Newfou… 2016… Chai… All industries [T001]  Doll…     81 millions     
2     2010 Newfou… 2016… Chai… All industries [T001]  Doll…     81 millions     
3     2011 Newfou… 2016… Chai… All industries [T001]  Doll…     81 millions     
4     2012 Newfou… 2016… Chai… All industries [T001]  Doll…     81 millions     
5     2013 Newfou… 2016… Chai… All industries [T001]  Doll…     81 millions     
6     2014 Newfou… 2016… Chai… All industries [T001]  Doll…     81 millions     
# ℹ abbreviated name: ¹​`North American Industry Classification System (NAICS)`
# ℹ 8 more variables: SCALAR_ID <dbl>, VECTOR <chr>, COORDINATE <chr>,
#   VALUE <dbl>, STATUS <lgl>, SYMBOL <lgl>, TERMINATED <lgl>, DECIMALS <dbl>
summary(pollution_data)
    REF_DATE        GEO               DGUID              Sector         
 Min.   :2009   Length:840         Length:840         Length:840        
 1st Qu.:2012   Class :character   Class :character   Class :character  
 Median :2014   Mode  :character   Mode  :character   Mode  :character  
 Mean   :2014                                                           
 3rd Qu.:2017                                                           
 Max.   :2020                                                           
     UOM                UOM_ID    SCALAR_FACTOR        SCALAR_ID
 Length:840         Min.   :198   Length:840         Min.   :0  
 Class :character   1st Qu.:198   Class :character   1st Qu.:0  
 Mode  :character   Median :198   Mode  :character   Median :0  
                    Mean   :198                      Mean   :0  
                    3rd Qu.:198                      3rd Qu.:0  
                    Max.   :198                      Max.   :0  
    VECTOR            COORDINATE         VALUE           STATUS       
 Length:840         Min.   : 1.100   Min.   :-51464.0   Mode:logical  
 Class :character   1st Qu.: 4.118   1st Qu.:   473.8   NA's:840      
 Mode  :character   Median : 7.650   Median :  6556.0                 
                    Mean   : 7.630   Mean   : 62992.9                 
                    3rd Qu.:11.118   3rd Qu.: 63519.5                 
                    Max.   :14.200   Max.   :788071.0                 
  SYMBOL        TERMINATED        DECIMALS
 Mode:logical   Mode:logical   Min.   :0  
 NA's:840       NA's:840       1st Qu.:0  
                               Median :0  
                               Mean   :0  
                               3rd Qu.:0  
                               Max.   :0  
str(elec_data)
spc_tbl_ [13,744 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ REF_DATE                      : chr [1:13744] "2009-01" "2009-02" "2009-03" "2009-04" ...
 $ GEO                           : chr [1:13744] "Newfoundland and Labrador" "Newfoundland and Labrador" "Newfoundland and Labrador" "Newfoundland and Labrador" ...
 $ DGUID                         : chr [1:13744] "2016A000210" "2016A000210" "2016A000210" "2016A000210" ...
 $ Class of electricity producer : chr [1:13744] "Total all classes of electricity producer" "Total all classes of electricity producer" "Total all classes of electricity producer" "Total all classes of electricity producer" ...
 $ Type of electricity generation: chr [1:13744] "Total all types of electricity generation" "Total all types of electricity generation" "Total all types of electricity generation" "Total all types of electricity generation" ...
 $ UOM                           : chr [1:13744] "Megawatt hours" "Megawatt hours" "Megawatt hours" "Megawatt hours" ...
 $ UOM_ID                        : num [1:13744] 210 210 210 210 210 210 210 210 210 210 ...
 $ SCALAR_FACTOR                 : chr [1:13744] "units" "units" "units" "units" ...
 $ SCALAR_ID                     : num [1:13744] 0 0 0 0 0 0 0 0 0 0 ...
 $ VECTOR                        : chr [1:13744] "v44174695" "v44174695" "v44174695" "v44174695" ...
 $ COORDINATE                    : chr [1:13744] "2.1.1" "2.1.1" "2.1.1" "2.1.1" ...
 $ VALUE                         : num [1:13744] 4174723 3706629 4082102 3268823 2754027 ...
 $ STATUS                        : chr [1:13744] NA NA NA NA ...
 $ SYMBOL                        : chr [1:13744] NA NA NA NA ...
 $ TERMINATED                    : logi [1:13744] NA NA NA NA NA NA ...
 $ DECIMALS                      : num [1:13744] 0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "spec")=
  .. cols(
  ..   REF_DATE = col_character(),
  ..   GEO = col_character(),
  ..   DGUID = col_character(),
  ..   `Class of electricity producer` = col_character(),
  ..   `Type of electricity generation` = col_character(),
  ..   UOM = col_character(),
  ..   UOM_ID = col_double(),
  ..   SCALAR_FACTOR = col_character(),
  ..   SCALAR_ID = col_double(),
  ..   VECTOR = col_character(),
  ..   COORDINATE = col_character(),
  ..   VALUE = col_double(),
  ..   STATUS = col_character(),
  ..   SYMBOL = col_character(),
  ..   TERMINATED = col_logical(),
  ..   DECIMALS = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
view(pop_data)

An overview of our data reveals a few interesting things. All of data has been collected for the years 2009 - 2020. However, while the GDP and CO2 Emissions data is Annual, the Electricity Generation Data is Monthly, and the Population Data is Quarterly. It is also interesting to note that the some of the values for Electricity Generation are missing for some years for the Provinces of Newfoundland and Labrador, and Prince Edward Island.

Cleaning the Data

Having recognized these potential issues, getting rid of them is important, and it deems the name “Cleaning the Data” to this section of the project. An important rough structure to keep in mind while cleaning your data is called “Tidy Data”, introduced by the statistician Hadley Wickham, where,

  • Each Variable has its own Column
  • Each Observation has its own Row, and,
  • Each Value has its own Cell

To begin with, we try to keep the column names of our variables such that they are short and easy to manipulate, so let’s change some of the column names in our datasets.

# Changing the Names across our Datasets

pollution_data <- pollution_data %>% rename(c(year = REF_DATE, province = GEO, sector = Sector, CO2 = VALUE))

gdp_data <- gdp_data %>% rename(c(year = REF_DATE, province = GEO, NAICS = `North American Industry Classification System (NAICS)`, GDP = VALUE))

elec_data <- elec_data %>% rename(c(year = REF_DATE, province = GEO, type = `Type of electricity generation`, elec = VALUE))

pop_data <- pop_data %>% rename(c(year = REF_DATE, province = GEO, pop = VALUE))

Next, note that across our Pollution and Population datasets, there are aggregations to a Canada-wide level, while our analysis is limited to the Provinces. Therefore, an inclusion of the Canada-wide aggregations will lead to a bias in our results. Let’s get rid of that by filtering them out.

# Filtering to keep every observation for which the GEO isn't equal to Canada

pop_data <- pop_data %>% filter(province != 'Canada')
pollution_data <- pollution_data %>% filter(province != 'Canada')

As noted before, there were some missing values in the Electricity Generation dataset. Although there are multiple ways of dealing with missing data, like using averages, or using advanced imputation techniques like multiple imputation, we choose to deal with missing values here by omitting them from our data.

elec_data <- elec_data %>% filter(elec != is.na(elec))

Similar aggregations also exist in the Pollution dataset for “Total, industries and households”, “Total, industries”, and “Total, households”. They also exist in the Electricity Generation dataset as “Total all types of electricity generation”. Let’s filter them, only this time, we will keep the aggregates across the categories of electricity generation and pollution, and get rid of the sub-categories.

pollution_data <- pollution_data %>% filter(sector == "Total, industries and households")
elec_data <- elec_data %>% filter(type == 'Total all types of electricity generation')

Next, as we previously noted, while the GDP and CO2 Emissions data is Annual, the Electricity Generation Data is Monthly, and the Population Data is Quarterly. Therefore, let’s group them both to Yearly levels. Before we do that, note that “REF_DATE” contains the variables Month and Year. Therefore, satisfying our principles Tidy Data, let’s use the Substring function to break it down into Month and Year.

elec_data <- elec_data %>%
  mutate(year = substr(year, 1, 4), month = substr(year, 6, 7))

pop_data <- pop_data %>%
  mutate(year = substr(year, 1, 4), month = substr(year, 6, 7))

Now, let’s work on making both the Electricity and Population datasets annual.

elec_data_grouped <- elec_data %>%
  group_by(year, province) %>%
  summarise(electricity = round(mean(elec)))
`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by year and province.
ℹ Output is grouped by year.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(year, province))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.
pop_data_grouped <- pop_data %>%
  group_by(year, province) %>%
  summarise(population = round(mean(pop)))
`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by year and province.
ℹ Output is grouped by year.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(year, province))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.

Our next step will be to merge our datasets, so that we can smoothly run the analysis from one clean reference.

# Making the Data types compatible for joining
pop_data_grouped <- pop_data_grouped %>% mutate(year = as.double(year))
elec_data_grouped <- elec_data_grouped %>% mutate(year = as.double(year))

# Merging the four datasets into two
merged_data_1 <- left_join(gdp_data, pop_data_grouped, by = c('year', 'province'))
merged_data_2 <- left_join(pollution_data, elec_data_grouped, by = c('year', 'province'))

# Performing the Final Merge
merged_data <- left_join(merged_data_1, merged_data_2, by = c('year', 'province'))
str(merged_data)
spc_tbl_ [156 × 31] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ year           : num [1:156] 2009 2010 2011 2012 2013 ...
 $ province       : chr [1:156] "Newfoundland and Labrador" "Newfoundland and Labrador" "Newfoundland and Labrador" "Newfoundland and Labrador" ...
 $ DGUID.x        : chr [1:156] "2016A000210" "2016A000210" "2016A000210" "2016A000210" ...
 $ Value          : chr [1:156] "Chained (2012) dollars" "Chained (2012) dollars" "Chained (2012) dollars" "Chained (2012) dollars" ...
 $ NAICS          : chr [1:156] "All industries [T001]" "All industries [T001]" "All industries [T001]" "All industries [T001]" ...
 $ UOM.x          : chr [1:156] "Dollars" "Dollars" "Dollars" "Dollars" ...
 $ UOM_ID.x       : num [1:156] 81 81 81 81 81 81 81 81 81 81 ...
 $ SCALAR_FACTOR.x: chr [1:156] "millions" "millions" "millions" "millions" ...
 $ SCALAR_ID.x    : num [1:156] 6 6 6 6 6 6 6 6 6 6 ...
 $ VECTOR.x       : chr [1:156] "v62464519" "v62464519" "v62464519" "v62464519" ...
 $ COORDINATE.x   : chr [1:156] "1.2.1" "1.2.1" "1.2.1" "1.2.1" ...
 $ GDP            : num [1:156] 28944 30474 31376 29977 31486 ...
 $ STATUS.x       : logi [1:156] NA NA NA NA NA NA ...
 $ SYMBOL.x       : logi [1:156] NA NA NA NA NA NA ...
 $ TERMINATED.x   : logi [1:156] NA NA NA NA NA NA ...
 $ DECIMALS.x     : num [1:156] 1 1 1 1 1 1 1 1 1 1 ...
 $ population     : num [1:156] 516170 521590 524725 526349 527448 ...
 $ DGUID.y        : chr [1:156] "2016A000210" "2016A000210" "2016A000210" "2016A000210" ...
 $ sector         : chr [1:156] "Total, industries and households" "Total, industries and households" "Total, industries and households" "Total, industries and households" ...
 $ UOM.y          : chr [1:156] "Kilotonnes" "Kilotonnes" "Kilotonnes" "Kilotonnes" ...
 $ UOM_ID.y       : num [1:156] 198 198 198 198 198 198 198 198 198 198 ...
 $ SCALAR_FACTOR.y: chr [1:156] "units" "units" "units" "units" ...
 $ SCALAR_ID.y    : num [1:156] 0 0 0 0 0 0 0 0 0 0 ...
 $ VECTOR.y       : chr [1:156] "v1013973302" "v1013973302" "v1013973302" "v1013973302" ...
 $ COORDINATE.y   : num [1:156] 2.1 2.1 2.1 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
 $ CO2            : num [1:156] 9324 9474 9627 9171 9260 ...
 $ STATUS.y       : logi [1:156] NA NA NA NA NA NA ...
 $ SYMBOL.y       : logi [1:156] NA NA NA NA NA NA ...
 $ TERMINATED.y   : logi [1:156] NA NA NA NA NA NA ...
 $ DECIMALS.y     : num [1:156] 0 0 0 0 0 0 0 0 0 0 ...
 $ electricity    : num [1:156] 3114893 3396756 3392520 3582204 3511897 ...
 - attr(*, "spec")=
  .. cols(
  ..   REF_DATE = col_double(),
  ..   GEO = col_character(),
  ..   DGUID = col_character(),
  ..   Value = col_character(),
  ..   `North American Industry Classification System (NAICS)` = col_character(),
  ..   UOM = col_character(),
  ..   UOM_ID = col_double(),
  ..   SCALAR_FACTOR = col_character(),
  ..   SCALAR_ID = col_double(),
  ..   VECTOR = col_character(),
  ..   COORDINATE = col_character(),
  ..   VALUE = col_double(),
  ..   STATUS = col_logical(),
  ..   SYMBOL = col_logical(),
  ..   TERMINATED = col_logical(),
  ..   DECIMALS = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
#renaming some categories
merged_data <- merged_data  %>%
  rename(gdp = GDP)

# Now we need some factors. `Province` should be a factor variable. 

merged_data  <- merged_data  %>%
  mutate(province = case_when(
    province == "Newfoundland and Labrador" ~ "1",
    province == "Prince Edward Island" ~ "2",
    province == "Nova Scotia" ~ "3",
    province == "New Brunswick" ~ "4",
    province == "Quebec" ~ "5",
    province == "Ontario" ~ "6",
    province == "Manitoba" ~ "7",
    province == "Saskatchewan" ~ "8",
    province == "Alberta" ~ "9",
    province == "British Columbia" ~ "10",
    province == "Yukon" ~ "11",
    province == "Northwest Territories" ~ "12",
    province == "Nunavut" ~ "13",
  )) %>%
  mutate(province = as_factor(province))

#Now for clarity, we'll rename the dataset.

CO2_data <- merged_data

Now let’s see how that’s changed our data structure.

str(CO2_data)
tibble [156 × 31] (S3: tbl_df/tbl/data.frame)
 $ year           : num [1:156] 2009 2010 2011 2012 2013 ...
 $ province       : Factor w/ 13 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ DGUID.x        : chr [1:156] "2016A000210" "2016A000210" "2016A000210" "2016A000210" ...
 $ Value          : chr [1:156] "Chained (2012) dollars" "Chained (2012) dollars" "Chained (2012) dollars" "Chained (2012) dollars" ...
 $ NAICS          : chr [1:156] "All industries [T001]" "All industries [T001]" "All industries [T001]" "All industries [T001]" ...
 $ UOM.x          : chr [1:156] "Dollars" "Dollars" "Dollars" "Dollars" ...
 $ UOM_ID.x       : num [1:156] 81 81 81 81 81 81 81 81 81 81 ...
 $ SCALAR_FACTOR.x: chr [1:156] "millions" "millions" "millions" "millions" ...
 $ SCALAR_ID.x    : num [1:156] 6 6 6 6 6 6 6 6 6 6 ...
 $ VECTOR.x       : chr [1:156] "v62464519" "v62464519" "v62464519" "v62464519" ...
 $ COORDINATE.x   : chr [1:156] "1.2.1" "1.2.1" "1.2.1" "1.2.1" ...
 $ gdp            : num [1:156] 28944 30474 31376 29977 31486 ...
 $ STATUS.x       : logi [1:156] NA NA NA NA NA NA ...
 $ SYMBOL.x       : logi [1:156] NA NA NA NA NA NA ...
 $ TERMINATED.x   : logi [1:156] NA NA NA NA NA NA ...
 $ DECIMALS.x     : num [1:156] 1 1 1 1 1 1 1 1 1 1 ...
 $ population     : num [1:156] 516170 521590 524725 526349 527448 ...
 $ DGUID.y        : chr [1:156] "2016A000210" "2016A000210" "2016A000210" "2016A000210" ...
 $ sector         : chr [1:156] "Total, industries and households" "Total, industries and households" "Total, industries and households" "Total, industries and households" ...
 $ UOM.y          : chr [1:156] "Kilotonnes" "Kilotonnes" "Kilotonnes" "Kilotonnes" ...
 $ UOM_ID.y       : num [1:156] 198 198 198 198 198 198 198 198 198 198 ...
 $ SCALAR_FACTOR.y: chr [1:156] "units" "units" "units" "units" ...
 $ SCALAR_ID.y    : num [1:156] 0 0 0 0 0 0 0 0 0 0 ...
 $ VECTOR.y       : chr [1:156] "v1013973302" "v1013973302" "v1013973302" "v1013973302" ...
 $ COORDINATE.y   : num [1:156] 2.1 2.1 2.1 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
 $ CO2            : num [1:156] 9324 9474 9627 9171 9260 ...
 $ STATUS.y       : logi [1:156] NA NA NA NA NA NA ...
 $ SYMBOL.y       : logi [1:156] NA NA NA NA NA NA ...
 $ TERMINATED.y   : logi [1:156] NA NA NA NA NA NA ...
 $ DECIMALS.y     : num [1:156] 0 0 0 0 0 0 0 0 0 0 ...
 $ electricity    : num [1:156] 3114893 3396756 3392520 3582204 3511897 ...

Great! We’re ready to start building our model.

Part 2: Building our Multiple Regression Model

Now that we have our dataset ready and cleaned, let’s start to think about building our model. What are the relationships that we’re interested in investigating? For the dataset that we’re working with, these would be : gross domestic product (GDP), electricity, and population.

Think Deeper: Why might we suspect that relationships exist between these variables? Is this consistent with economic theory? How would these relationships relate to your own experience?

Let’s begin investigating these relationships by making some visualizations.

corr_plot_data <- CO2_data[, c('CO2', 'gdp', 'population', 'electricity')]
corr_plot_data <- as.data.frame(corr_plot_data)

# Compute the correlation matrix)
cor_matrix = cor(corr_plot_data)

# Create the correlation plot
corrplot(cor_matrix, order = 'hclust', addrect = 2)

a <- ggplot(data = CO2_data, aes(x = gdp, y = CO2))+
  xlab("GDP in millions of dollars")+
  ylab("CO2 Emissions in kilotonnes") + scale_x_continuous()

a + geom_point()

b <- ggplot(data = CO2_data, aes(x = population, y = CO2))+
  xlab("Population")+
  ylab("CO2 Emissions in kilotonnes") + scale_x_continuous()

b + geom_point()

c <- ggplot(data = CO2_data, aes(x = electricity, y = CO2))+
  xlab("Electricity in megawatt hours)")+
  ylab("CO2 Emissions in kilotonnes") + scale_x_continuous()

c + geom_point()

view(gdp_data)

From these plots, we can see visually that there appears to be an approximate linear realtionship between the continuous independent variables and CO2. Now that we’ve established some of these relationships, let’s build our multiple regression model.

slr_1 <- lm(CO2 ~ gdp, data = CO2_data)
summary(slr_1)

Call:
lm(formula = CO2 ~ gdp, data = CO2_data)

Residuals:
   Min     1Q Median     3Q    Max 
-90091 -15779 -14917  -5893 172430 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.561e+04  5.099e+03   3.062   0.0026 ** 
gdp         3.136e-01  2.152e-02  14.572   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 51790 on 154 degrees of freedom
Multiple R-squared:  0.5796,    Adjusted R-squared:  0.5769 
F-statistic: 212.3 on 1 and 154 DF,  p-value: < 2.2e-16
slr_2 <- lm(CO2 ~ electricity, data = CO2_data)
summary(slr_2)

Call:
lm(formula = CO2 ~ electricity, data = CO2_data)

Residuals:
   Min     1Q Median     3Q    Max 
-95309 -23920 -22498  10291 221560 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.411e+04  6.687e+03   3.605 0.000421 ***
electricity 8.857e-03  1.046e-03   8.470 1.83e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 65980 on 154 degrees of freedom
Multiple R-squared:  0.3178,    Adjusted R-squared:  0.3133 
F-statistic: 71.73 on 1 and 154 DF,  p-value: 1.832e-14
slr_3<- lm(CO2 ~ population, data = CO2_data)
summary(slr_3)

Call:
lm(formula = CO2 ~ population, data = CO2_data)

Residuals:
   Min     1Q Median     3Q    Max 
-63493 -23355 -20224 -12785 216382 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.341e+04  5.968e+03   3.922 0.000132 ***
population  1.293e-02  1.242e-03  10.405  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 61210 on 154 degrees of freedom
Multiple R-squared:  0.4128,    Adjusted R-squared:  0.409 
F-statistic: 108.3 on 1 and 154 DF,  p-value: < 2.2e-16
mlr_1 <- lm(CO2 ~ gdp + electricity + population, data = CO2_data)
summary(mlr_1)

Call:
lm(formula = CO2 ~ gdp + electricity + population, data = CO2_data)

Residuals:
   Min     1Q Median     3Q    Max 
-92224  -8868  -4123  13767  62333 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.518e+03  2.375e+03   3.166  0.00187 ** 
gdp          1.601e+00  5.188e-02  30.858  < 2e-16 ***
electricity  6.798e-03  8.033e-04   8.463 2.04e-14 ***
population  -7.151e-02  2.917e-03 -24.519  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22860 on 152 degrees of freedom
Multiple R-squared:  0.9192,    Adjusted R-squared:  0.9176 
F-statistic: 576.2 on 3 and 152 DF,  p-value: < 2.2e-16

As seen from the adjusted r-squared output, our multiple regression model has greater explanatory power than the any of the simple regressions alone, and all of the coefficients given are significant.

Part 3: Addressing Issues and Improving the Model

Underlying Assumptions - Homoskedasticity

Homoskedasticity, or constant variance, is an underlying assumption of OLS. Knowing that heteroskedasticity is another common issue in regression, we need to check our model to ensure that it meets this requirement. We’ll start by checking visually with a residual plot for our first variable, GDP.

slr_1 <- lm(CO2 ~ gdp, data = CO2_data)

ggplot(data = CO2_data, aes(x = as.numeric(gdp), y = as.numeric(slr_1$residuals))
        )+geom_point()+labs(x = "GDP", y = "Residuals")

From this “eyeball test”, it looks like the data display heterskedasticity. We’ll test for this formally now with a Breusch-Pagan test.

CO2_data$resid_sqgdp <-  (slr_1$residuals)^2
residualsslr_1 <- lm(resid_sqgdp ~ gdp, data = CO2_data)
summary(residualsslr_1)

Call:
lm(formula = resid_sqgdp ~ gdp, data = CO2_data)

Residuals:
       Min         1Q     Median         3Q        Max 
-8.419e+09 -1.409e+09 -9.502e+08 -5.301e+08  2.448e+10 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 742582663  650815713   1.141    0.256    
gdp             13820       2747   5.031 1.34e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.611e+09 on 154 degrees of freedom
Multiple R-squared:  0.1412,    Adjusted R-squared:  0.1356 
F-statistic: 25.31 on 1 and 154 DF,  p-value: 1.343e-06

We reject the null hypothesis and conclude that the GDP data are heteroskedastic. Let’s try to address this with a log transformation.

# transform the data
CO2_data <- CO2_data %>%
  mutate(ln_gdp = log(gdp))

# visualize the logged value residuals
slr_ln1 <- lm(CO2 ~ ln_gdp, data = CO2_data) 

ggplot(data = CO2_data, aes(x = as.numeric(ln_gdp), y = as.numeric(slr_ln1$residuals))) + 
  geom_point()+labs(x = "Log GDP", y = "Residuals")

# conduct another Breusch-Pagan Test
CO2_data$resid_sq_lngdp <-  (slr_ln1$residuals)^2
residualsslr_ln1 <- lm(resid_sq_lngdp ~ ln_gdp, data = CO2_data)
summary(residualsslr_ln1)

Call:
lm(formula = resid_sq_lngdp ~ ln_gdp, data = CO2_data)

Residuals:
       Min         1Q     Median         3Q        Max 
-5.921e+09 -3.334e+09 -6.787e+08 -1.289e+08  2.137e+10 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -8.941e+09  2.537e+09  -3.524 0.000559 ***
ln_gdp       1.103e+09  2.370e+08   4.655 6.93e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.501e+09 on 154 degrees of freedom
Multiple R-squared:  0.1234,    Adjusted R-squared:  0.1177 
F-statistic: 21.67 on 1 and 154 DF,  p-value: 6.933e-06

This doesn’t seem to have fixed the problem of heteroskedasticity within the gdp data. We’ll overcome this in our final model by using robust standard errors. We’ll now go through a similar process with the remaining two variables, and conclude that the electricity data are also heteroskedastic.

#For `electricity`
#Visualise the residuals
slr_2 <- lm(CO2 ~ electricity, data = CO2_data)
ggplot(data = CO2_data, aes(x = as.numeric(electricity), y = as.numeric(slr_2$residuals))
        )+geom_point()+labs(x = "Electricity", y = "Residuals")

#Fromally test the hypothesis
CO2_data$resid_sqelec <-  (slr_2$residuals)^2
residualsslr_2 <- lm(resid_sqelec ~ electricity, data = CO2_data)
summary(residualsslr_2)

Call:
lm(formula = resid_sqelec ~ electricity, data = CO2_data)

Residuals:
       Min         1Q     Median         3Q        Max 
-8.710e+09 -2.958e+09 -2.135e+09 -1.816e+09  4.422e+10 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) 2.323e+09  1.065e+09   2.182  0.03065 * 
electricity 5.035e+02  1.665e+02   3.024  0.00292 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.05e+10 on 154 degrees of freedom
Multiple R-squared:  0.05605,   Adjusted R-squared:  0.04992 
F-statistic: 9.144 on 1 and 154 DF,  p-value: 0.002925
#Log transform the data
CO2_data <- CO2_data %>%
  mutate(ln_electricity = log(electricity))

# visualize the logged value residuals
slr_ln2 <- lm(CO2 ~ ln_electricity, data = CO2_data)
ggplot(data = CO2_data, aes(x = as.numeric(ln_electricity), y = as.numeric(slr_ln2$residuals))
)+geom_point()+labs(x = "Log Electricity", y_ = "Residuals")

#Formally test 
CO2_data$resid_sq_lnelec <-  (slr_ln2$residuals)^2
residualsslr_ln2  <- lm(resid_sq_lnelec ~ ln_electricity, data = CO2_data)
summary(residualsslr_ln2)

Call:
lm(formula = resid_sq_lnelec ~ ln_electricity, data = CO2_data)

Residuals:
       Min         1Q     Median         3Q        Max 
-6.642e+09 -4.155e+09 -1.283e+09  3.570e+08  3.196e+10 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -1.171e+10  3.731e+09  -3.137  0.00204 ** 
ln_electricity  1.146e+09  2.694e+08   4.252 3.67e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.953e+09 on 154 degrees of freedom
Multiple R-squared:  0.1051,    Adjusted R-squared:  0.09924 
F-statistic: 18.08 on 1 and 154 DF,  p-value: 3.665e-05
# We reject the null hypothesis, and conclude heteroskedasticity in this data. 
#For `population`
#Visualise the residuals
slr_3 <- lm(CO2 ~ population, data = CO2_data)
ggplot(data = CO2_data, aes(x = as.numeric(population), y = as.numeric(slr_3$residuals))
        )+geom_point()+labs(x = "Population", y = "Residuals")

#Formal hypothesis testing
CO2_data$resid_sqpop <- (slr_3$residuals)^2
residualsslr_3 <- lm(resid_sqpop ~ population, data = CO2_data)
summary(residualsslr_3)

Call:
lm(formula = resid_sqpop ~ population, data = CO2_data)

Residuals:
       Min         1Q     Median         3Q        Max 
-6.843e+09 -2.926e+09 -2.577e+09 -2.292e+09  4.270e+10 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) 2.819e+09  1.041e+09   2.707  0.00755 **
population  3.207e+02  2.168e+02   1.480  0.14100   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.068e+10 on 154 degrees of freedom
Multiple R-squared:  0.01402,   Adjusted R-squared:  0.007616 
F-statistic:  2.19 on 1 and 154 DF,  p-value: 0.141
# We cannot reject the null hypothesis, and therefore cannot conclude heteroscedasticity for the population data. 

To run our regressions with heteroskedasticity-robust standard errors, we use the sandwich package, which we’ve called on earlier.

# The initial regression that we ran
mlr_1 <- lm(CO2 ~ gdp + population + electricity, data = CO2_data)

# Obtain robust standard errors
robust_se <- sqrt(diag(vcovHC(mlr_1, type = "HC1")))  # "HC1" is one of the robust variance estimators

# Print the robust standard errors
print(robust_se)
 (Intercept)          gdp   population  electricity 
1.574100e+03 6.000953e-02 3.555961e-03 1.073823e-03 
# Usiing Coeftest to print the Hetetrokedasticity-Robust Standard Errors
coeftest(mlr_1)

t test of coefficients:

               Estimate  Std. Error  t value  Pr(>|t|)    
(Intercept)  7.5183e+03  2.3750e+03   3.1656   0.00187 ** 
gdp          1.6010e+00  5.1882e-02  30.8579 < 2.2e-16 ***
population  -7.1513e-02  2.9167e-03 -24.5185 < 2.2e-16 ***
electricity  6.7978e-03  8.0326e-04   8.4628 2.038e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multicollinearity

As we know, multicollinearity is a common issue that arises in developing a regression. To test how this shows up in our current model, we will calculate variance inflation factors (VIF), using the package car.

Think Deeper: Do you suspect any variables in the model may be affected by multicollinearity? How come?

car::vif(mlr_1)
        gdp  population electricity 
  29.840828   39.532929    4.915504 
coeftest(slr_3)

t test of coefficients:

              Estimate Std. Error t value  Pr(>|t|)    
(Intercept) 2.3407e+04 5.9677e+03  3.9223 0.0001318 ***
population  1.2925e-02 1.2422e-03 10.4052 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(slr_3)

Call:
lm(formula = CO2 ~ population, data = CO2_data)

Residuals:
   Min     1Q Median     3Q    Max 
-63493 -23355 -20224 -12785 216382 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.341e+04  5.968e+03   3.922 0.000132 ***
population  1.293e-02  1.242e-03  10.405  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 61210 on 154 degrees of freedom
Multiple R-squared:  0.4128,    Adjusted R-squared:  0.409 
F-statistic: 108.3 on 1 and 154 DF,  p-value: < 2.2e-16

The high VIF indicate that there’s a problem with collinearity in our data. We’ll try to combat this by creating a model which combines the effects of GDP and population into a single variable of per capita GDP.

#Create a new variable and view how it fits in the structure. 
CO2_data$per_capita_gdp <- CO2_data$gdp/CO2_data$population
str(CO2_data)
tibble [156 × 39] (S3: tbl_df/tbl/data.frame)
 $ year           : num [1:156] 2009 2010 2011 2012 2013 ...
 $ province       : Factor w/ 13 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ DGUID.x        : chr [1:156] "2016A000210" "2016A000210" "2016A000210" "2016A000210" ...
 $ Value          : chr [1:156] "Chained (2012) dollars" "Chained (2012) dollars" "Chained (2012) dollars" "Chained (2012) dollars" ...
 $ NAICS          : chr [1:156] "All industries [T001]" "All industries [T001]" "All industries [T001]" "All industries [T001]" ...
 $ UOM.x          : chr [1:156] "Dollars" "Dollars" "Dollars" "Dollars" ...
 $ UOM_ID.x       : num [1:156] 81 81 81 81 81 81 81 81 81 81 ...
 $ SCALAR_FACTOR.x: chr [1:156] "millions" "millions" "millions" "millions" ...
 $ SCALAR_ID.x    : num [1:156] 6 6 6 6 6 6 6 6 6 6 ...
 $ VECTOR.x       : chr [1:156] "v62464519" "v62464519" "v62464519" "v62464519" ...
 $ COORDINATE.x   : chr [1:156] "1.2.1" "1.2.1" "1.2.1" "1.2.1" ...
 $ gdp            : num [1:156] 28944 30474 31376 29977 31486 ...
 $ STATUS.x       : logi [1:156] NA NA NA NA NA NA ...
 $ SYMBOL.x       : logi [1:156] NA NA NA NA NA NA ...
 $ TERMINATED.x   : logi [1:156] NA NA NA NA NA NA ...
 $ DECIMALS.x     : num [1:156] 1 1 1 1 1 1 1 1 1 1 ...
 $ population     : num [1:156] 516170 521590 524725 526349 527448 ...
 $ DGUID.y        : chr [1:156] "2016A000210" "2016A000210" "2016A000210" "2016A000210" ...
 $ sector         : chr [1:156] "Total, industries and households" "Total, industries and households" "Total, industries and households" "Total, industries and households" ...
 $ UOM.y          : chr [1:156] "Kilotonnes" "Kilotonnes" "Kilotonnes" "Kilotonnes" ...
 $ UOM_ID.y       : num [1:156] 198 198 198 198 198 198 198 198 198 198 ...
 $ SCALAR_FACTOR.y: chr [1:156] "units" "units" "units" "units" ...
 $ SCALAR_ID.y    : num [1:156] 0 0 0 0 0 0 0 0 0 0 ...
 $ VECTOR.y       : chr [1:156] "v1013973302" "v1013973302" "v1013973302" "v1013973302" ...
 $ COORDINATE.y   : num [1:156] 2.1 2.1 2.1 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
 $ CO2            : num [1:156] 9324 9474 9627 9171 9260 ...
 $ STATUS.y       : logi [1:156] NA NA NA NA NA NA ...
 $ SYMBOL.y       : logi [1:156] NA NA NA NA NA NA ...
 $ TERMINATED.y   : logi [1:156] NA NA NA NA NA NA ...
 $ DECIMALS.y     : num [1:156] 0 0 0 0 0 0 0 0 0 0 ...
 $ electricity    : num [1:156] 3114893 3396756 3392520 3582204 3511897 ...
 $ resid_sqgdp    : Named num [1:156] 2.36e+08 2.46e+08 2.50e+08 2.51e+08 2.63e+08 ...
  ..- attr(*, "names")= chr [1:156] "1" "2" "3" "4" ...
 $ ln_gdp         : num [1:156] 10.3 10.3 10.4 10.3 10.4 ...
 $ resid_sq_lngdp : Named num [1:156] 1.67e+09 1.80e+09 1.86e+09 1.78e+09 1.91e+09 ...
  ..- attr(*, "names")= chr [1:156] "1" "2" "3" "4" ...
 $ resid_sqelec   : Named num [1:156] 1.80e+09 2.00e+09 1.98e+09 2.18e+09 2.11e+09 ...
  ..- attr(*, "names")= chr [1:156] "1" "2" "3" "4" ...
 $ ln_electricity : num [1:156] 15 15 15 15.1 15.1 ...
 $ resid_sq_lnelec: Named num [1:156] 5.84e+09 6.10e+09 6.07e+09 6.32e+09 6.24e+09 ...
  ..- attr(*, "names")= chr [1:156] "1" "2" "3" "4" ...
 $ resid_sqpop    : Named num [1:156] 4.31e+08 4.27e+08 4.23e+08 4.43e+08 4.40e+08 ...
  ..- attr(*, "names")= chr [1:156] "1" "2" "3" "4" ...
 $ per_capita_gdp : num [1:156] 0.0561 0.0584 0.0598 0.057 0.0597 ...
#Build a new model 
mlr_2 <- lm(CO2 ~ per_capita_gdp + electricity, data = CO2_data)
summary(mlr_2)

Call:
lm(formula = CO2 ~ per_capita_gdp + electricity, data = CO2_data)

Residuals:
   Min     1Q Median     3Q    Max 
-93135 -40891   5090  18764 193850 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -4.714e+04  1.777e+04  -2.653  0.00882 ** 
per_capita_gdp  1.182e+06  2.753e+05   4.292 3.13e-05 ***
electricity     9.986e-03  1.025e-03   9.738  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 62530 on 153 degrees of freedom
Multiple R-squared:  0.3911,    Adjusted R-squared:  0.3831 
F-statistic: 49.13 on 2 and 153 DF,  p-value: < 2.2e-16
#Calculate VIF on the new model
car::vif(mlr_2)
per_capita_gdp    electricity 
      1.070429       1.070429 

This appears to have addressed out problems with multicollinearity in our model.

Adding Other Variables

There is still another variable that we haven’t considered in our model. Let’s take a look at how province affects CO2. Since our these variables are expressed qualitative factors, they need to be represented by dummy variables. R does this automatically when a dummy is used in regression, and excludes the first category (in our case Newfoundland and Labrador) as the reference variable.

slr_4 <- lm(CO2 ~ province, data = CO2_data)
summary(slr_4)

Call:
lm(formula = CO2 ~ province, data = CO2_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-24582.8   -749.2     20.4    954.4  15829.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    10043       1338   7.503 6.06e-12 ***
province2      -8314       1893  -4.392 2.17e-05 ***
province3       8868       1893   4.685 6.46e-06 ***
province4       8672       1893   4.581 9.97e-06 ***
province5      79673       1893  42.090  < 2e-16 ***
province6     159505       1893  84.264  < 2e-16 ***
province7      10799       1893   5.705 6.45e-08 ***
province8      66107       1893  34.923  < 2e-16 ***
province9     266504       1893 140.789  < 2e-16 ***
province10     69417       1893  36.671  < 2e-16 ***
province11     -9444       1893  -4.989 1.74e-06 ***
province12     -8003       1893  -4.228 4.18e-05 ***
province13     -9398       1893  -4.965 1.93e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4637 on 143 degrees of freedom
Multiple R-squared:  0.9969,    Adjusted R-squared:  0.9966 
F-statistic:  3797 on 12 and 143 DF,  p-value: < 2.2e-16
mlr_3 <- lm(CO2 ~ gdp + population + electricity + province, data = CO2_data)
summary(mlr_3)

Call:
lm(formula = CO2 ~ gdp + population + electricity + province, 
    data = CO2_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-17216.7   -971.5     -1.1    948.3  11255.4 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.439e+04  3.391e+03   4.242 4.00e-05 ***
gdp          4.458e-01  4.449e-02  10.020  < 2e-16 ***
population  -4.323e-02  4.336e-03  -9.972  < 2e-16 ***
electricity  1.346e-03  1.099e-03   1.225 0.222741    
province2   -8.716e+03  3.588e+03  -2.429 0.016415 *  
province3    2.858e+04  3.971e+03   7.197 3.42e-11 ***
province4    2.262e+04  3.373e+03   6.707 4.53e-10 ***
province5    2.537e+05  2.021e+04  12.550  < 2e-16 ***
province6    4.336e+05  3.119e+04  13.902  < 2e-16 ***
province7    3.235e+04  2.980e+03  10.854  < 2e-16 ***
province8    7.298e+04  2.958e+03  24.670  < 2e-16 ***
province9    2.913e+05  6.092e+03  47.819  < 2e-16 ***
province10   1.639e+05  1.072e+04  15.299  < 2e-16 ***
province11  -1.330e+04  3.529e+03  -3.769 0.000241 ***
province12  -1.253e+04  3.530e+03  -3.549 0.000527 ***
province13  -1.331e+04  3.551e+03  -3.747 0.000260 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3534 on 140 degrees of freedom
Multiple R-squared:  0.9982,    Adjusted R-squared:  0.998 
F-statistic:  5236 on 15 and 140 DF,  p-value: < 2.2e-16

This model has great explnatory power, but not all coefficients are significant anymore. We’ll remove electricity and see how this affects our model.

mlr_4 <- lm(CO2 ~ gdp + population + province, data = CO2_data)
summary(mlr_4)

Call:
lm(formula = CO2 ~ gdp + population + province, data = CO2_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-17693.9   -930.0     -3.9    845.9  11127.8 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.814e+04  1.455e+03  12.469  < 2e-16 ***
gdp          4.473e-01  4.455e-02  10.040  < 2e-16 ***
population  -4.170e-02  4.158e-03 -10.029  < 2e-16 ***
province2   -1.266e+04  1.589e+03  -7.965 5.01e-13 ***
province3    2.448e+04  2.136e+03  11.460  < 2e-16 ***
province4    1.911e+04  1.779e+03  10.743  < 2e-16 ***
province5    2.597e+05  1.965e+04  13.214  < 2e-16 ***
province6    4.244e+05  3.032e+04  13.996  < 2e-16 ***
province7    3.044e+04  2.543e+03  11.969  < 2e-16 ***
province8    7.002e+04  1.714e+03  40.845  < 2e-16 ***
province9    2.886e+05  5.700e+03  50.635  < 2e-16 ***
province10   1.600e+05  1.023e+04  15.638  < 2e-16 ***
province11  -1.707e+04  1.739e+03  -9.812  < 2e-16 ***
province12  -1.628e+04  1.760e+03  -9.247 3.30e-16 ***
province13  -1.710e+04  1.743e+03  -9.810  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3540 on 141 degrees of freedom
Multiple R-squared:  0.9982,    Adjusted R-squared:  0.998 
F-statistic:  5590 on 14 and 141 DF,  p-value: < 2.2e-16
car::vif(mlr_4)
                 GVIF Df GVIF^(1/(2*Df))
gdp          917.3174  1       30.287249
population  3349.2441  1       57.872654
province   15037.0574 12        1.492961
#GVIF stands for Generalized Variance Inflation Factors, and appears here to show the combined VIF of all the province coefficients. 

Unsurprisingly, this model has high multicollinearity. We expect this because we understand that things like population and gdp vary significantly with province. Recall how the data points were clustered around different centers in the original visualizations we made? If we summarize these mean values in a table, we can clearly see just how variable they are.

a + geom_point()

b + geom_point()

c + geom_point()

#create a summary table organized by province
sumtable(CO2_data, 
      vars = c("gdp", "population", "electricity"),
       summ = c('mean(x)'),
       group = 'province',
       digits = 6,
       out = 'return')
Warning in sumtable(CO2_data, vars = c("gdp", "population", "electricity"), : Factor variables ignore custom summ options. Cols 1 and 2 are count and percentage.
Beware combining factors with a custom summ unless factor.numeric = TRUE.
     Variable    Mean    Mean    Mean    Mean     Mean     Mean    Mean    Mean
1    province       1       2       3       4        5        6       7       8
2         gdp 30867.9 5354.27 35442.1 29648.8   342227   671022 58235.8 76860.4
3  population  525234  147399  948649  762457  8181472 13744892 1289769 1112535
4 electricity 3421574 33549.2  861659 1081799 16943730 12382425 2904070 1948381
     Mean    Mean    Mean    Mean    Mean
1       9      10      11      12      13
2  310673  222483 2452.57 4518.11 2472.95
3 4057539 4751827 37614.6 44185.9 35973.2
4 5789208 5502451 37739.5 58013.7 15103.4

Let’s see what happens when we interact some of the terms.

#Province interacted with population 
mlr_5 <- lm(CO2 ~ gdp + electricity + population*province, data = CO2_data)
summary(mlr_5)

Call:
lm(formula = CO2 ~ gdp + electricity + population * province, 
    data = CO2_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-9332.6  -296.7    -8.1   529.7  8933.2 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)           -5.129e+04  1.035e+05  -0.495 0.621114    
gdp                    4.881e-01  3.171e-02  15.394  < 2e-16 ***
electricity           -2.535e-03  9.255e-04  -2.739 0.007045 ** 
population             1.046e-01  1.975e-01   0.529 0.597388    
province2              5.470e+04  1.050e+05   0.521 0.603122    
province3              2.144e+05  1.166e+05   1.839 0.068247 .  
province4              1.875e+05  1.195e+05   1.569 0.119058    
province5              3.623e+05  1.074e+05   3.372 0.000986 ***
province6              6.204e+05  1.065e+05   5.825 4.37e-08 ***
province7              6.796e+04  1.049e+05   0.648 0.518420    
province8              1.052e+05  1.049e+05   1.002 0.318095    
province9              2.318e+05  1.040e+05   2.227 0.027669 *  
province10             2.066e+05  1.047e+05   1.973 0.050654 .  
province11             5.186e+04  1.041e+05   0.498 0.619027    
province12             4.885e+04  1.116e+05   0.438 0.662446    
province13             5.202e+04  1.043e+05   0.499 0.618850    
population:province2  -1.332e-01  2.299e-01  -0.579 0.563488    
population:province3  -2.725e-01  2.056e-01  -1.326 0.187260    
population:province4  -2.741e-01  2.123e-01  -1.291 0.199056    
population:province5  -1.468e-01  1.973e-01  -0.744 0.458264    
population:province6  -1.552e-01  1.975e-01  -0.786 0.433275    
population:province7  -1.177e-01  1.979e-01  -0.595 0.553035    
population:province8  -1.139e-01  1.980e-01  -0.575 0.566212    
population:province9  -1.147e-01  1.973e-01  -0.581 0.562212    
population:province10 -1.405e-01  1.975e-01  -0.711 0.478172    
population:province11 -1.333e-01  3.432e-01  -0.388 0.698415    
population:province12 -4.976e-02  9.665e-01  -0.051 0.959023    
population:province13 -1.395e-01  4.091e-01  -0.341 0.733628    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2498 on 128 degrees of freedom
Multiple R-squared:  0.9992,    Adjusted R-squared:  0.999 
F-statistic:  5826 on 27 and 128 DF,  p-value: < 2.2e-16
#Province interacted with gdp
mlr_6 <- lm(CO2 ~ population + electricity + gdp*province, data = CO2_data)
summary(mlr_6)

Call:
lm(formula = CO2 ~ population + electricity + gdp * province, 
    data = CO2_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-8326.6  -381.2     2.8   458.3  5158.7 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     1.005e+04  2.093e+04   0.480  0.63180    
population     -3.178e-02  2.460e-03 -12.919  < 2e-16 ***
electricity    -1.549e-03  6.764e-04  -2.290  0.02364 *  
gdp             7.122e-01  6.782e-01   1.050  0.29570    
province2      -5.698e+03  2.241e+04  -0.254  0.79969    
province3       8.739e+04  2.622e+04   3.333  0.00112 ** 
province4       6.001e+04  3.141e+04   1.911  0.05825 .  
province5       2.466e+05  2.484e+04   9.926  < 2e-16 ***
province6       4.385e+05  2.673e+04  16.408  < 2e-16 ***
province7       1.409e+04  2.242e+04   0.629  0.53066    
province8       3.856e+04  2.241e+04   1.721  0.08774 .  
province9       1.784e+05  2.224e+04   8.019 5.81e-13 ***
province10      1.323e+05  2.207e+04   5.994 1.95e-08 ***
province11     -9.304e+03  2.287e+04  -0.407  0.68480    
province12     -7.810e+03  2.414e+04  -0.323  0.74686    
province13     -9.361e+03  2.115e+04  -0.443  0.65876    
gdp:province2  -3.183e-01  1.638e+00  -0.194  0.84627    
gdp:province3  -2.040e+00  8.140e-01  -2.506  0.01348 *  
gdp:province4  -1.570e+00  1.041e+00  -1.508  0.13400    
gdp:province5  -3.633e-01  6.783e-01  -0.536  0.59317    
gdp:province6  -4.483e-01  6.787e-01  -0.661  0.51007    
gdp:province7   1.222e-02  6.931e-01   0.018  0.98596    
gdp:province8   1.455e-01  6.861e-01   0.212  0.83243    
gdp:province9   1.550e-02  6.784e-01   0.023  0.98181    
gdp:province10 -2.776e-01  6.792e-01  -0.409  0.68344    
gdp:province11 -2.626e-01  3.809e+00  -0.069  0.94513    
gdp:province12 -4.268e-01  2.746e+00  -0.155  0.87673    
gdp:province13 -2.600e-01  1.383e+00  -0.188  0.85118    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1869 on 128 degrees of freedom
Multiple R-squared:  0.9995,    Adjusted R-squared:  0.9994 
F-statistic: 1.042e+04 on 27 and 128 DF,  p-value: < 2.2e-16
mlr_7 <- lm(CO2 ~ population + electricity + gdp:province, data = CO2_data)
summary(mlr_7)

Call:
lm(formula = CO2 ~ population + electricity + gdp:province, data = CO2_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-31549.3  -1230.8    119.6   1433.9  12645.2 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     1.940e+04  5.781e+03   3.356 0.001017 ** 
population      2.479e-02  3.008e-03   8.242 1.10e-13 ***
electricity    -4.542e-03  1.686e-03  -2.694 0.007929 ** 
gdp:province1  -2.209e-01  2.320e-01  -0.952 0.342657    
gdp:province2  -3.938e+00  1.079e+00  -3.650 0.000369 ***
gdp:province3  -5.686e-01  1.494e-01  -3.806 0.000211 ***
gdp:province4  -4.955e-01  1.803e-01  -2.748 0.006780 ** 
gdp:province5  -1.623e-01  6.453e-02  -2.515 0.013043 *  
gdp:province6  -2.007e-01  4.583e-02  -4.380 2.31e-05 ***
gdp:province7  -2.966e-01  9.859e-02  -3.009 0.003109 ** 
gdp:province8   4.945e-01  6.950e-02   7.115 5.32e-11 ***
gdp:province9   5.870e-01  2.874e-02  20.424  < 2e-16 ***
gdp:province10 -1.475e-01  4.504e-02  -3.275 0.001330 ** 
gdp:province11 -7.950e+00  2.400e+00  -3.312 0.001178 ** 
gdp:province12 -4.018e+00  1.303e+00  -3.083 0.002466 ** 
gdp:province13 -7.660e+00  2.316e+00  -3.307 0.001198 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4859 on 140 degrees of freedom
Multiple R-squared:  0.9966,    Adjusted R-squared:  0.9963 
F-statistic:  2765 on 15 and 140 DF,  p-value: < 2.2e-16
vif(mlr_7)
there are higher-order terms (interactions) in this model
consider setting type = 'predictor'; see ?vif
                   GVIF Df GVIF^(1/(2*Df))
population     930.5428  1       30.504800
electricity    479.2521  1       21.891827
gdp:province 62227.6486 13        1.528917

In the final model the relationships appear to be significant. We would interpret the coefficients of the interacted terms as the combined effect of gdp and province- or how the effect of gdp on CO2 emissions changes by province. However this model is still severely affected by multicollinearity.

Now that we have a few different models, let’s compare some of our results.

stargazer(mlr_1, mlr_2, mlr_3, mlr_4, mlr_5, mlr_6, mlr_7, title="Comparison of Muliple Regression Results",
          align = TRUE, type="text", keep.stat = c("n","rsq"))

Comparison of Muliple Regression Results
=============================================================================================================================
                                                                Dependent variable:                                          
                      -------------------------------------------------------------------------------------------------------
                                                                        CO2                                                  
                          (1)            (2)             (3)            (4)            (5)            (6)            (7)     
-----------------------------------------------------------------------------------------------------------------------------
gdp                     1.601***                       0.446***       0.447***       0.488***        0.712                   
                        (0.052)                        (0.044)        (0.045)        (0.032)        (0.678)                  
                                                                                                                             
population             -0.072***                      -0.043***      -0.042***        0.105        -0.032***      0.025***   
                        (0.003)                        (0.004)        (0.004)        (0.198)        (0.002)        (0.003)   
                                                                                                                             
per_capita_gdp                     1,181,707.000***                                                                          
                                    (275,328.000)                                                                            
                                                                                                                             
electricity             0.007***       0.010***         0.001                       -0.003***       -0.002**      -0.005***  
                        (0.001)        (0.001)         (0.001)                       (0.001)        (0.001)        (0.002)   
                                                                                                                             
province2                                            -8,715.716**  -12,657.660***   54,701.130     -5,698.181                
                                                     (3,588.408)    (1,589.254)   (104,950.400)   (22,409.260)               
                                                                                                                             
province3                                           28,580.740***  24,477.550***   214,407.000*  87,391.290***               
                                                     (3,971.206)    (2,135.841)   (116,594.900)   (26,219.690)               
                                                                                                                             
province4                                           22,621.500***  19,109.350***   187,504.800    60,012.200*                
                                                     (3,372.956)    (1,778.792)   (119,486.700)   (31,405.210)               
                                                                                                                             
province5                                           253,686.500*** 259,660.400*** 362,302.500*** 246,554.500***              
                                                     (20,213.450)   (19,650.820)  (107,433.500)   (24,838.720)               
                                                                                                                             
province6                                           433,624.300*** 424,413.600*** 620,438.800*** 438,499.700***              
                                                     (31,190.820)   (30,324.170)  (106,520.500)   (26,725.360)               
                                                                                                                             
province7                                           32,350.330***  30,438.110***    67,959.810     14,094.200                
                                                     (2,980.379)    (2,543.155)   (104,945.200)   (22,417.830)               
                                                                                                                             
province8                                           72,979.970***  70,024.700***   105,164.300    38,556.060*                
                                                     (2,958.300)    (1,714.404)   (104,924.000)   (22,408.380)               
                                                                                                                             
province9                                           291,305.800*** 288,642.100*** 231,758.100**  178,357.500***              
                                                     (6,091.873)    (5,700.458)   (104,049.400)   (22,241.170)               
                                                                                                                             
province10                                          163,933.100*** 159,952.700***  206,555.500*  132,276.200***              
                                                     (10,715.320)   (10,228.620)  (104,692.900)   (22,068.520)               
                                                                                                                             
province11                                          -13,303.480*** -17,066.800***   51,864.580     -9,303.639                
                                                     (3,529.484)    (1,739.438)   (104,053.200)   (22,867.460)               
                                                                                                                             
province12                                          -12,527.140*** -16,276.430***   48,846.790     -7,809.713                
                                                     (3,529.727)    (1,760.098)   (111,635.700)   (24,142.670)               
                                                                                                                             
province13                                          -13,307.800*** -17,099.110***   52,020.870     -9,361.298                
                                                     (3,551.189)    (1,743.108)   (104,313.900)   (21,147.610)               
                                                                                                                             
population:province2                                                                  -0.133                                 
                                                                                     (0.230)                                 
                                                                                                                             
population:province3                                                                  -0.273                                 
                                                                                     (0.206)                                 
                                                                                                                             
population:province4                                                                  -0.274                                 
                                                                                     (0.212)                                 
                                                                                                                             
population:province5                                                                  -0.147                                 
                                                                                     (0.197)                                 
                                                                                                                             
population:province6                                                                  -0.155                                 
                                                                                     (0.197)                                 
                                                                                                                             
population:province7                                                                  -0.118                                 
                                                                                     (0.198)                                 
                                                                                                                             
population:province8                                                                  -0.114                                 
                                                                                     (0.198)                                 
                                                                                                                             
population:province9                                                                  -0.115                                 
                                                                                     (0.197)                                 
                                                                                                                             
population:province10                                                                 -0.140                                 
                                                                                     (0.197)                                 
                                                                                                                             
population:province11                                                                 -0.133                                 
                                                                                     (0.343)                                 
                                                                                                                             
population:province12                                                                 -0.050                                 
                                                                                     (0.966)                                 
                                                                                                                             
population:province13                                                                 -0.140                                 
                                                                                     (0.409)                                 
                                                                                                                             
gdp:province1                                                                                                      -0.221    
                                                                                                                   (0.232)   
                                                                                                                             
gdp:province2                                                                                        -0.318       -3.938***  
                                                                                                    (1.638)        (1.079)   
                                                                                                                             
gdp:province3                                                                                       -2.040**      -0.569***  
                                                                                                    (0.814)        (0.149)   
                                                                                                                             
gdp:province4                                                                                        -1.570       -0.495***  
                                                                                                    (1.041)        (0.180)   
                                                                                                                             
gdp:province5                                                                                        -0.363       -0.162**   
                                                                                                    (0.678)        (0.065)   
                                                                                                                             
gdp:province6                                                                                        -0.448       -0.201***  
                                                                                                    (0.679)        (0.046)   
                                                                                                                             
gdp:province7                                                                                        0.012        -0.297***  
                                                                                                    (0.693)        (0.099)   
                                                                                                                             
gdp:province8                                                                                        0.145        0.494***   
                                                                                                    (0.686)        (0.069)   
                                                                                                                             
gdp:province9                                                                                        0.015        0.587***   
                                                                                                    (0.678)        (0.029)   
                                                                                                                             
gdp:province10                                                                                       -0.278       -0.148***  
                                                                                                    (0.679)        (0.045)   
                                                                                                                             
gdp:province11                                                                                       -0.263       -7.950***  
                                                                                                    (3.809)        (2.400)   
                                                                                                                             
gdp:province12                                                                                       -0.427       -4.018***  
                                                                                                    (2.746)        (1.303)   
                                                                                                                             
gdp:province13                                                                                       -0.260       -7.660***  
                                                                                                    (1.383)        (2.316)   
                                                                                                                             
Constant              7,518.300***  -47,138.540***  14,385.140***  18,137.770***   -51,288.520     10,054.610   19,400.810***
                      (2,375.031)    (17,769.030)    (3,390.769)    (1,454.672)   (103,513.700)   (20,931.750)   (5,780.503) 
                                                                                                                             
-----------------------------------------------------------------------------------------------------------------------------
Observations              156            156             156            156            156            156            156     
R2                       0.919          0.391           0.998          0.998          0.999          1.000          0.997    
=============================================================================================================================
Note:                                                                                             *p<0.1; **p<0.05; ***p<0.01

From the results of our simple regression, we know that there is a strong relationship between between province and [\(CO_2\)] emissions. But to choose the best model based on the data we have, we need to consider the models holistically, paying attention to the common issues that arise in regression, and not just the explanatory power alone.

Summary:

In conclusion, we note that CO2 Emissions in the Canadian Economy are severely impacted by the variables of our consideration: GDP, Electricity Generation, and Population. It is also important to note that we ought to explore our data by running multiple regressions, and clean it appropriately before running our analysis. This exploratory practice may lead us relationships that we expect, or something completely contrasting. Therefore, use this as a sample for your own project, and keep researching!

Exercises:

The below exercises are intended to help you check your conceptual understanding of the content.

Exercise 1:

What is an advantage of a multiple regression model over a simple regression model? Pick the best answer.

A: Multiple regression models often exhibit multicollinearity, which gives them better explanatory power compared to simple regressions.

B: Multiple regression models improve the predictive properties of a model by adding multiple regressors that play a role in the relationship.

C: Multiple regression models can display complicated relationships, but simple regression models are too simple to ever be useful.

D: By having multiple variables, multiple regressions allow us to see the interactions between different variables in a relationship.

answer_1 <- '...' # your answer here ('A', 'B', 'C', or 'D')

test_1()

Exercise 2:

Which of the below would not be a way of resolving heteroskedasticity within a regression model?

A: Performing a log transformation on the outcome variable

B: Using the appropriate code to generate HC1 standard errors

C: Searching for a different dataset to use

D: Adding more explanatory variables to the model (so long as all of the individual relationships remain significant)

answer_2 <- '...' # your answer here ('A', 'B', 'C', or 'D')

test_2()

Exercise 3:

What does a variance inflation factor of a variable tell us?

A: The magnitude of the heteroskedasticity (variance of error terms) in the data for a given variable.

B: The extent to which the variability of a dependent variable is inflated by multicollinearity in the model.

C: Whether or not there is multicollinearity in our model.

D: The extent to which the model is inflated by ommitted variable bias.

answer_3 <- '...' # your answer here ('A', 'B', 'C', or 'D')

test_3()

Exercise 4:

In what situation would it be appropriate to incorporate an interaction into your model? Pick the best answer.

A: There is reason to believe that the combined effect of two or more variables has an impact on the outcome variable.

B: There is reason to believe that the combined effect of two or more continuous variables has an impact on the outcome variable.

C: Your model is displaying multicollinearity with two variables.

D: Your model is boring, and you want to make it more interesting.

answer_4 <- '...' # your answer here ('A', 'B', 'C', or 'D')

test_4()
  • Creative Commons License. See details.
 
  • Report an issue
  • The COMET Project and the UBC Vancouver School of Economics are located on the traditional, ancestral and unceded territory of the xʷməθkʷəy̓əm (Musqueam) and Sḵwx̱wú7mesh (Squamish) peoples.