Intermediate - Summary Statistics

econ 326

econ 490

data

summary statistics

regression

analysis

control variables

intermediate

An introduction how to create summary statistics tables in R using the gtsummary package and gives examples of how to create and present effective summary statistics in a research project.

Author

COMET Team
Irene Berezin, Alex Ronczewski

Published

1 March 2025

Outline

Prerequisites

Intermediate knowledge of R and Basic Jupyter Skills
An understanding of Tables and Basic Statistics
Knowledge of Regressions

Outcomes

By the end of this notebook, you will be able to:

Install and Use the gtsummary Package
Use gtsummary to create basic summary tables
Use gtsummary to create stratified tables
Use gtsummary to customize the tables (Headers, Footnotes, colors and sizes)
Use gtsummary to create tables for regression output
Use gtsummary to Create tables for Projects in classes like Econ 326 or Econ 490

Part 1: Using `gtsummary`

Firstly let’s introduce gtsummary: gtsummary is an R package designed to create publication-ready analytical and summary tables with ease and flexibility. It offers a powerful set of tools for data scientists, statisticians, and researchers working with R to summarize datasets, regression models, and more. It is one of the best packages at this task and is a useful tool to use for projects, research or other statistical uses.

Let’s learn how to use it!

Running the next code block will install necessary packages, you may be prompted to choose a CRAN mirror, just choose the closest geographical one, example for Vancouver choose US (OR). This will be all of the packages needed for the notebook.

# Uncomment and run this cell to install necessary packages

# install.packages("broom.helpers")
# install.packages("jsonlite")
# install.packages("NHANES")
# install.packages("gtsummary") #This is our key gtsummary package which is the topic of this notebook

This will run the packages so we can use them for the code below.

library(jsonlite)

Warning: package 'jsonlite' was built under R version 4.4.2

library(gt)

Warning: package 'gt' was built under R version 4.4.3

library(gtsummary) #This is our key gtsummary package which is the topic of this notebook

Warning: package 'gtsummary' was built under R version 4.4.3

library(dplyr)

Warning: package 'dplyr' was built under R version 4.4.3


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.4.3

Warning: package 'tibble' was built under R version 4.4.2

Warning: package 'tidyr' was built under R version 4.4.2

Warning: package 'readr' was built under R version 4.4.2

Warning: package 'purrr' was built under R version 4.4.3

Warning: package 'stringr' was built under R version 4.4.2

Warning: package 'forcats' was built under R version 4.4.2

Warning: package 'lubridate' was built under R version 4.4.2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.5
✔ ggplot2   3.5.1     ✔ stringr   1.5.1
✔ lubridate 1.9.4     ✔ tibble    3.2.1
✔ purrr     1.0.4     ✔ tidyr     1.3.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()  masks stats::filter()
✖ purrr::flatten() masks jsonlite::flatten()
✖ dplyr::lag()     masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(NHANES)

Warning: package 'NHANES' was built under R version 4.4.3

This is our dataset which we will be using for the first section of this notebook. We have chosen this dataset as it is available within R as a package and does not have to be installed separately. The NHANES dataset is a subset of data from the National Health and Nutrition Examination Survey, a comprehensive study designed to assess the health and nutritional status of adults and children in the United States.

Please run the cell below and see if the head command correctly displays the data.

#Check if the data is properly loaded
data(NHANES)
head(NHANES, 20)

# A tibble: 20 × 76
      ID SurveyYr Gender   Age AgeDecade AgeMonths Race1   Race3 Education     
   <int> <fct>    <fct>  <int> <fct>         <int> <fct>   <fct> <fct>         
 1 51624 2009_10  male      34 " 30-39"        409 White   <NA>  High School   
 2 51624 2009_10  male      34 " 30-39"        409 White   <NA>  High School   
 3 51624 2009_10  male      34 " 30-39"        409 White   <NA>  High School   
 4 51625 2009_10  male       4 " 0-9"           49 Other   <NA>  <NA>          
 5 51630 2009_10  female    49 " 40-49"        596 White   <NA>  Some College  
 6 51638 2009_10  male       9 " 0-9"          115 White   <NA>  <NA>          
 7 51646 2009_10  male       8 " 0-9"          101 White   <NA>  <NA>          
 8 51647 2009_10  female    45 " 40-49"        541 White   <NA>  College Grad  
 9 51647 2009_10  female    45 " 40-49"        541 White   <NA>  College Grad  
10 51647 2009_10  female    45 " 40-49"        541 White   <NA>  College Grad  
11 51654 2009_10  male      66 " 60-69"        795 White   <NA>  Some College  
12 51656 2009_10  male      58 " 50-59"        707 White   <NA>  College Grad  
13 51657 2009_10  male      54 " 50-59"        654 White   <NA>  9 - 11th Grade
14 51659 2009_10  female    10 " 10-19"        123 White   <NA>  <NA>          
15 51666 2009_10  female    58 " 50-59"        700 Mexican <NA>  High School   
16 51667 2009_10  male      50 " 50-59"        603 White   <NA>  Some College  
17 51671 2009_10  female     9 " 0-9"          112 Black   <NA>  <NA>          
18 51677 2009_10  male      33 " 30-39"        404 White   <NA>  High School   
19 51678 2009_10  male      60 " 60-69"        721 White   <NA>  High School   
20 51679 2009_10  male      16 " 10-19"        194 Other   <NA>  <NA>          
# ℹ 67 more variables: MaritalStatus <fct>, HHIncome <fct>, HHIncomeMid <int>,
#   Poverty <dbl>, HomeRooms <int>, HomeOwn <fct>, Work <fct>, Weight <dbl>,
#   Length <dbl>, HeadCirc <dbl>, Height <dbl>, BMI <dbl>,
#   BMICatUnder20yrs <fct>, BMI_WHO <fct>, Pulse <int>, BPSysAve <int>,
#   BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>, BPSys2 <int>, BPDia2 <int>,
#   BPSys3 <int>, BPDia3 <int>, Testosterone <dbl>, DirectChol <dbl>,
#   TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>, UrineVol2 <int>, …

Summary Tables

Here we will make a basic summary table with the variables Age, BMI, Weight, Race1, Education, Gender which will show an overview of the dataset. Along with this we will have first mean and then in brackets standard deviation and for specific variables percent of n within each category.

# Load the NHANES data
data(NHANES)

# Create a summary table for selected variables with our NHANES dataset
summary_table <- NHANES %>%
  select(Age, BMI, Weight, Race1, Education, Gender) %>% #Here we select the variables we want on our summary table
  tbl_summary(
    statistic = list(
      all_continuous() ~ "{mean} ({sd})", #all continuous variables will have mean and standard deviation 
      all_categorical() ~ "{n} ({p}%)" #all categorical variables will have n and p included
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) %>%
  modify_header(label = "**Variable**") %>% #Using  modify_header we can change the label
  bold_labels()

# Print the summary table
summary_table

Variable	N = 10,000¹
Age	36.7 (22.4)
BMI	26.7 (7.4)
Weight	71.0 (29.1)
Race1
Black	1,197 (12%)
Hispanic	610 (6.1%)
Mexican	1,015 (10%)
White	6,372 (64%)
Other	806 (8.1%)
Education
8th Grade	451 (6.2%)
9 - 11th Grade	888 (12%)
High School	1,517 (21%)
Some College	2,267 (31%)
College Grad	2,098 (29%)
Gender
female	5,020 (50%)
male	4,980 (50%)
¹ Mean (SD); n (%)

To show a more complex table we can also use the by command to sort by if the observation has diabetes. This is called stratifying by group variables. Here the key function is by where by setting it to the variable Diabetes we can show the summary statistics for three different groups Overall, No Diabetes and Diagnosed with Diabetes.

We can also use the command add_ to add certain statistics to our table for instance here we will use add_n() to show how many people in the dataset have diabetes. We can use add_ for different things apart from number of observations for instance add_p will show p values. More information on what can be added can be found at tbl_summary() tutorial under the header “{gtsummary} functions to add information.”

#Same as before make a table variable and select variables in the dataset using select(...)
summary_table2 <- NHANES %>%
  select(Age, BMI, Weight, Race1, Gender, Diabetes) %>%
  tbl_summary(
    by = Diabetes, #Here we can stratify our table by people who have diabetes. 
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n}"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) %>%
  add_overall() %>%
  add_n() %>% #Add Number of Observations
  modify_header(label = "**Variable**") %>%
  bold_labels() %>%
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Diabetes Status**") #modify_spanning_header is the overall title of our table

142 missing rows in the "Diabetes" column have been removed.

#Print the table
summary_table2

Variable	N	Overall N = 9,858¹	Diabetes Status
Variable	N	Overall N = 9,858¹	No N = 9,098¹	Yes N = 760¹
Age	9,858	37.3 (22.1)	35.4 (21.6)	59.2 (14.7)
BMI	9,629	26.7 (7.4)	26.2 (7.1)	32.6 (8.2)
Weight	9,780	71.9 (28.3)	70.2 (28.0)	91.6 (25.3)
Race1	9,858
Black		1,184	1,053	131
Hispanic		602	555	47
Mexican		991	925	66
White		6,290	5,840	450
Other		791	725	66
Gender	9,858
female		4,949	4,592	357
male		4,909	4,506	403
¹ Mean (SD); n

We can also make custom headers and footnotes using the command modify_header, modify_footnoteand modify_caption. This is shown in the codeblock below where we will create a table with a custom header, title and footnotes.

#Same as before but we to focus our code we will start from our output of the stratified table from summary_table2. 
summary_table3 <- summary_table2 %>%
  modify_header( #This command will modify the header in the table
    stat_0 ~ "**All Observations**", 
    stat_1 ~ "**No Diabetes**", 
    stat_2 ~ "**Diagnosed with Diabetes**"
  ) %>%
  modify_footnote( 
    all_stat_cols() ~ "Median (IQR) for continuous variables; n (%) for categorical variables" #Here we change the footnote with the command modify_footnote
  ) %>%
  modify_caption("**Table 1. Demographic and Clinical Characteristics**") #And here we change the overall title

# Print the table 
summary_table3

**Table 1. Demographic and Clinical Characteristics**
Variable	N	All Observations¹	Diabetes Status
Variable	N	All Observations¹	No Diabetes¹	Diagnosed with Diabetes¹
Age	9,858	37.3 (22.1)	35.4 (21.6)	59.2 (14.7)
BMI	9,629	26.7 (7.4)	26.2 (7.1)	32.6 (8.2)
Weight	9,780	71.9 (28.3)	70.2 (28.0)	91.6 (25.3)
Race1	9,858
Black		1,184	1,053	131
Hispanic		602	555	47
Mexican		991	925	66
White		6,290	5,840	450
Other		791	725	66
Gender	9,858
female		4,949	4,592	357
male		4,909	4,506	403
¹ Median (IQR) for continuous variables; n (%) for categorical variables

We can also change stylistic choices, this is the same table as before information wise, but it has been substantially changed visually. gtsummary offers many options to change the look of our tables. Below is an example where we will drastically change the look of the table using the commands tab_options and tab_style, you are free to experiment and change the code here to see how the colors, borders and labels can change.

summary_table4 <- NHANES %>% #create the variable 
  select(Age, BMI, Weight, Race1, Gender, Diabetes) %>% #select variables 
  tbl_summary(
    by = Diabetes, #use the by command to stratify the table 
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n}"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) %>%
  add_overall() %>%
  add_n() %>%
  modify_header(label = "**Variable**") %>%
  bold_labels() %>%
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Diabetes Status**") %>%
  modify_header(
    stat_0 ~ "**All Observations**", 
    stat_1 ~ "**No Diabetes**", 
    stat_2 ~ "**Diabetes**"
  ) %>%
  modify_footnote( #change footnote label 
    all_stat_cols() ~ "Median (IQR) for continuous variables; n (%) for categorical variables"
  ) %>%
  modify_caption("**Table 1. Demographic and Clinical Characteristics**") %>%
  as_gt() %>%
  tab_options( #Change tab options like font, font size, header styles 
    table.font.name = "Arial",
    table.font.size = 12,
    heading.title.font.size = 16,
    column_labels.font.weight = "bold",
    table.border.top.style = "solid",
    table.border.bottom.style = "solid",
    column_labels.border.top.style = "solid",
    column_labels.border.bottom.style = "solid",
    data_row.padding = px(10),
    source_notes.font.size = 10,
    table.width = pct(100)
  ) %>%
  tab_style( #Change colors of the cells 
    style = list(
      cell_fill(color = "lightblue"),
      cell_text(weight = "bold")
    ),
    locations = cells_column_labels()
  ) %>%
  tab_style( 
    style = cell_text(color = "navy"),
    locations = cells_body(
      columns = vars(stat_0)
    )
  ) %>%
  tab_style( #Change cell border colors 
    style = cell_borders(
      sides = "right",
      color = "grey",
      weight = px(1)
    ),
    locations = cells_body(
      columns = everything()
    )
  )

142 missing rows in the "Diabetes" column have been removed.

Warning: Since gt v0.3.0, `columns = vars(...)` has been deprecated.
• Please use `columns = c(...)` instead.

# Print the table 
summary_table4

**Table 1. Demographic and Clinical Characteristics**
Variable	N	All Observations¹	Diabetes Status
Variable	N	All Observations¹	No Diabetes¹	Diabetes¹
Age	9,858	37.3 (22.1)	35.4 (21.6)	59.2 (14.7)
BMI	9,629	26.7 (7.4)	26.2 (7.1)	32.6 (8.2)
Weight	9,780	71.9 (28.3)	70.2 (28.0)	91.6 (25.3)
Race1	9,858
Black		1,184	1,053	131
Hispanic		602	555	47
Mexican		991	925	66
White		6,290	5,840	450
Other		791	725	66
Gender	9,858
female		4,949	4,592	357
male		4,909	4,506	403
¹ Median (IQR) for continuous variables; n (%) for categorical variables

Regression Tables

Apart from summary tables gtsummary can also be used for regression analysis. Here we will continue to use the NHANES dataset, and the regression formula we will use is:

\[ \texttt{TotChol}_i = \beta_0 + \beta_1 \cdot \texttt{Age}_i + \beta_2 \cdot \texttt{BMI}_i + \beta_3 \cdot \texttt{Alcohol12PlusYr}_i + \beta_4 \cdot \texttt{PhysActive}_i + \beta_5 \cdot \texttt{SleepHrsNight}_i + \varepsilon_i \]

Here we will make a simple table showing our regression results, with our beta coefficients, 95% confidence intervals, and \(p\)-values.

# Prepare the data
nhanes_data <- NHANES %>%
  select(TotChol, Age, BMI, Alcohol12PlusYr, PhysActive, SleepHrsNight) %>% #Select variables like above
  na.omit()  # Remove rows with missing values

# Our regression model from above
model <- lm(TotChol ~ Age + BMI + Alcohol12PlusYr + PhysActive + SleepHrsNight, data = nhanes_data)

# Create a summary table using gtsummary
tbl_summary <- tbl_regression(model, 
               label = list( #Set labels for our variables 
                 Age ~ "Age (years)",
                 BMI ~ "Body Mass Index",
                 Alcohol12PlusYr ~ "Alcohol consumption (12+ drinks/year)",
                 PhysActive ~ "Physically active",
                 SleepHrsNight ~ "Sleep hours per night"
               )) %>%
  add_global_p() %>% #Add a p-value
  bold_p(t = 0.05) %>%
  bold_labels() %>%
  modify_header(label = "**Variable**") %>% #Change headers like for summary tables
  modify_spanning_header(c("estimate", "conf.low", "conf.high") ~ "**Coefficient (95% CI)**") %>% #Spaning header
  modify_caption("**Table 1. Linear Regression Model Predicting Total Cholesterol**") #Changes the caption

# Print the table
tbl_summary

**Table 1. Linear Regression Model Predicting Total Cholesterol**
Variable	Coefficient (95% CI)		p-value
Variable	Beta	95% CI	p-value
Age (years)	0.01	0.01, 0.01	<0.001
Body Mass Index	0.00	0.00, 0.00	0.7
Alcohol consumption (12+ drinks/year)			<0.001
No	—	—
Yes	0.15	0.08, 0.21
Physically active			>0.9
No	—	—
Yes	0.00	-0.06, 0.05
Sleep hours per night	0.00	-0.02, 0.02	0.8
Abbreviation: CI = Confidence Interval

Here we see that the regression table is good, but it can be improved, for instance we can change the number of significant figures on the coefficient values currently it is three and we can not clearly see the result of our regression analysis. Let’s make it 5 for the Beta’s and coefficient values. Along with this we are also missing Standard Errors and t-values which can show a more complete analysis. We will add them using the code below.

# Prepare the data for our model
nhanes_data <- NHANES %>%
  select(TotChol, Age, BMI, Alcohol12PlusYr, PhysActive, SleepHrsNight) %>% #Select Variables 
  na.omit()  # Remove rows with missing values

# Regression Model like above
model <- lm(TotChol ~ Age + BMI + Alcohol12PlusYr + PhysActive + SleepHrsNight, data = nhanes_data)

# GT table
tbl_summary2 <- tbl_regression(model, 
               label = list( #Labels for our variables 
                 Age ~ "Age (years)",
                 BMI ~ "Body Mass Index",
                 Alcohol12PlusYr ~ "Alcohol consumption (12+ drinks/year)",
                 PhysActive ~ "Physically active",
                 SleepHrsNight ~ "Sleep hours per night"
               )) %>%
  modify_fmt_fun(estimate ~ function(x) style_sigfig(x, digits = 5)) %>%
  modify_fmt_fun(c(conf.low, conf.high) ~ function(x) style_sigfig(x, digits = 5)) %>%
  modify_column_unhide(columns = c(statistic, std.error)) %>% #Here we add standard Errors
  modify_header(statistic ~ "**t-value**", std.error ~ "**SE**") %>% #Here we add t-value and SE
  modify_fmt_fun(statistic ~ function(x) style_sigfig(x, digits = 5)) %>%
  modify_fmt_fun(std.error ~ function(x) style_sigfig(x, digits = 5)) %>%
  add_global_p() %>% #P-values
  bold_p(t = 0.05) %>%
  bold_labels() %>%
  modify_header(label = "**Variable**") %>% #Like above modify labels for header
  modify_spanning_header(c("estimate", "conf.low", "conf.high") ~ "**Coefficients and Findings**") %>% #Like above modify labels for spanning header 
  modify_caption("**Table 1. Linear Regression Model Predicting Total Cholesterol**") #Like above modify labels for caption

# Print our new table
tbl_summary2

**Table 1. Linear Regression Model Predicting Total Cholesterol**
Variable	Coefficients and Findings	SE	t-value	Coefficients and Findings	p-value
Variable	Beta	SE	t-value	95% CI	p-value
Age (years)	0.00933	0.00080	11.703	0.00777, 0.01089	<0.001
Body Mass Index	0.00079	0.00204	0.38607	-0.00321, 0.00479	0.7
Alcohol consumption (12+ drinks/year)					<0.001
No	—	—	—	—
Yes	0.14708	0.03344	4.3981	0.08152, 0.21263
Physically active					>0.9
No	—	—	—	—
Yes	-0.00160	0.02781	-0.05748	-0.05611, 0.05291
Sleep hours per night	0.00202	0.01000	0.20159	-0.01759, 0.02162	0.8
Abbreviations: CI = Confidence Interval, SE = Standard Error

Just like above we can do all kinds of cosmetic changes to our new regression summary tables; for instance we can change colors or fonts. Once again feel free to change fonts or colors and see how the end result changes.

# Prepare the data
nhanes_data <- NHANES %>%
  select(TotChol, Age, BMI, Alcohol12PlusYr, PhysActive, SleepHrsNight) %>%
  na.omit()  # Remove rows with missing values

# Regression Model
model <- lm(TotChol ~ Age + BMI + Alcohol12PlusYr + PhysActive + SleepHrsNight, data = nhanes_data)

# Create and modify the gtsummary table
tbl_summary3 <- tbl_regression(model,
               label = list( #Labels for our varaibles 
                 Age ~ "Age (years)",
                 BMI ~ "Body Mass Index",
                 Alcohol12PlusYr ~ "Alcohol consumption (12+ drinks/year)",
                 PhysActive ~ "Physically active",
                 SleepHrsNight ~ "Sleep hours per night"
               )) %>%
  modify_fmt_fun(estimate ~ function(x) style_sigfig(x, digits = 5)) %>%
  modify_fmt_fun(c(conf.low, conf.high) ~ function(x) style_sigfig(x, digits = 5)) %>%
  modify_column_unhide(columns = c(statistic, std.error)) %>%
  modify_header( #Headers
    label ~ "**Variable**",
    estimate ~ "**Beta**",
    std.error ~ "**SE**",
    statistic ~ "**t-value**",
    p.value ~ "**p-value**"
  ) %>%
  modify_fmt_fun(statistic ~ function(x) style_sigfig(x, digits = 5)) %>%
  modify_fmt_fun(std.error ~ function(x) style_sigfig(x, digits = 5)) %>%
  add_global_p() %>%
  bold_p(t = 0.05) %>%
  bold_labels() %>%
  modify_spanning_header(c(estimate, conf.low, conf.high) ~ "**Coefficients and Findings**") %>% #Spanning Header
  modify_caption("**Table 1. Linear Regression Model Predicting Total Cholesterol**") #Caption of Table

# Convert to gt object and apply gt-specific cosmetic modifications
final_table <- tbl_summary3 %>%
  as_gt() %>% #as_gt allows us to make more customizations
  opt_stylize(style = 6, color = "blue") %>%
  tab_style(
    style = list(cell_fill(color = "lightgreen"), cell_text(weight = "bold")),
    locations = cells_body(columns = estimate, rows = estimate > 0)
  ) %>%
  tab_options( #Tab options like fonts, font size, header font, and pixel sizes of cells
    table.font.name = "Arial",
    table.font.size = 12,
    heading.title.font.size = 16,
    column_labels.font.weight = "bold",
    table.border.top.style = "solid",
    table.border.bottom.style = "solid",
    column_labels.border.top.style = "solid",
    column_labels.border.bottom.style = "solid",
    data_row.padding = px(10),
    source_notes.font.size = 10,
    table.width = pct(100)
  ) %>%
  tab_style( #Colors 
    style = list(
      cell_fill(color = "darkblue"),
      cell_text(color = "white", weight = "bold")
    ),
    locations = cells_column_labels()
  ) %>%
  tab_style(
    style = cell_text(color = "navy"),
    locations = cells_body(
      columns = estimate
    )
  ) %>%
  tab_style( 
    style = cell_borders(
      sides = "right",
      color = "grey",
      weight = px(1)
    ),
    locations = cells_body(
      columns = everything()
    )
  ) %>%
  tab_style( #Fill colors
    style = list(
      cell_fill(color = "darkblue"),
      cell_text(color = "white", weight = "bold")
    ),
    locations = cells_column_spanners()
  )

# Print the final table
final_table

**Table 1. Linear Regression Model Predicting Total Cholesterol**
Variable	Coefficients and Findings	SE	t-value	Coefficients and Findings	p-value
Variable	Beta	SE	t-value	95% CI	p-value
Age (years)	0.00933	0.00080	11.703	0.00777, 0.01089	<0.001
Body Mass Index	0.00079	0.00204	0.38607	-0.00321, 0.00479	0.7
Alcohol consumption (12+ drinks/year)					<0.001
No	—	—	—	—
Yes	0.14708	0.03344	4.3981	0.08152, 0.21263
Physically active					>0.9
No	—	—	—	—
Yes	-0.00160	0.02781	-0.05748	-0.05611, 0.05291
Sleep hours per night	0.00202	0.01000	0.20159	-0.01759, 0.02162	0.8
Abbreviations: CI = Confidence Interval, SE = Standard Error

We can also save and export tables using the gtsave command. Below we can save our first table as the file example_table1.png

tbl_save <- summary_table #Set variable so the table can be saved

tbl_save %>%
  as_gt() %>% #Convert as into gtsummary doc
  gtsave(filename = "example_table1.png") # Save as PNG image using gtsave

file:///C:/Users/alexr/AppData/Local/Temp/RtmpKKm6p7/fileb34c183d554d.html screenshot completed

likewise you can save as a Word table using the package “flextable” and the command save_as_docx, and as an Excel file using the package “openxlsx” and the command as_hux_xlsx.

Part 2: Example usage for ECON 326 or ECON 490

Finally, let’s take a look at how we can use summary statistics tables for the applied sections of Econ 490 and Econ 326. We’ll be using the census_data datasets. Datasets created using the Cancensus API, an R interface that lets you tap directly into census data provided by Statistics Canada.

More information about the cancensus API can be found at this link This is particularly useful if you’d like to pull specific datasets from Statistics Canada yourself for future projects.

census_data <- read_csv("../datasets_intermediate/census_data.csv") #reading in the data, check out the github repository to see how to generate this yourself!

New names:
• `` -> `...1`

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 2091 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): ...1, population_density, age, income, education, car_commute_driv...
dbl  (2): pt_commute, total_reported_commute

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

census_data <- census_data %>%
mutate(across(-geometry, as.double)) %>%
drop_na()%>%
glimpse() #minor data cleaning

Warning: There were 10 warnings in `mutate()`.
The first warning was:
ℹ In argument: `across(-geometry, as.double)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run `dplyr::last_dplyr_warnings()` to see the 9 remaining warnings.

Rows: 1,680
Columns: 13
$ ...1                   <dbl> 1.00000, -123.02831, -123.02325, 49.28905, 2.00…
$ population_density     <dbl> 2110.90000, -123.02765, -123.02325, 49.28916, 4…
$ age                    <dbl> 43.20000, -123.02645, -123.02325, 49.28974, 42.…
$ income                 <dbl> 0.90965, -123.02631, -123.02327, 49.29078, 0.93…
$ education              <dbl> 195.00000, -123.02620, 49.28928, 49.29096, 140.…
$ car_commute_driver     <dbl> 240.00000, -123.02613, 49.28881, 49.29119, 165.…
$ car_commute_driven     <dbl> 0.00000, -123.02613, 49.28871, 49.29155, 0.0000…
$ pt_commute             <dbl> 25.00000, -123.02611, 49.28866, 49.29158, 20.00…
$ walk_commute           <dbl> 0.00000, -123.02610, 49.28837, 49.29191, 0.0000…
$ bike_commute           <dbl> 10.00000, -123.02609, 49.28791, 49.29215, 15.00…
$ other_commute          <dbl> 0.00000, -123.02609, 49.28750, 49.29239, 0.0000…
$ total_reported_commute <dbl> 165.00000, -123.02609, 49.28705, 49.29240, 120.…
$ geometry               <chr> "list(list(c(-123.02326818965, -123.02327362493…

Suppose, for our project, we were interested in determining if a greater proportion of people cycling to work is associated with a greater income, adjusted for age, education, and population density. Note the columns ending in _commute. These represent the number of walkers/bikers/drivers living within a given census area. total_reported_commute is the total number of individuals within each census district that reported their commute method to StatsCanada in the 2016 census. The first thing we want to check is that the sum of all _commute columns adds up to total_reported_commute. This will allow us to check for missing values, as well as, more importantly, that the vectors we picked for our analysis from the cancensus API are indeed the correct ones:

mismatch <- census_data %>%
  mutate(commute_sum = car_commute_driver + car_commute_driven +
           pt_commute + walk_commute + bike_commute + other_commute) %>%
  filter(commute_sum != total_reported_commute) %>%
  mutate(mismatch_amount = as.numeric(total_reported_commute - commute_sum))%>%
  mutate(commute_sum = as.numeric(commute_sum))%>%
  mutate(total_reported_commute = as.numeric(total_reported_commute)) %>%
  select(commute_sum, total_reported_commute, mismatch_amount)

glimpse(mismatch)

Rows: 1,675
Columns: 3
$ commute_sum            <dbl> 275.00000, -738.15665, 295.72996, 295.75077, 20…
$ total_reported_commute <dbl> 165.00000, -123.02609, 49.28705, 49.29240, 120.…
$ mismatch_amount        <dbl> -110.00000, 615.13056, -246.44291, -246.45838, …

We see that every row is slightly mismatched: people are either reporting more than one type of commute, or are reporting commute types that are not collected in the census (example, not commuting at all, or working from home). Let’s create a summary statistic table using gtsummary to figure out the scale of the differences.

summary_table <- mismatch %>%
  select(commute_sum, total_reported_commute, mismatch_amount) %>%
  tbl_summary(
    type = all_continuous() ~ "continuous2",  # we use continuous2 for multiple stats
    statistic = list(all_continuous() ~ c("{mean}", "{min}", "{max}")),
    digits = all_continuous() ~ 2,
    label = list(
      commute_sum ~ "Total Commute",
      total_reported_commute ~ "Reported Commute",
      mismatch_amount ~ "Mismatch Amount"
    )
  )

summary_table

Characteristic	N = 1,675
Total Commute
Mean	164.36
Min	-739.33
Max	3,760.00
Reported Commute
Mean	103.70
Min	-123.22
Max	2,760.00
Mismatch Amount
Mean	-60.66
Min	-1,000.00
Max	788.34

Great! We see that the sum of all commute types is about 61 people more than the total amount of people who reported their method of commuting for each census area. This suggests that individuals are reporting more than one type of commute. Lastly, we’ll put together a table of summary statistics for all our variables. Recall our research question: We are interested in determining if a increase in proportion of cyclists commuting to work per capita is correlated with increases in average income. Hence, we can exclude non-cycling modes of transport from our summary statistic table. We’ll begin by some minor data wrangling:

census_data <- census_data %>%
mutate(bike_prop = bike_commute / total_reported_commute) %>% #getting the proportion of cyclists
as_tibble() %>%
drop_na() %>%
glimpse()

Rows: 1,679
Columns: 14
$ ...1                   <dbl> 1.00000, -123.02831, -123.02325, 49.28905, 2.00…
$ population_density     <dbl> 2110.90000, -123.02765, -123.02325, 49.28916, 4…
$ age                    <dbl> 43.20000, -123.02645, -123.02325, 49.28974, 42.…
$ income                 <dbl> 0.90965, -123.02631, -123.02327, 49.29078, 0.93…
$ education              <dbl> 195.00000, -123.02620, 49.28928, 49.29096, 140.…
$ car_commute_driver     <dbl> 240.00000, -123.02613, 49.28881, 49.29119, 165.…
$ car_commute_driven     <dbl> 0.00000, -123.02613, 49.28871, 49.29155, 0.0000…
$ pt_commute             <dbl> 25.00000, -123.02611, 49.28866, 49.29158, 20.00…
$ walk_commute           <dbl> 0.00000, -123.02610, 49.28837, 49.29191, 0.0000…
$ bike_commute           <dbl> 10.00000, -123.02609, 49.28791, 49.29215, 15.00…
$ other_commute          <dbl> 0.00000, -123.02609, 49.28750, 49.29239, 0.0000…
$ total_reported_commute <dbl> 165.00000, -123.02609, 49.28705, 49.29240, 120.…
$ geometry               <chr> "list(list(c(-123.02326818965, -123.02327362493…
$ bike_prop              <dbl> 0.06060606, 1.00000000, 1.00001734, 0.99999500,…

Lastly, let’s put everything together into a summary statistics table:

summary_table_2 <- census_data|>
  select(population_density, income, education, bike_prop) %>%
  tbl_summary(
    type = all_continuous() ~ "continuous2",  # continuous2 for multiple stats
    statistic = list(all_continuous() ~ c("{mean}", "{min}", "{max}")),
    digits = all_continuous() ~ 2,
    label = list(
      population_density ~ "Population Density",
      income ~ "Income (in 100k CAD)",
      education ~ "Education Level",
      bike_prop ~ "Proportion of Bikers"
    )
  )

summary_table_2

Characteristic	N = 1,679
Population Density
Mean	5,905.33
Min	-123.22
Max	77,692.30
Income (in 100k CAD)
Mean	0.29
Min	-123.22
Max	49.31
Education Level
Mean	130.93
Min	-123.22
Max	4,700.00
Proportion of Bikers
Mean	0.43
Min	-2.50
Max	1.00

Looks great! Now suppose we wanted to isolate the summary statistics for census areas which have a bikeway passing through them and compare with census areas which do not have a bikeway (IE, include a dummy variable for presence of bikeways for each census area).

census_data_bikes <- read_csv('../datasets_intermediate/census_data_bikes.csv')

New names:
Rows: 991 Columns: 7
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," dbl
(7): ...1, population_density, age, income, education, has_bike_lane, bi...
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`

glimpse(census_data_bikes)

Rows: 991
Columns: 7
$ ...1               <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
$ population_density <dbl> 2110.9, 4575.3, 6663.7, 4895.0, 6567.9, 6445.0, 629…
$ age                <dbl> 43.2, 42.2, 41.5, 40.2, 40.1, 41.0, 43.9, 37.6, 42.…
$ income             <dbl> 0.90965, 0.93611, 0.84736, 0.45184, 0.82517, 0.7436…
$ education          <dbl> 195, 140, 235, 100, 135, 80, 165, 75, 180, 80, 130,…
$ has_bike_lane      <dbl> 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, …
$ bike_prop          <dbl> 0.06060606, 0.12500000, 0.00000000, 0.00000000, 0.0…

We will once again repeat the same process for generating our summary statistics table, this time with two columns.

summary_table_3 <- census_data_bikes  %>%
  select(population_density, income, education, bike_prop, has_bike_lane) |> # Include the dummy variable
  tbl_summary(
    by = has_bike_lane,  # Stratify by the dummy variable
     type = all_continuous() ~ "continuous2",  # continuous2 for multiple stats
    statistic = list(all_continuous() ~ c("{mean}", "{min}", "{max}")),
    digits = all_continuous() ~ 2,
    label = list(
      population_density ~ "Population Density",
      income ~ "Income (in 100k CAD)",
      education ~ "Education Level",
      bike_prop ~ "Proportion of Bikers"
    )
  ) |>
  modify_header(!!!list(
      label ~ "**Variable**",  
      stat_1 ~ "Bike Lanes: No",  
      stat_2 ~ "Bike Lanes: Yes"  
    )
  )

summary_table_3

Variable	Bike Lanes: No	Bike Lanes: Yes
Population Density
Mean	11,110.27	9,563.23
Min	1,137.10	297.90
Max	70,337.10	77,692.30
Income (in 100k CAD)
Mean	0.68	0.75
Min	0.13	0.13
Max	1.57	2.14
Education Level
Mean	175.07	239.52
Min	15.00	20.00
Max	845.00	4,700.00
Proportion of Bikers
Mean	0.07	0.08
Min	0.00	0.00
Max	0.36	0.38

We now have a completed summary table!

References

von Bergmann J (2024). VancouvR: Access the ‘City of Vancouver’ Open Data API. R package version 0.1.8, https://CRAN.R-project.org/package=VancouvR.
von Bergmann, J., Dmitry Shkolnik, and Aaron Jacobs (2021). cancensus: R package to access, retrieve, and work with Canadian Census data and geography. v0.4.2.
Additionally, for this notebook we will use the NHANES dataset more can be found at https://cran.r-project.org/web/packages/NHANES/NHANES.pdf It is a dataset with information from the US National Health and Nutrition Examination Study This dataset is within a package so it is easy to install and is also complex enough to run regression later or to show the uses of gtsummary.
Statistics Canada. (2016). Census Profile, 2016 Census of Population. Retrieved February 9, 2025, from https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/index.cfm?Lang=E
City of Vancouver. (n.d.). Bikeways [Data set]. Retrieved February 9, 2025, from https://vancouver.opendatasoft.com/explore/dataset/bikeways/information/
- Data licensed under the Vancouver Open Government License.