# Uncomment and run this cell to install necessary packages
# install.packages("broom.helpers")
# install.packages("jsonlite")
# install.packages("NHANES")
# install.packages("gtsummary") #This is our key gtsummary package which is the topic of this notebookIntermediate - Summary Statistics
gtsummary package and gives examples of how to create and present effective summary statistics in a research project.
Outline
Prerequisites
- Intermediate knowledge of R and Basic Jupyter Skills
- An understanding of Tables and Basic Statistics
- Knowledge of Regressions
Outcomes
By the end of this notebook, you will be able to:
- Install and Use the
gtsummaryPackage - Use
gtsummaryto create basic summary tables - Use
gtsummaryto create stratified tables - Use
gtsummaryto customize the tables (Headers, Footnotes, colors and sizes) - Use
gtsummaryto create tables for regression output - Use
gtsummaryto Create tables for Projects in classes like Econ 326 or Econ 490
Part 1: Using gtsummary
Firstly let’s introduce gtsummary: gtsummary is an R package designed to create publication-ready analytical and summary tables with ease and flexibility. It offers a powerful set of tools for data scientists, statisticians, and researchers working with R to summarize datasets, regression models, and more. It is one of the best packages at this task and is a useful tool to use for projects, research or other statistical uses.
Let’s learn how to use it!
Running the next code block will install necessary packages, you may be prompted to choose a CRAN mirror, just choose the closest geographical one, example for Vancouver choose US (OR). This will be all of the packages needed for the notebook.
This will run the packages so we can use them for the code below.
library(jsonlite)Warning: package 'jsonlite' was built under R version 4.4.2
library(gt)Warning: package 'gt' was built under R version 4.4.3
library(gtsummary) #This is our key gtsummary package which is the topic of this notebookWarning: package 'gtsummary' was built under R version 4.4.3
library(dplyr)Warning: package 'dplyr' was built under R version 4.4.3
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(tidyverse)Warning: package 'tidyverse' was built under R version 4.4.3
Warning: package 'tibble' was built under R version 4.4.2
Warning: package 'tidyr' was built under R version 4.4.2
Warning: package 'readr' was built under R version 4.4.2
Warning: package 'purrr' was built under R version 4.4.3
Warning: package 'stringr' was built under R version 4.4.2
Warning: package 'forcats' was built under R version 4.4.2
Warning: package 'lubridate' was built under R version 4.4.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0 ✔ readr 2.1.5
✔ ggplot2 3.5.1 ✔ stringr 1.5.1
✔ lubridate 1.9.4 ✔ tibble 3.2.1
✔ purrr 1.0.4 ✔ tidyr 1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ purrr::flatten() masks jsonlite::flatten()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(NHANES)Warning: package 'NHANES' was built under R version 4.4.3
This is our dataset which we will be using for the first section of this notebook. We have chosen this dataset as it is available within R as a package and does not have to be installed separately. The NHANES dataset is a subset of data from the National Health and Nutrition Examination Survey, a comprehensive study designed to assess the health and nutritional status of adults and children in the United States.
Please run the cell below and see if the head command correctly displays the data.
#Check if the data is properly loaded
data(NHANES)
head(NHANES, 20)# A tibble: 20 × 76
ID SurveyYr Gender Age AgeDecade AgeMonths Race1 Race3 Education
<int> <fct> <fct> <int> <fct> <int> <fct> <fct> <fct>
1 51624 2009_10 male 34 " 30-39" 409 White <NA> High School
2 51624 2009_10 male 34 " 30-39" 409 White <NA> High School
3 51624 2009_10 male 34 " 30-39" 409 White <NA> High School
4 51625 2009_10 male 4 " 0-9" 49 Other <NA> <NA>
5 51630 2009_10 female 49 " 40-49" 596 White <NA> Some College
6 51638 2009_10 male 9 " 0-9" 115 White <NA> <NA>
7 51646 2009_10 male 8 " 0-9" 101 White <NA> <NA>
8 51647 2009_10 female 45 " 40-49" 541 White <NA> College Grad
9 51647 2009_10 female 45 " 40-49" 541 White <NA> College Grad
10 51647 2009_10 female 45 " 40-49" 541 White <NA> College Grad
11 51654 2009_10 male 66 " 60-69" 795 White <NA> Some College
12 51656 2009_10 male 58 " 50-59" 707 White <NA> College Grad
13 51657 2009_10 male 54 " 50-59" 654 White <NA> 9 - 11th Grade
14 51659 2009_10 female 10 " 10-19" 123 White <NA> <NA>
15 51666 2009_10 female 58 " 50-59" 700 Mexican <NA> High School
16 51667 2009_10 male 50 " 50-59" 603 White <NA> Some College
17 51671 2009_10 female 9 " 0-9" 112 Black <NA> <NA>
18 51677 2009_10 male 33 " 30-39" 404 White <NA> High School
19 51678 2009_10 male 60 " 60-69" 721 White <NA> High School
20 51679 2009_10 male 16 " 10-19" 194 Other <NA> <NA>
# ℹ 67 more variables: MaritalStatus <fct>, HHIncome <fct>, HHIncomeMid <int>,
# Poverty <dbl>, HomeRooms <int>, HomeOwn <fct>, Work <fct>, Weight <dbl>,
# Length <dbl>, HeadCirc <dbl>, Height <dbl>, BMI <dbl>,
# BMICatUnder20yrs <fct>, BMI_WHO <fct>, Pulse <int>, BPSysAve <int>,
# BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>, BPSys2 <int>, BPDia2 <int>,
# BPSys3 <int>, BPDia3 <int>, Testosterone <dbl>, DirectChol <dbl>,
# TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>, UrineVol2 <int>, …
Summary Tables
Here we will make a basic summary table with the variables Age, BMI, Weight, Race1, Education, Gender which will show an overview of the dataset. Along with this we will have first mean and then in brackets standard deviation and for specific variables percent of n within each category.
# Load the NHANES data
data(NHANES)
# Create a summary table for selected variables with our NHANES dataset
summary_table <- NHANES %>%
select(Age, BMI, Weight, Race1, Education, Gender) %>% #Here we select the variables we want on our summary table
tbl_summary(
statistic = list(
all_continuous() ~ "{mean} ({sd})", #all continuous variables will have mean and standard deviation
all_categorical() ~ "{n} ({p}%)" #all categorical variables will have n and p included
),
digits = all_continuous() ~ 1,
missing = "no"
) %>%
modify_header(label = "**Variable**") %>% #Using modify_header we can change the label
bold_labels()
# Print the summary table
summary_table| Variable | N = 10,0001 |
|---|---|
| Age | 36.7 (22.4) |
| BMI | 26.7 (7.4) |
| Weight | 71.0 (29.1) |
| Race1 | |
| Black | 1,197 (12%) |
| Hispanic | 610 (6.1%) |
| Mexican | 1,015 (10%) |
| White | 6,372 (64%) |
| Other | 806 (8.1%) |
| Education | |
| 8th Grade | 451 (6.2%) |
| 9 - 11th Grade | 888 (12%) |
| High School | 1,517 (21%) |
| Some College | 2,267 (31%) |
| College Grad | 2,098 (29%) |
| Gender | |
| female | 5,020 (50%) |
| male | 4,980 (50%) |
| 1 Mean (SD); n (%) | |
To show a more complex table we can also use the by command to sort by if the observation has diabetes. This is called stratifying by group variables. Here the key function is by where by setting it to the variable Diabetes we can show the summary statistics for three different groups Overall, No Diabetes and Diagnosed with Diabetes.
We can also use the command add_ to add certain statistics to our table for instance here we will use add_n() to show how many people in the dataset have diabetes. We can use add_ for different things apart from number of observations for instance add_p will show p values. More information on what can be added can be found at tbl_summary() tutorial under the header “{gtsummary} functions to add information.”
#Same as before make a table variable and select variables in the dataset using select(...)
summary_table2 <- NHANES %>%
select(Age, BMI, Weight, Race1, Gender, Diabetes) %>%
tbl_summary(
by = Diabetes, #Here we can stratify our table by people who have diabetes.
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n}"
),
digits = all_continuous() ~ 1,
missing = "no"
) %>%
add_overall() %>%
add_n() %>% #Add Number of Observations
modify_header(label = "**Variable**") %>%
bold_labels() %>%
modify_spanning_header(c("stat_1", "stat_2") ~ "**Diabetes Status**") #modify_spanning_header is the overall title of our table142 missing rows in the "Diabetes" column have been removed.
#Print the table
summary_table2| Variable | N | Overall N = 9,8581 |
Diabetes Status
|
|
|---|---|---|---|---|
| No N = 9,0981 |
Yes N = 7601 |
|||
| Age | 9,858 | 37.3 (22.1) | 35.4 (21.6) | 59.2 (14.7) |
| BMI | 9,629 | 26.7 (7.4) | 26.2 (7.1) | 32.6 (8.2) |
| Weight | 9,780 | 71.9 (28.3) | 70.2 (28.0) | 91.6 (25.3) |
| Race1 | 9,858 | |||
| Black | 1,184 | 1,053 | 131 | |
| Hispanic | 602 | 555 | 47 | |
| Mexican | 991 | 925 | 66 | |
| White | 6,290 | 5,840 | 450 | |
| Other | 791 | 725 | 66 | |
| Gender | 9,858 | |||
| female | 4,949 | 4,592 | 357 | |
| male | 4,909 | 4,506 | 403 | |
| 1 Mean (SD); n | ||||
We can also make custom headers and footnotes using the command modify_header, modify_footnoteand modify_caption. This is shown in the codeblock below where we will create a table with a custom header, title and footnotes.
#Same as before but we to focus our code we will start from our output of the stratified table from summary_table2.
summary_table3 <- summary_table2 %>%
modify_header( #This command will modify the header in the table
stat_0 ~ "**All Observations**",
stat_1 ~ "**No Diabetes**",
stat_2 ~ "**Diagnosed with Diabetes**"
) %>%
modify_footnote(
all_stat_cols() ~ "Median (IQR) for continuous variables; n (%) for categorical variables" #Here we change the footnote with the command modify_footnote
) %>%
modify_caption("**Table 1. Demographic and Clinical Characteristics**") #And here we change the overall title
# Print the table
summary_table3| Variable | N | All Observations1 |
Diabetes Status
|
|
|---|---|---|---|---|
| No Diabetes1 | Diagnosed with Diabetes1 | |||
| Age | 9,858 | 37.3 (22.1) | 35.4 (21.6) | 59.2 (14.7) |
| BMI | 9,629 | 26.7 (7.4) | 26.2 (7.1) | 32.6 (8.2) |
| Weight | 9,780 | 71.9 (28.3) | 70.2 (28.0) | 91.6 (25.3) |
| Race1 | 9,858 | |||
| Black | 1,184 | 1,053 | 131 | |
| Hispanic | 602 | 555 | 47 | |
| Mexican | 991 | 925 | 66 | |
| White | 6,290 | 5,840 | 450 | |
| Other | 791 | 725 | 66 | |
| Gender | 9,858 | |||
| female | 4,949 | 4,592 | 357 | |
| male | 4,909 | 4,506 | 403 | |
| 1 Median (IQR) for continuous variables; n (%) for categorical variables | ||||
We can also change stylistic choices, this is the same table as before information wise, but it has been substantially changed visually. gtsummary offers many options to change the look of our tables. Below is an example where we will drastically change the look of the table using the commands tab_options and tab_style, you are free to experiment and change the code here to see how the colors, borders and labels can change.
summary_table4 <- NHANES %>% #create the variable
select(Age, BMI, Weight, Race1, Gender, Diabetes) %>% #select variables
tbl_summary(
by = Diabetes, #use the by command to stratify the table
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n}"
),
digits = all_continuous() ~ 1,
missing = "no"
) %>%
add_overall() %>%
add_n() %>%
modify_header(label = "**Variable**") %>%
bold_labels() %>%
modify_spanning_header(c("stat_1", "stat_2") ~ "**Diabetes Status**") %>%
modify_header(
stat_0 ~ "**All Observations**",
stat_1 ~ "**No Diabetes**",
stat_2 ~ "**Diabetes**"
) %>%
modify_footnote( #change footnote label
all_stat_cols() ~ "Median (IQR) for continuous variables; n (%) for categorical variables"
) %>%
modify_caption("**Table 1. Demographic and Clinical Characteristics**") %>%
as_gt() %>%
tab_options( #Change tab options like font, font size, header styles
table.font.name = "Arial",
table.font.size = 12,
heading.title.font.size = 16,
column_labels.font.weight = "bold",
table.border.top.style = "solid",
table.border.bottom.style = "solid",
column_labels.border.top.style = "solid",
column_labels.border.bottom.style = "solid",
data_row.padding = px(10),
source_notes.font.size = 10,
table.width = pct(100)
) %>%
tab_style( #Change colors of the cells
style = list(
cell_fill(color = "lightblue"),
cell_text(weight = "bold")
),
locations = cells_column_labels()
) %>%
tab_style(
style = cell_text(color = "navy"),
locations = cells_body(
columns = vars(stat_0)
)
) %>%
tab_style( #Change cell border colors
style = cell_borders(
sides = "right",
color = "grey",
weight = px(1)
),
locations = cells_body(
columns = everything()
)
)142 missing rows in the "Diabetes" column have been removed.
Warning: Since gt v0.3.0, `columns = vars(...)` has been deprecated.
• Please use `columns = c(...)` instead.
# Print the table
summary_table4| Variable | N | All Observations1 |
Diabetes Status
|
|
|---|---|---|---|---|
| No Diabetes1 | Diabetes1 | |||
| Age | 9,858 | 37.3 (22.1) | 35.4 (21.6) | 59.2 (14.7) |
| BMI | 9,629 | 26.7 (7.4) | 26.2 (7.1) | 32.6 (8.2) |
| Weight | 9,780 | 71.9 (28.3) | 70.2 (28.0) | 91.6 (25.3) |
| Race1 | 9,858 | |||
| Black | 1,184 | 1,053 | 131 | |
| Hispanic | 602 | 555 | 47 | |
| Mexican | 991 | 925 | 66 | |
| White | 6,290 | 5,840 | 450 | |
| Other | 791 | 725 | 66 | |
| Gender | 9,858 | |||
| female | 4,949 | 4,592 | 357 | |
| male | 4,909 | 4,506 | 403 | |
| 1 Median (IQR) for continuous variables; n (%) for categorical variables | ||||
Regression Tables
Apart from summary tables gtsummary can also be used for regression analysis. Here we will continue to use the NHANES dataset, and the regression formula we will use is:
\[ \texttt{TotChol}_i = \beta_0 + \beta_1 \cdot \texttt{Age}_i + \beta_2 \cdot \texttt{BMI}_i + \beta_3 \cdot \texttt{Alcohol12PlusYr}_i + \beta_4 \cdot \texttt{PhysActive}_i + \beta_5 \cdot \texttt{SleepHrsNight}_i + \varepsilon_i \]
Here we will make a simple table showing our regression results, with our beta coefficients, 95% confidence intervals, and \(p\)-values.
# Prepare the data
nhanes_data <- NHANES %>%
select(TotChol, Age, BMI, Alcohol12PlusYr, PhysActive, SleepHrsNight) %>% #Select variables like above
na.omit() # Remove rows with missing values
# Our regression model from above
model <- lm(TotChol ~ Age + BMI + Alcohol12PlusYr + PhysActive + SleepHrsNight, data = nhanes_data)
# Create a summary table using gtsummary
tbl_summary <- tbl_regression(model,
label = list( #Set labels for our variables
Age ~ "Age (years)",
BMI ~ "Body Mass Index",
Alcohol12PlusYr ~ "Alcohol consumption (12+ drinks/year)",
PhysActive ~ "Physically active",
SleepHrsNight ~ "Sleep hours per night"
)) %>%
add_global_p() %>% #Add a p-value
bold_p(t = 0.05) %>%
bold_labels() %>%
modify_header(label = "**Variable**") %>% #Change headers like for summary tables
modify_spanning_header(c("estimate", "conf.low", "conf.high") ~ "**Coefficient (95% CI)**") %>% #Spaning header
modify_caption("**Table 1. Linear Regression Model Predicting Total Cholesterol**") #Changes the caption
# Print the table
tbl_summary| Variable |
Coefficient (95% CI)
|
p-value | |
|---|---|---|---|
| Beta | 95% CI | ||
| Age (years) | 0.01 | 0.01, 0.01 | <0.001 |
| Body Mass Index | 0.00 | 0.00, 0.00 | 0.7 |
| Alcohol consumption (12+ drinks/year) | <0.001 | ||
| No | — | — | |
| Yes | 0.15 | 0.08, 0.21 | |
| Physically active | >0.9 | ||
| No | — | — | |
| Yes | 0.00 | -0.06, 0.05 | |
| Sleep hours per night | 0.00 | -0.02, 0.02 | 0.8 |
| Abbreviation: CI = Confidence Interval | |||
Here we see that the regression table is good, but it can be improved, for instance we can change the number of significant figures on the coefficient values currently it is three and we can not clearly see the result of our regression analysis. Let’s make it 5 for the Beta’s and coefficient values. Along with this we are also missing Standard Errors and t-values which can show a more complete analysis. We will add them using the code below.
# Prepare the data for our model
nhanes_data <- NHANES %>%
select(TotChol, Age, BMI, Alcohol12PlusYr, PhysActive, SleepHrsNight) %>% #Select Variables
na.omit() # Remove rows with missing values
# Regression Model like above
model <- lm(TotChol ~ Age + BMI + Alcohol12PlusYr + PhysActive + SleepHrsNight, data = nhanes_data)
# GT table
tbl_summary2 <- tbl_regression(model,
label = list( #Labels for our variables
Age ~ "Age (years)",
BMI ~ "Body Mass Index",
Alcohol12PlusYr ~ "Alcohol consumption (12+ drinks/year)",
PhysActive ~ "Physically active",
SleepHrsNight ~ "Sleep hours per night"
)) %>%
modify_fmt_fun(estimate ~ function(x) style_sigfig(x, digits = 5)) %>%
modify_fmt_fun(c(conf.low, conf.high) ~ function(x) style_sigfig(x, digits = 5)) %>%
modify_column_unhide(columns = c(statistic, std.error)) %>% #Here we add standard Errors
modify_header(statistic ~ "**t-value**", std.error ~ "**SE**") %>% #Here we add t-value and SE
modify_fmt_fun(statistic ~ function(x) style_sigfig(x, digits = 5)) %>%
modify_fmt_fun(std.error ~ function(x) style_sigfig(x, digits = 5)) %>%
add_global_p() %>% #P-values
bold_p(t = 0.05) %>%
bold_labels() %>%
modify_header(label = "**Variable**") %>% #Like above modify labels for header
modify_spanning_header(c("estimate", "conf.low", "conf.high") ~ "**Coefficients and Findings**") %>% #Like above modify labels for spanning header
modify_caption("**Table 1. Linear Regression Model Predicting Total Cholesterol**") #Like above modify labels for caption
# Print our new table
tbl_summary2| Variable |
Coefficients and Findings
|
SE | t-value |
Coefficients and Findings
|
p-value |
|---|---|---|---|---|---|
| Beta | 95% CI | ||||
| Age (years) | 0.00933 | 0.00080 | 11.703 | 0.00777, 0.01089 | <0.001 |
| Body Mass Index | 0.00079 | 0.00204 | 0.38607 | -0.00321, 0.00479 | 0.7 |
| Alcohol consumption (12+ drinks/year) | <0.001 | ||||
| No | — | — | — | — | |
| Yes | 0.14708 | 0.03344 | 4.3981 | 0.08152, 0.21263 | |
| Physically active | >0.9 | ||||
| No | — | — | — | — | |
| Yes | -0.00160 | 0.02781 | -0.05748 | -0.05611, 0.05291 | |
| Sleep hours per night | 0.00202 | 0.01000 | 0.20159 | -0.01759, 0.02162 | 0.8 |
| Abbreviations: CI = Confidence Interval, SE = Standard Error | |||||
Just like above we can do all kinds of cosmetic changes to our new regression summary tables; for instance we can change colors or fonts. Once again feel free to change fonts or colors and see how the end result changes.
# Prepare the data
nhanes_data <- NHANES %>%
select(TotChol, Age, BMI, Alcohol12PlusYr, PhysActive, SleepHrsNight) %>%
na.omit() # Remove rows with missing values
# Regression Model
model <- lm(TotChol ~ Age + BMI + Alcohol12PlusYr + PhysActive + SleepHrsNight, data = nhanes_data)
# Create and modify the gtsummary table
tbl_summary3 <- tbl_regression(model,
label = list( #Labels for our varaibles
Age ~ "Age (years)",
BMI ~ "Body Mass Index",
Alcohol12PlusYr ~ "Alcohol consumption (12+ drinks/year)",
PhysActive ~ "Physically active",
SleepHrsNight ~ "Sleep hours per night"
)) %>%
modify_fmt_fun(estimate ~ function(x) style_sigfig(x, digits = 5)) %>%
modify_fmt_fun(c(conf.low, conf.high) ~ function(x) style_sigfig(x, digits = 5)) %>%
modify_column_unhide(columns = c(statistic, std.error)) %>%
modify_header( #Headers
label ~ "**Variable**",
estimate ~ "**Beta**",
std.error ~ "**SE**",
statistic ~ "**t-value**",
p.value ~ "**p-value**"
) %>%
modify_fmt_fun(statistic ~ function(x) style_sigfig(x, digits = 5)) %>%
modify_fmt_fun(std.error ~ function(x) style_sigfig(x, digits = 5)) %>%
add_global_p() %>%
bold_p(t = 0.05) %>%
bold_labels() %>%
modify_spanning_header(c(estimate, conf.low, conf.high) ~ "**Coefficients and Findings**") %>% #Spanning Header
modify_caption("**Table 1. Linear Regression Model Predicting Total Cholesterol**") #Caption of Table
# Convert to gt object and apply gt-specific cosmetic modifications
final_table <- tbl_summary3 %>%
as_gt() %>% #as_gt allows us to make more customizations
opt_stylize(style = 6, color = "blue") %>%
tab_style(
style = list(cell_fill(color = "lightgreen"), cell_text(weight = "bold")),
locations = cells_body(columns = estimate, rows = estimate > 0)
) %>%
tab_options( #Tab options like fonts, font size, header font, and pixel sizes of cells
table.font.name = "Arial",
table.font.size = 12,
heading.title.font.size = 16,
column_labels.font.weight = "bold",
table.border.top.style = "solid",
table.border.bottom.style = "solid",
column_labels.border.top.style = "solid",
column_labels.border.bottom.style = "solid",
data_row.padding = px(10),
source_notes.font.size = 10,
table.width = pct(100)
) %>%
tab_style( #Colors
style = list(
cell_fill(color = "darkblue"),
cell_text(color = "white", weight = "bold")
),
locations = cells_column_labels()
) %>%
tab_style(
style = cell_text(color = "navy"),
locations = cells_body(
columns = estimate
)
) %>%
tab_style(
style = cell_borders(
sides = "right",
color = "grey",
weight = px(1)
),
locations = cells_body(
columns = everything()
)
) %>%
tab_style( #Fill colors
style = list(
cell_fill(color = "darkblue"),
cell_text(color = "white", weight = "bold")
),
locations = cells_column_spanners()
)
# Print the final table
final_table| Variable |
Coefficients and Findings
|
SE | t-value |
Coefficients and Findings
|
p-value |
|---|---|---|---|---|---|
| Beta | 95% CI | ||||
| Age (years) | 0.00933 | 0.00080 | 11.703 | 0.00777, 0.01089 | <0.001 |
| Body Mass Index | 0.00079 | 0.00204 | 0.38607 | -0.00321, 0.00479 | 0.7 |
| Alcohol consumption (12+ drinks/year) | <0.001 | ||||
| No | — | — | — | — | |
| Yes | 0.14708 | 0.03344 | 4.3981 | 0.08152, 0.21263 | |
| Physically active | >0.9 | ||||
| No | — | — | — | — | |
| Yes | -0.00160 | 0.02781 | -0.05748 | -0.05611, 0.05291 | |
| Sleep hours per night | 0.00202 | 0.01000 | 0.20159 | -0.01759, 0.02162 | 0.8 |
| Abbreviations: CI = Confidence Interval, SE = Standard Error | |||||
We can also save and export tables using the gtsave command. Below we can save our first table as the file example_table1.png
tbl_save <- summary_table #Set variable so the table can be saved
tbl_save %>%
as_gt() %>% #Convert as into gtsummary doc
gtsave(filename = "example_table1.png") # Save as PNG image using gtsavefile:///C:/Users/alexr/AppData/Local/Temp/RtmpKKm6p7/fileb34c183d554d.html screenshot completed
likewise you can save as a Word table using the package “flextable” and the command save_as_docx, and as an Excel file using the package “openxlsx” and the command as_hux_xlsx.
Part 2: Example usage for ECON 326 or ECON 490
Finally, let’s take a look at how we can use summary statistics tables for the applied sections of Econ 490 and Econ 326. We’ll be using the census_data datasets. Datasets created using the Cancensus API, an R interface that lets you tap directly into census data provided by Statistics Canada.
More information about the cancensus API can be found at this link This is particularly useful if you’d like to pull specific datasets from Statistics Canada yourself for future projects.
census_data <- read_csv("../datasets_intermediate/census_data.csv") #reading in the data, check out the github repository to see how to generate this yourself!New names:
• `` -> `...1`
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 2091 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): ...1, population_density, age, income, education, car_commute_driv...
dbl (2): pt_commute, total_reported_commute
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
census_data <- census_data %>%
mutate(across(-geometry, as.double)) %>%
drop_na()%>%
glimpse() #minor data cleaning Warning: There were 10 warnings in `mutate()`.
The first warning was:
ℹ In argument: `across(-geometry, as.double)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run `dplyr::last_dplyr_warnings()` to see the 9 remaining warnings.
Rows: 1,680
Columns: 13
$ ...1 <dbl> 1.00000, -123.02831, -123.02325, 49.28905, 2.00…
$ population_density <dbl> 2110.90000, -123.02765, -123.02325, 49.28916, 4…
$ age <dbl> 43.20000, -123.02645, -123.02325, 49.28974, 42.…
$ income <dbl> 0.90965, -123.02631, -123.02327, 49.29078, 0.93…
$ education <dbl> 195.00000, -123.02620, 49.28928, 49.29096, 140.…
$ car_commute_driver <dbl> 240.00000, -123.02613, 49.28881, 49.29119, 165.…
$ car_commute_driven <dbl> 0.00000, -123.02613, 49.28871, 49.29155, 0.0000…
$ pt_commute <dbl> 25.00000, -123.02611, 49.28866, 49.29158, 20.00…
$ walk_commute <dbl> 0.00000, -123.02610, 49.28837, 49.29191, 0.0000…
$ bike_commute <dbl> 10.00000, -123.02609, 49.28791, 49.29215, 15.00…
$ other_commute <dbl> 0.00000, -123.02609, 49.28750, 49.29239, 0.0000…
$ total_reported_commute <dbl> 165.00000, -123.02609, 49.28705, 49.29240, 120.…
$ geometry <chr> "list(list(c(-123.02326818965, -123.02327362493…
Suppose, for our project, we were interested in determining if a greater proportion of people cycling to work is associated with a greater income, adjusted for age, education, and population density. Note the columns ending in _commute. These represent the number of walkers/bikers/drivers living within a given census area. total_reported_commute is the total number of individuals within each census district that reported their commute method to StatsCanada in the 2016 census. The first thing we want to check is that the sum of all _commute columns adds up to total_reported_commute. This will allow us to check for missing values, as well as, more importantly, that the vectors we picked for our analysis from the cancensus API are indeed the correct ones:
mismatch <- census_data %>%
mutate(commute_sum = car_commute_driver + car_commute_driven +
pt_commute + walk_commute + bike_commute + other_commute) %>%
filter(commute_sum != total_reported_commute) %>%
mutate(mismatch_amount = as.numeric(total_reported_commute - commute_sum))%>%
mutate(commute_sum = as.numeric(commute_sum))%>%
mutate(total_reported_commute = as.numeric(total_reported_commute)) %>%
select(commute_sum, total_reported_commute, mismatch_amount)
glimpse(mismatch)Rows: 1,675
Columns: 3
$ commute_sum <dbl> 275.00000, -738.15665, 295.72996, 295.75077, 20…
$ total_reported_commute <dbl> 165.00000, -123.02609, 49.28705, 49.29240, 120.…
$ mismatch_amount <dbl> -110.00000, 615.13056, -246.44291, -246.45838, …
We see that every row is slightly mismatched: people are either reporting more than one type of commute, or are reporting commute types that are not collected in the census (example, not commuting at all, or working from home). Let’s create a summary statistic table using gtsummary to figure out the scale of the differences.
summary_table <- mismatch %>%
select(commute_sum, total_reported_commute, mismatch_amount) %>%
tbl_summary(
type = all_continuous() ~ "continuous2", # we use continuous2 for multiple stats
statistic = list(all_continuous() ~ c("{mean}", "{min}", "{max}")),
digits = all_continuous() ~ 2,
label = list(
commute_sum ~ "Total Commute",
total_reported_commute ~ "Reported Commute",
mismatch_amount ~ "Mismatch Amount"
)
)
summary_table| Characteristic | N = 1,675 |
|---|---|
| Total Commute | |
| Mean | 164.36 |
| Min | -739.33 |
| Max | 3,760.00 |
| Reported Commute | |
| Mean | 103.70 |
| Min | -123.22 |
| Max | 2,760.00 |
| Mismatch Amount | |
| Mean | -60.66 |
| Min | -1,000.00 |
| Max | 788.34 |
Great! We see that the sum of all commute types is about 61 people more than the total amount of people who reported their method of commuting for each census area. This suggests that individuals are reporting more than one type of commute. Lastly, we’ll put together a table of summary statistics for all our variables. Recall our research question: We are interested in determining if a increase in proportion of cyclists commuting to work per capita is correlated with increases in average income. Hence, we can exclude non-cycling modes of transport from our summary statistic table. We’ll begin by some minor data wrangling:
census_data <- census_data %>%
mutate(bike_prop = bike_commute / total_reported_commute) %>% #getting the proportion of cyclists
as_tibble() %>%
drop_na() %>%
glimpse()Rows: 1,679
Columns: 14
$ ...1 <dbl> 1.00000, -123.02831, -123.02325, 49.28905, 2.00…
$ population_density <dbl> 2110.90000, -123.02765, -123.02325, 49.28916, 4…
$ age <dbl> 43.20000, -123.02645, -123.02325, 49.28974, 42.…
$ income <dbl> 0.90965, -123.02631, -123.02327, 49.29078, 0.93…
$ education <dbl> 195.00000, -123.02620, 49.28928, 49.29096, 140.…
$ car_commute_driver <dbl> 240.00000, -123.02613, 49.28881, 49.29119, 165.…
$ car_commute_driven <dbl> 0.00000, -123.02613, 49.28871, 49.29155, 0.0000…
$ pt_commute <dbl> 25.00000, -123.02611, 49.28866, 49.29158, 20.00…
$ walk_commute <dbl> 0.00000, -123.02610, 49.28837, 49.29191, 0.0000…
$ bike_commute <dbl> 10.00000, -123.02609, 49.28791, 49.29215, 15.00…
$ other_commute <dbl> 0.00000, -123.02609, 49.28750, 49.29239, 0.0000…
$ total_reported_commute <dbl> 165.00000, -123.02609, 49.28705, 49.29240, 120.…
$ geometry <chr> "list(list(c(-123.02326818965, -123.02327362493…
$ bike_prop <dbl> 0.06060606, 1.00000000, 1.00001734, 0.99999500,…
Lastly, let’s put everything together into a summary statistics table:
summary_table_2 <- census_data|>
select(population_density, income, education, bike_prop) %>%
tbl_summary(
type = all_continuous() ~ "continuous2", # continuous2 for multiple stats
statistic = list(all_continuous() ~ c("{mean}", "{min}", "{max}")),
digits = all_continuous() ~ 2,
label = list(
population_density ~ "Population Density",
income ~ "Income (in 100k CAD)",
education ~ "Education Level",
bike_prop ~ "Proportion of Bikers"
)
)
summary_table_2| Characteristic | N = 1,679 |
|---|---|
| Population Density | |
| Mean | 5,905.33 |
| Min | -123.22 |
| Max | 77,692.30 |
| Income (in 100k CAD) | |
| Mean | 0.29 |
| Min | -123.22 |
| Max | 49.31 |
| Education Level | |
| Mean | 130.93 |
| Min | -123.22 |
| Max | 4,700.00 |
| Proportion of Bikers | |
| Mean | 0.43 |
| Min | -2.50 |
| Max | 1.00 |
Looks great! Now suppose we wanted to isolate the summary statistics for census areas which have a bikeway passing through them and compare with census areas which do not have a bikeway (IE, include a dummy variable for presence of bikeways for each census area).
census_data_bikes <- read_csv('../datasets_intermediate/census_data_bikes.csv')New names:
Rows: 991 Columns: 7
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," dbl
(7): ...1, population_density, age, income, education, has_bike_lane, bi...
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
glimpse(census_data_bikes)Rows: 991
Columns: 7
$ ...1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
$ population_density <dbl> 2110.9, 4575.3, 6663.7, 4895.0, 6567.9, 6445.0, 629…
$ age <dbl> 43.2, 42.2, 41.5, 40.2, 40.1, 41.0, 43.9, 37.6, 42.…
$ income <dbl> 0.90965, 0.93611, 0.84736, 0.45184, 0.82517, 0.7436…
$ education <dbl> 195, 140, 235, 100, 135, 80, 165, 75, 180, 80, 130,…
$ has_bike_lane <dbl> 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, …
$ bike_prop <dbl> 0.06060606, 0.12500000, 0.00000000, 0.00000000, 0.0…
We will once again repeat the same process for generating our summary statistics table, this time with two columns.
summary_table_3 <- census_data_bikes %>%
select(population_density, income, education, bike_prop, has_bike_lane) |> # Include the dummy variable
tbl_summary(
by = has_bike_lane, # Stratify by the dummy variable
type = all_continuous() ~ "continuous2", # continuous2 for multiple stats
statistic = list(all_continuous() ~ c("{mean}", "{min}", "{max}")),
digits = all_continuous() ~ 2,
label = list(
population_density ~ "Population Density",
income ~ "Income (in 100k CAD)",
education ~ "Education Level",
bike_prop ~ "Proportion of Bikers"
)
) |>
modify_header(!!!list(
label ~ "**Variable**",
stat_1 ~ "Bike Lanes: No",
stat_2 ~ "Bike Lanes: Yes"
)
)
summary_table_3| Variable | Bike Lanes: No | Bike Lanes: Yes |
|---|---|---|
| Population Density | ||
| Mean | 11,110.27 | 9,563.23 |
| Min | 1,137.10 | 297.90 |
| Max | 70,337.10 | 77,692.30 |
| Income (in 100k CAD) | ||
| Mean | 0.68 | 0.75 |
| Min | 0.13 | 0.13 |
| Max | 1.57 | 2.14 |
| Education Level | ||
| Mean | 175.07 | 239.52 |
| Min | 15.00 | 20.00 |
| Max | 845.00 | 4,700.00 |
| Proportion of Bikers | ||
| Mean | 0.07 | 0.08 |
| Min | 0.00 | 0.00 |
| Max | 0.36 | 0.38 |
We now have a completed summary table!
References
- von Bergmann J (2024). VancouvR: Access the ‘City of Vancouver’ Open Data API. R package version 0.1.8, https://CRAN.R-project.org/package=VancouvR.
- von Bergmann, J., Dmitry Shkolnik, and Aaron Jacobs (2021). cancensus: R package to access, retrieve, and work with Canadian Census data and geography. v0.4.2.
- Additionally, for this notebook we will use the NHANES dataset more can be found at https://cran.r-project.org/web/packages/NHANES/NHANES.pdf It is a dataset with information from the US National Health and Nutrition Examination Study This dataset is within a package so it is easy to install and is also complex enough to run regression later or to show the uses of
gtsummary. - Statistics Canada. (2016). Census Profile, 2016 Census of Population. Retrieved February 9, 2025, from https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/index.cfm?Lang=E
- City of Vancouver. (n.d.). Bikeways [Data set]. Retrieved February 9, 2025, from https://vancouver.opendatasoft.com/explore/dataset/bikeways/information/
- Data licensed under the Vancouver Open Government License.