rm()
rm(list=ls())04 - Opening Datasets
Prerequisites
- Understand the basics of R such as data types and structures.
Learning Outcomes
- Load a variety of data types into R using various functions.
- View and reformat variables, specifically by factorizing.
- Work with missing data.
- Select subsets of observations and variables for use.
4.0 Intro
In this notebook, we will focus on loading, viewing and cleaning up our data set: these are fundamental skills which will be necessary for essentially every data project we will do. This data analysis process usually consists of four steps:
- We clear the workspace and set up the directory (the folder that R accesses whenever we run a command that either opens or saves a file).
- We load the data into R, meaning we take a file on our computer and tell R how to interpret it.
- We inspect the data through a variety of methods to ensure it looks good and has been properly loaded.
- We clean up the data by removing missing observations and adjusting the way variables are interpreted.
In this module, we will cover each of these steps in detail.
4.1 Clearing the Workspace and Changing the Directory
Our script files should begin with a command that clears the previous work that has been done in R. This makes sure that:
- we do not waste computer memory on things other than the current project;
- whatever result we obtain in the current session truly belongs to that session.
To clear our workspace, we can use the rm() function. If we do not specify any object names as inputs of the rm() function, R will remove all objects available in the workspace. Alternatively, we can use the function ls() which lists all the objects in the current workspace. Any one of the two commands below will clear our workspace from all existing objects.
Before importing data into R, it is useful to know how to change the folder that R accesses whenever we run a command that either opens or saves a file. Once we instruct R to change the directory to a specific folder, from that point onward it will open files from that folder and save all files to that folder, including data files and script files. R will continue to do this until either the program is closed or we change to another directory.
Before changing the directory, it is important to know what the current directory is. In R, we can view the current directory with the command getwd().
print(getwd())[1] "D:/GitHub/comet-open/project/docs/5_Research/econ490-r"
Now that we know what the current directory is, we can change it to any specific location you like by using the command setwd() and a file path in quotes.
For example, we can change our working directory to a directory named “some_folder/some_folder” with the command setwd("some_folder/some_subfolder").
Instead of changing directory every time, R allows us to create ‘projects’. RStudio Projects are built-in features of RStudio that allow us to create a working directory for a project which we can launch whenever we want.
To create an RStudio Project, first launch RStudio. Then navigate through File, New Project, New Directory, and then New Project. We can then choose the name of our project and select where we would like the project to be stored. To allow for the project to live on OneDrive (which is highly recommended), we select the OneDrive directory in our computer. Finally, we can create the project. If we access your OneDrive folder on our computer, we should then see a subfolder with our project name and a default .RProj object already inside.
Whenever we want to return to our project to work on it, we can simply click the .RStudio Project object above. We can also start a fresh session in RStudio and navigate to our project by selecting File, Open Project, and then following the specified instructions.
More details on RStudio Projects can be found in Module 17.
4.2 Loading Data
Before we can load our data, we need to tell R which packages we will be using in our notebook. Without these packages, R will not have access to the appropriate functions needed to interpret our raw data. As explained previously, packages only need to be installed once; however, they need to be imported every time we open a notebook.
We have discussed packages previously: for data loading, the two most important ones are tidyverse and haven.
tidyverseshould already be somewhat familiar. It includes a wide range of useful functions for working with data in R.havenis a special package containing functions that can be used to import data.
Let’s get started by loading them now.
# loading in our packages
library(tidyverse)Warning: package 'tidyverse' was built under R version 4.4.3
Warning: package 'tibble' was built under R version 4.4.2
Warning: package 'tidyr' was built under R version 4.4.2
Warning: package 'readr' was built under R version 4.4.2
Warning: package 'purrr' was built under R version 4.4.3
Warning: package 'dplyr' was built under R version 4.4.3
Warning: package 'stringr' was built under R version 4.4.2
Warning: package 'forcats' was built under R version 4.4.2
Warning: package 'lubridate' was built under R version 4.4.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(haven)Warning: package 'haven' was built under R version 4.4.2
library(IRdisplay)Data can be created by different programs and stored in different styles - these are called file types. We can usually tell what kind of file type we are working with by looking at the extension. For example, a text file usually has the extension .txt. The data we will be using in this course is commonly stored in Stata, Excel, text, or comma-separated variables (csv) files. These have the following types:
- .dta for a Stata data file;
- .xls or .xlsx for an Excel file;
- .txt for a text file;
- .csv for a comma-separated variables file.
To load any data set, we need to use the appropriate function in order to specify to R the format in which the data is stored:
- To load a .csv file, we use the command
read_csv("file name"). - To load a STATA data file, we use the command
read_dta("file name"). - To load an Excel file, we use the command
read_excel("file name"). - To load a text file, we use the command
read_table("file name", header = FALSE).- The header argument specifies whether or not we have specified column names in our data file.
- There exist many other commands to import different types of data files. Feel free to research other shortcuts that might help you with whatever data you are using!
In this module, we’ll be working with the data set in the "fake_data.dta" files. This data set is simulating information of workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings.
Let’s read in our data in .dta format now.
# reading in the data
fake_data <- read_dta("../econ490-stata/fake_data.dta") ## .. just tells R to go back one folder.4.3 Viewing Data
Now that we’ve loaded in our data, it’s important to inspect the data. Let’s look at a series of commands which help us to do this.
4.3.1 glimpse and print
The first command we are going to use describes the basic characteristics of the variables in the loaded data set.
glimpse(fake_data)Rows: 138,138
Columns: 11
$ workerid <dbl> 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 8, 8,…
$ year <dbl> 1999, 2001, 2001, 2002, 2003, 2005, 2010, 1997, 2001, 20…
$ sex <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "…
$ age <dbl> 55, 57, 54, 55, 56, 54, 59, 45, 49, 55, 57, 41, 45, 46, …
$ start_year <dbl> 1997, 1997, 2001, 2001, 2001, 2005, 2005, 1997, 1997, 19…
$ region <dbl> 1, 1, 4, 4, 4, 5, 5, 5, 5, 2, 2, 5, 5, 5, 5, 2, 2, 4, 4,…
$ treated <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
$ earnings <dbl> 39975.008, 278378.062, 18682.600, 293336.406, 111797.258…
$ sample_weight <dbl> 0.26076493, 0.01427392, 0.03218683, 0.47120222, 0.704380…
$ quarter_birth <dbl> 2, 2, 4, 2, 2, 4, 2, 2, 3, 3, 3, 4, 3, 2, 2, 1, 3, 3, 3,…
$ schooling <dbl> 16, 16, 16, 16, 16, 16, 16, 12, 12, 14, 14, 16, 16, 16, …
Alternatively, we can use the print command, which displays the same information as the glimpse command but in horizontal form.
print(fake_data)# A tibble: 138,138 × 11
workerid year sex age start_year region treated earnings sample_weight
<dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1999 M 55 1997 1 0 39975. 0.261
2 1 2001 M 57 1997 1 0 278378. 0.0143
3 2 2001 M 54 2001 4 0 18683. 0.0322
4 2 2002 M 55 2001 4 0 293336. 0.471
5 2 2003 M 56 2001 4 0 111797. 0.704
6 3 2005 M 54 2005 5 0 88352. 0.356
7 3 2010 M 59 2005 5 0 46230. 0.897
8 4 1997 M 45 1997 5 1 24911. 0.399
9 4 2001 M 49 1997 5 1 9908. 0.552
10 5 2009 M 55 1998 2 1 137207. 0.0144
# ℹ 138,128 more rows
# ℹ 2 more variables: quarter_birth <dbl>, schooling <dbl>
With many variables, this can be harder to read than the glimpse command. Thus, we typically prefer to use the glimpse command.
4.3.2 view, head, and tail
In addition to the glimpse command, we can also see the raw data we have imported as if it were an Excel file. To do this, we can use the view function. This command will open a clear representation of our data as though it were a spreadsheet. We can also use the command head. This prints out a preview of our data set exactly as it would appear in Excel (showing the first ten rows by default). We can then specify a numeric argument to the function to change the number of rows we want to see, as well as the specific rows we want via indicating their positions.
head(fake_data)# A tibble: 6 × 11
workerid year sex age start_year region treated earnings sample_weight
<dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1999 M 55 1997 1 0 39975. 0.261
2 1 2001 M 57 1997 1 0 278378. 0.0143
3 2 2001 M 54 2001 4 0 18683. 0.0322
4 2 2002 M 55 2001 4 0 293336. 0.471
5 2 2003 M 56 2001 4 0 111797. 0.704
6 3 2005 M 54 2005 5 0 88352. 0.356
# ℹ 2 more variables: quarter_birth <dbl>, schooling <dbl>
There is even the function tail, which functions identically to head but works from the back of the data set (outputs the final rows).
tail(fake_data)# A tibble: 6 × 11
workerid year sex age start_year region treated earnings sample_weight
<dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 39999 1995 M 32 1995 1 0 80151. 0.154
2 39999 1997 M 34 1995 1 0 17961. 0.336
3 39999 1998 M 35 1995 1 0 37257. 0.390
4 39999 2003 M 40 1995 1 0 175264. 0.857
5 39999 2004 M 41 1995 1 0 70465. 0.254
6 39999 2005 M 42 1995 1 0 85328. 0.117
# ℹ 2 more variables: quarter_birth <dbl>, schooling <dbl>
Opening the data editor has many benefits. Most importantly we get to see our data as a whole, allowing us to have a clearer perspective of the information the data set is providing us. For example, here we observe that we have unique worker codes, the year where they are observed, worker characteristics, and whether or not they participated in the training program. This viewing process is particularly useful when we first load a data set, since it lets us know if our data has been loaded in correctly and looks appropriate.
4.3.3 summary and sapply
We can further analyze any variable by using the summary command. This command gives us the minimum, 25th percentile, 50th percentile (median), 75th percentile, and max of each of our variables, as well as the mean of each of these variables. It is a good command for getting a quick overview of the general spread of all variables in our data set.
summary(fake_data) workerid year sex age
Min. : 1 Min. :1995 Length:138138 Min. :23.00
1st Qu.:10107 1st Qu.:1998 Class :character 1st Qu.:40.00
Median :20025 Median :2001 Mode :character Median :45.00
Mean :20064 Mean :2002 Mean :45.16
3rd Qu.:30045 3rd Qu.:2005 3rd Qu.:51.00
Max. :39999 Max. :2011 Max. :63.00
start_year region treated earnings
Min. :1995 Min. :1.000 Min. :0.0000 Min. : 36
1st Qu.:1995 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.: 20563
Median :1995 Median :2.000 Median :0.0000 Median : 43783
Mean :1995 Mean :2.469 Mean :0.2943 Mean : 84136
3rd Qu.:1995 3rd Qu.:4.000 3rd Qu.:1.0000 3rd Qu.: 92378
Max. :2010 Max. :5.000 Max. :1.0000 Max. :63573580
sample_weight quarter_birth schooling
Min. :0.0000008 Min. :1.000 Min. : 8.0
1st Qu.:0.2511469 1st Qu.:2.000 1st Qu.:14.0
Median :0.5018479 Median :3.000 Median :15.0
Mean :0.5010372 Mean :2.499 Mean :15.4
3rd Qu.:0.7501103 3rd Qu.:3.000 3rd Qu.:17.0
Max. :0.9999995 Max. :4.000 Max. :26.0
From the command above, we can tell that this function will only be meaningful for variables in numeric or integer form.
We can also apply summary to specific variables.
summary(fake_data$earnings) Min. 1st Qu. Median Mean 3rd Qu. Max.
36 20563 43783 84136 92378 63573580
If we want to quickly access more specific information about our variables, such as their standard deviations, we can supply this as an argument to the function sapply. It will output the standard deviations of each of our numeric variables. However, it will not operate on character variables. Remember, we can check the type of each variable using the glimpse function from earlier.
sapply(fake_data, sd)Warning in is.data.frame(x): NAs introduced by coercion
workerid year sex age start_year
1.153471e+04 4.656744e+00 NA 7.423285e+00 1.611155e+00
region treated earnings sample_weight quarter_birth
1.554385e+00 4.557356e-01 2.528017e+05 2.882314e-01 9.587452e-01
schooling
2.205572e+00
We can also apply arguments such as mean, min, and median to the function above; however, sd is a good one since it is not covered in the summary function.
4.3.4 count and table
We can also learn more about the frequency of the different measures of our variables by using the command count. We simply supply a specific variable to the function to see the distribution of values for that variable.
count(fake_data, region)# A tibble: 5 × 2
region n
<dbl> <int>
1 1 54364
2 2 34072
3 3 6216
4 4 17572
5 5 25914
Here we can see that there are five regions indicated in this data set, that more people surveyed came from region 1 and then fewer people surveyed came from region 3. Similarly, we can use the table function and specify our variable to accomplish the same task.
table(fake_data$region)
1 2 3 4 5
54364 34072 6216 17572 25914
4.4 Cleaning Data
Now that we’ve loaded in our data, the next step is to do some rudimentary data cleaning. This most commonly includes factorizing variables and dropping missing observations.
4.4.1 Factorizing Variables
We have already seen that there are different types of variables which can be stored in R. Namely, there are quantitative variables and qualitative variables. Any quantitative variable can be stored in R as a set of strings or letters. These are known as character variables. Qualitative variables can also be stored in R as factor variables. Factor variables associate a qualitative response to a categorical value, making analysis much easier. Additionally, data is often encoded, meaning that the levels of a qualitative variable are represented by “codes”, usually in numeric form.
Look at the region variable in the output from glimpse above:
region <dbl> 1, 1, 4, 4, 4, 5, 5, 5, 5, 2, 2, 5, 5, 5, 5, 2, 2, 4, 4, 2,~
The region variable in this data set corresponds to a particular region that the worker is living in. We can also see the variable type is <dbl+lbl>: this is a labeled double. This is good: it means that R already understands what the levels of this variable mean.
There are three similar ways to change variables into factor variables.
- We can change a specific variable inside a data frame to a factor by using the
as_factorcommand. Let’s do that below, using the special pipe%>%operator. This operator allows us to pipe existing code into a new function. In this way, it helps us break up long code across many lines, improving legibility. You can think of the pipe operator as saying AND THEN when describing your code aloud.
fake_data <- fake_data %>% #we start by saying we want to update the data, AND THEN... (%>%)
mutate(region = as_factor(region)) #mutate (update) region to be a factor variable
glimpse(fake_data)Rows: 138,138
Columns: 11
$ workerid <dbl> 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 8, 8,…
$ year <dbl> 1999, 2001, 2001, 2002, 2003, 2005, 2010, 1997, 2001, 20…
$ sex <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "…
$ age <dbl> 55, 57, 54, 55, 56, 54, 59, 45, 49, 55, 57, 41, 45, 46, …
$ start_year <dbl> 1997, 1997, 2001, 2001, 2001, 2005, 2005, 1997, 1997, 19…
$ region <fct> 1, 1, 4, 4, 4, 5, 5, 5, 5, 2, 2, 5, 5, 5, 5, 2, 2, 4, 4,…
$ treated <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
$ earnings <dbl> 39975.008, 278378.062, 18682.600, 293336.406, 111797.258…
$ sample_weight <dbl> 0.26076493, 0.01427392, 0.03218683, 0.47120222, 0.704380…
$ quarter_birth <dbl> 2, 2, 4, 2, 2, 4, 2, 2, 3, 3, 3, 4, 3, 2, 2, 1, 3, 3, 3,…
$ schooling <dbl> 16, 16, 16, 16, 16, 16, 16, 12, 12, 14, 14, 16, 16, 16, …
Do you see the difference in the region variable? You can also see that the type has changed to <fct>, a factor variable.
R would already know how to “decode” the factor variables from the imported data if and only if they were of type <dbl+lbl>. What about when this isn’t the case? This brings us to the next method:
- We can supply a list of factors using the
factorcommand. This command takes two other values:- A list of levels the qualitative variable will take on.
- A list of labels, one for each level, describing what each level means.
We can create a custom factor variable as follows:
#first, we write down a list of levels
region_levels = c(1:5)
#then, we write down a list of our labels
region_labels = c('Region A', 'Region B', 'Region C', 'Region D', 'Region E')
#now, we use the command but with some options - telling factor() how to interpret the levels
fake_data <- fake_data %>% #we start by saying we want to update the data, AND THEN... (%>%)
mutate(region2 = factor(region, #notice it's factor, not as_factor
levels = region_levels,
labels = region_labels)) #mutate (update region) to be a factor of regions
glimpse(fake_data)Rows: 138,138
Columns: 12
$ workerid <dbl> 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 8, 8,…
$ year <dbl> 1999, 2001, 2001, 2002, 2003, 2005, 2010, 1997, 2001, 20…
$ sex <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "…
$ age <dbl> 55, 57, 54, 55, 56, 54, 59, 45, 49, 55, 57, 41, 45, 46, …
$ start_year <dbl> 1997, 1997, 2001, 2001, 2001, 2005, 2005, 1997, 1997, 19…
$ region <fct> 1, 1, 4, 4, 4, 5, 5, 5, 5, 2, 2, 5, 5, 5, 5, 2, 2, 4, 4,…
$ treated <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
$ earnings <dbl> 39975.008, 278378.062, 18682.600, 293336.406, 111797.258…
$ sample_weight <dbl> 0.26076493, 0.01427392, 0.03218683, 0.47120222, 0.704380…
$ quarter_birth <dbl> 2, 2, 4, 2, 2, 4, 2, 2, 3, 3, 3, 4, 3, 2, 2, 1, 3, 3, 3,…
$ schooling <dbl> 16, 16, 16, 16, 16, 16, 16, 12, 12, 14, 14, 16, 16, 16, …
$ region2 <fct> Region A, Region A, Region D, Region D, Region D, Region…
Again, do you see the difference between region and region2 here? This is how we can customize factor labels when creating new variables.
- The final method is very similar to the first. If we have a large data set, it can be tiresome to decode all of the variables one-by-one. Instead, we can use
as_factoron the entire data set and it will convert all of the variables with appropriate types.
fake_data <- as_factor(fake_data)
glimpse(fake_data)Rows: 138,138
Columns: 12
$ workerid <dbl> 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 8, 8,…
$ year <dbl> 1999, 2001, 2001, 2002, 2003, 2005, 2010, 1997, 2001, 20…
$ sex <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "…
$ age <dbl> 55, 57, 54, 55, 56, 54, 59, 45, 49, 55, 57, 41, 45, 46, …
$ start_year <dbl> 1997, 1997, 2001, 2001, 2001, 2005, 2005, 1997, 1997, 19…
$ region <fct> 1, 1, 4, 4, 4, 5, 5, 5, 5, 2, 2, 5, 5, 5, 5, 2, 2, 4, 4,…
$ treated <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
$ earnings <dbl> 39975.008, 278378.062, 18682.600, 293336.406, 111797.258…
$ sample_weight <dbl> 0.26076493, 0.01427392, 0.03218683, 0.47120222, 0.704380…
$ quarter_birth <dbl> 2, 2, 4, 2, 2, 4, 2, 2, 3, 3, 3, 4, 3, 2, 2, 1, 3, 3, 3,…
$ schooling <dbl> 16, 16, 16, 16, 16, 16, 16, 12, 12, 14, 14, 16, 16, 16, …
$ region2 <fct> Region A, Region A, Region D, Region D, Region D, Region…
This is our final data set, with all variables factorized.
4.4.2 Removing Missing Data
We often face the challenge of dealing with missing values among observations for some of our variables. To check if any of our variables have missing values, we can use the is.na function alongside the any function. This code will return a value of TRUE or FALSE depending on whether we do or do not have any missing observations in our data set.
any(is.na(fake_data))[1] FALSE
Here, we can see that our data set already has no missing observations, so we do not need to worry about the process of potentially removing or redefining them. However, this is often not the case.
Let’s go through the process of dropping missing observations for the sex variable anyway, assuming that missing observations are coded as “not available”. We will do this as a demonstration, even though no observations will actually be dropped. To do this, we will use the filter() method. This function conditionally drops rows (observations) by evaluating each row against the supplied condition. Only observations where the condition is true/met are retained (selection by inclusion) in the data frame. To use this to drop hypothetical missing observations for sex, we do the following:
filter(fake_data, sex != "not available")# A tibble: 138,138 × 12
workerid year sex age start_year region treated earnings sample_weight
<dbl> <dbl> <chr> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 1 1999 M 55 1997 1 0 39975. 0.261
2 1 2001 M 57 1997 1 0 278378. 0.0143
3 2 2001 M 54 2001 4 0 18683. 0.0322
4 2 2002 M 55 2001 4 0 293336. 0.471
5 2 2003 M 56 2001 4 0 111797. 0.704
6 3 2005 M 54 2005 5 0 88352. 0.356
7 3 2010 M 59 2005 5 0 46230. 0.897
8 4 1997 M 45 1997 5 1 24911. 0.399
9 4 2001 M 49 1997 5 1 9908. 0.552
10 5 2009 M 55 1998 2 1 137207. 0.0144
# ℹ 138,128 more rows
# ℹ 3 more variables: quarter_birth <dbl>, schooling <dbl>, region2 <fct>
!= is a conditional statement for “not equal to”. Therefore, we are telling R to keep the observations that are not equal to “not available”.
This process utilized the filter function, which retains rows meeting a specific condition. However, we can also supply a series of conditions to filter at once. We could have, for instance, decided that we only wanted to keep observations for females from region 1. In this case, we could run the following code.
head(filter(fake_data, sex == "F" & region == 1))# A tibble: 6 × 12
workerid year sex age start_year region treated earnings sample_weight
<dbl> <dbl> <chr> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 286 1997 F 44 1995 1 0 42032. 0.782
2 286 2004 F 51 1995 1 0 268393. 0.872
3 286 2005 F 52 1995 1 0 39651. 0.657
4 286 2011 F 58 1995 1 0 19746. 0.916
5 288 2003 F 36 1999 1 1 39691. 0.369
6 288 2006 F 39 1999 1 1 11393. 0.573
# ℹ 3 more variables: quarter_birth <dbl>, schooling <dbl>, region2 <fct>
4.4.3 Removing Variables
Beyond filtering observations as was done above, we sometimes want to “filter” our variables. This process of operating on columns instead of rows requires the select function instead of the filter function. This is a useful function when we have more data at our disposal than we actually need to answer the research question at hand. This is especially pertinent given the propensity for data sets to collect an abundance of information, some of which may not be useful to us and instead slow down our loading and cleaning process.
Let’s assume we are interested in seeing the gender wage gap among male and female workers of region 2, and nothing else. To help us with our analysis, we can filter by only observations which belong to region 2, then select for just the variables we are interested in.
head(fake_data %>% filter(region == 2) %>% select(sex, earnings)) # A tibble: 6 × 2
sex earnings
<chr> <dbl>
1 M 137207.
2 M 5228.
3 M 208645.
4 M 330875.
5 M 296822
6 M 165501.
In the code above, we pass as parameters to the select function every column we wish to keep.
select(variables, I, want, to, keep)select(-variables, -I, -don't, -want)
This is very useful and is usually done for practical reasons such as memory. Cleaning data sets to remove unessential information also allows us to focus our analysis and makes it easier to answer our desired research question. In our specific case, we want to keep data on just wages and sex. We have used the select function for this. If we were to further our research of the gender wage gap within region 2, we would now be able to refer to “fake_data” more quickly for immediate results.
4.5 Common Mistakes
Common mistakes happen because we do not respect the format of specific variables. Let’s say we want to filter the observations in order to get only women working in region 1. We may forget that variable sex is a string variable and type the following:
head(filter(fake_data, sex == F & region == 1))# A tibble: 0 × 12
# ℹ 12 variables: workerid <dbl>, year <dbl>, sex <chr>, age <dbl>,
# start_year <dbl>, region <fct>, treated <dbl>, earnings <dbl>,
# sample_weight <dbl>, quarter_birth <dbl>, schooling <dbl>, region2 <fct>
We obtain a tibble with no observations. This mistake occurs when we forget to wrap values of string variables in quotes. The correct command would be the following:
head(filter(fake_data, sex == "F" & region == 1))# A tibble: 6 × 12
workerid year sex age start_year region treated earnings sample_weight
<dbl> <dbl> <chr> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 286 1997 F 44 1995 1 0 42032. 0.782
2 286 2004 F 51 1995 1 0 268393. 0.872
3 286 2005 F 52 1995 1 0 39651. 0.657
4 286 2011 F 58 1995 1 0 19746. 0.916
5 288 2003 F 36 1999 1 1 39691. 0.369
6 288 2006 F 39 1999 1 1 11393. 0.573
# ℹ 3 more variables: quarter_birth <dbl>, schooling <dbl>, region2 <fct>
4.6 Wrap Up
In this notebook, we have covered the basic process of working with data. Specifically, we looked at how to load in data, how to view it, and how to clean it by factorizing, dropping and selecting variables and observations. This general scheme is critical to any research project, so it is important to keep in mind as you progress throughout your undergraduate economics coursework and beyond. In the next section, we will cover a larger concept which is also essential to the cleaning of a data set, but merits its own section: creating variables.
4.7 Wrap-up Table
| Command | Function |
|---|---|
rm() |
It removes all objects in the workspace. |
getwd() |
It shows the current working directory. |
setwd() |
It changes the working directory to a file path of our choice. |
read_dta() |
It imports a .dta file. |
read_csv() |
It imports a .csv file. |
read_table() |
It imports a .txt file. |
glimpse() |
It shows basic characteristics of the data. |
print() |
It shows basic characteristics of the data, displaying them on a horizontal format. |
head() |
It shows the top observations of the data. |
tail() |
It shows the bottom observations of the data. |
summary() |
It gives the minimum, 25th percentile, 50th percentile (median), 75th percentile, and max of each variable. |
sapply() |
It returns a given statistic for each variable of the dataset. |
count() |
It counts how many different values there are for a given variable. |
as_factor() |
It transforms a variable into a factor variable. |
is.na() |
It returns a value of TRUE if there are not-available observations for a given variable; otherwise, it returns FALSE. |
filter() |
It filters the data according to specific conditions that observations must satisfy. |
select() |
It keeps only certain variables of our data. |