Warning: package 'digest' was built under R version 4.4.2
Part 1: Geospatial Data
Geospatial data is data coded to represent physical objects. For example, let’s say we’re interested in analyzing the accessibility of healthcare in BC. A dataset containing locations of emergency departments (coded in longitude and latitude coordinates) and population density per municipality would constitute a geospatial dataset. We could use the geospatial data to estimate the share of BC residents within 10km of an emergency department. That analysis would involve calculating relationships between physical objects (i.e., location of emergency departments) and overlaying numeric data (i.e., population density per municipality) on the physical objects.
The first step to learning how to conduct geospatial analyses like this in R is to understand how geospatial data is stored - that’s what we’ll cover in this section.
Vector vs Raster Data
In R, geospatial data can be stored as either vector data or raster data.
Vector data is coded in a collection of mathematical objects, such as points, lines, and polygons. Geospatial objects with vector data have discrete boundaries and are associated with specific locations through a coordinate reference systems (CRS).
Raster data is coded as 2-D cells with constant size, called pixels. These pixels are accompanied with information that links them with a specific location.
Vector data is most widely used in the social sciences because applications in politics or economics typically require discrete boundaries of administrative regions (e.g., country or state borders). For this reason, we’ll focus on conducting geospatial analysis with vector data.
Vector data codes geospatial objects with the following elements:
Points, coded as c(x, y): the most basic element, used when the area of the objects are not meaningful (e.g., locations of emergency departments in BC).
Linestrings, coded as rbind(c(x1, y1), c(x2, y2), c(x3, y3)): a series of points, used when the length of an object is meaningful but the area is not (e.g., rivers, roads).
Polygons, coded as list(rbind(c(x1, y1), c(x2, y2), c(x3, y3), c(x1, y1))): a series of closed points, used when the area of an object is meaningful (e.g., Canadian provinces, metropolitan areas, protected areas).
Note: to be closed objects, polygons need to start and end at the same points; notice above that the polygon starts at c(x1, y1) and also ends at c(x1, y1).
What exactly are these (x, y) coordinates? How does R (and the user) know which locations these coordinates represent?
Coordinate Reference Systems
A coordinate reference system can be thought of as the base map in which your geospatial objects will be overlayed. There are two main types of CRS for vector data: geographic CRS and projected CRS.
Geographic CRS map locations with longitude and latitude coordinates. The x’s and y’s for the points, linestrings, and polygons introduced above will simply be longitude and latitude coordinates on the base map.
Note: Geographic coordinates are spherical! This means that you cannot use the distance formula you learned in high-school to calculate the distance between two points coded in c(longitude, latitude) format. More on this later.
Projected CRS map locations with Cartesian coordinates on a flat surface. The x’s and y’s for the points, linestrings, and polygons introduced above will simply be x’s and y’s of a regular xy plane grid. There are different ways to project the earth on a flat surface, and that implies different ways to store objects and the relationships between them (e.g., conic, cylindrical, equidistant, equal-area…). More on this later.
The sf package
The sf package is currently the most widely used package for manipulating geospatial data in vector format in R. The package supports all of the elements described above (i.e., points, linestrings, polygons), as well as any combination of those objects (i.e., multipoints, multilinestrings, multipolygons, and geometry collections). We’ll introduce them as needed throughout this notebook.
The beauty of the sf package is that it is compatible with tidyverse: geometric objects are stored in dataframes, and we can manipulate those objects with the typical tidyverse functions we use with non-spatial datasets.
To illustrate, let’s create some geospatial data from scratch.
# creating a pointa_point <-c(0,1)class(a_point)
[1] "numeric"
Let’s transform our point into a geospatial object using the st_point() function.
# transforming point into geospatial objectgeo_point <-st_point(a_point)geo_point
POINT (0 1)
class(geo_point)
[1] "XY" "POINT" "sfg"
Notice that the data is coded as POINT (0 1) and the type of the object is sfg, which stands for simple feature geometry. The functions to transform R numeric vectors into sfg objects are:
st_point()
st_linestring()
st_polygon()
We can bind these sfg objects into a simple feature column with the function st_sfc().
# creating two more pointsanother_point <-c(1,2)yet_another_point <-c(3,3)# transforming points into `sfg` objectsgeo_point_2 <-st_point(another_point)geo_point_3 <-st_point(yet_another_point)# combining `sfg` objects into a simple feature column sfc_obj <-st_sfc(geo_point, geo_point_2, geo_point_3)class(sfc_obj)
[1] "sfc_POINT" "sfc"
Now the type of the object is sfc. R also recognizes that it is a sfc_POINT, because we only passed points to the simple features column. Simple feature columns support different types of simple feature objects in a same column (e.g., a column with a point, a linestring, and a geometry collection).
Since we have coded our elements as geospatial data, we can now plot the points in space.
plot(sfc_obj)
Let’s take a look at the output of sfc_obj.
sfc_obj
Geometry set for 3 features
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 0 ymin: 1 xmax: 3 ymax: 3
CRS: NA
POINT (0 1)
POINT (1 2)
POINT (3 3)
The output of this object gives us the following information:
We only have points as geometric objects in the column
The dimension of our objects is 2-D (i.e., we only passed 2 coordinates for each point c(x, y))
The bounds of our plot are [c(0,1), c(3,1), c(3,3), c(0,3)]
We have not specified a CRS (i.e., we don’t have a base map)
We can specify the CRS for our geometric data when creating the sfc object with the parameter crs. Here, we have chosen the crs “EPSG:4326”, which is a basic map of the world.
Geometry set for 3 features
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 0 ymin: 1 xmax: 3 ymax: 3
Geodetic CRS: WGS 84
POINT (0 1)
POINT (1 2)
POINT (3 3)
See now that R knows that our data refers to the Geodetic CRS WGS 84. A simple search of the term will tell you that this is one of the most widely used geographic CRS’s, in which the c(x, y) represents latitude and longitude coordinates.
Once we specify the CRS, R knows which locations on Earth our geometric objects refer to. This allows us to overlay objects we create on existing geospatial objects that share the same CRS.
Let’s see this in practice. Let’s load the world dataset from the spData package. We’ll use this dataset, which contains country borders, to overlay two lines connecting Salvador, Brazil, to (1) Luanda, Angola, and to (2) Maputo, Mozambique.
# check the data and plot the `world` geometryhead(world)
Simple feature collection with 6 features and 10 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -180 ymin: -18.28799 xmax: 180 ymax: 83.23324
Geodetic CRS: WGS 84
# A tibble: 6 × 11
iso_a2 name_long continent region_un subregion type area_km2 pop lifeExp
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 FJ Fiji Oceania Oceania Melanesia Sove… 1.93e4 8.86e5 70.0
2 TZ Tanzania Africa Africa Eastern … Sove… 9.33e5 5.22e7 64.2
3 EH Western S… Africa Africa Northern… Inde… 9.63e4 NA NA
4 CA Canada North Am… Americas Northern… Sove… 1.00e7 3.55e7 82.0
5 US United St… North Am… Americas Northern… Coun… 9.51e6 3.19e8 78.8
6 KZ Kazakhstan Asia Asia Central … Sove… 2.73e6 1.73e7 71.6
# ℹ 2 more variables: gdpPercap <dbl>, geom <MULTIPOLYGON [°]>
plot(world["name_long"]) # specify that we want to plot the attribute `name_long`
We can see from the output above that the country boundaries are stored as multipolygons, and that the CRS for the world geometry is WGS 84.
As shown earlier, the argument used to transform numeric vectors to this CRS is crs = "EPSG:4326". Let’s use that information to draw the lines from Salvador to Luanda and Maputo and overlay them on the plot. We’ll need the coordinates of these cities, which we can find by searching for them online.
Note: for this CRS, we need to specify the coordinates as c(longitude, latitude). The longitude and latitude of Salvador are approximately c(-38.5, -13) and those of Luanda and Maputo are c(13.2, -8.8) and c(32.6, -26) respectively.
The plot() function is the easiest way to plot geospatial objects in R. The default of this function plots one map for each attribute (try running plot(world) to see for yourself), so we specify name_long to plot a single map for the output. The parameters reset = FALSE and add = TRUE are needed to overlay plots.
Geospatial Operations
We now turn our attention to objects that contain both simple features and other data types as columns. The world dataset, which we used in the previous section, has those features. Let’s examine this dataset further.
str(world)
sf [177 × 11] (S3: sf/tbl_df/tbl/data.frame)
$ iso_a2 : chr [1:177] "FJ" "TZ" "EH" "CA" ...
$ name_long: chr [1:177] "Fiji" "Tanzania" "Western Sahara" "Canada" ...
$ continent: chr [1:177] "Oceania" "Africa" "Africa" "North America" ...
$ region_un: chr [1:177] "Oceania" "Africa" "Africa" "Americas" ...
$ subregion: chr [1:177] "Melanesia" "Eastern Africa" "Northern Africa" "Northern America" ...
$ type : chr [1:177] "Sovereign country" "Sovereign country" "Indeterminate" "Sovereign country" ...
$ area_km2 : num [1:177] 19290 932746 96271 10036043 9510744 ...
$ pop : num [1:177] 8.86e+05 5.22e+07 NA 3.55e+07 3.19e+08 ...
$ lifeExp : num [1:177] 70 64.2 NA 82 78.8 ...
$ gdpPercap: num [1:177] 8222 2402 NA 43079 51922 ...
$ geom :sfc_MULTIPOLYGON of length 177; first list element: List of 3
..$ :List of 1
.. ..$ : num [1:5, 1:2] -180 -180 -180 -180 -180 ...
..$ :List of 1
.. ..$ : num [1:9, 1:2] 178 178 177 177 178 ...
..$ :List of 1
.. ..$ : num [1:8, 1:2] 180 180 179 179 179 ...
..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
- attr(*, "sf_column")= chr "geom"
- attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA
..- attr(*, "names")= chr [1:10] "iso_a2" "name_long" "continent" "region_un" ...
class(world)
[1] "sf" "tbl_df" "tbl" "data.frame"
We can see that world contains the boundaries of countries as multipolygons, but also contains attributes of those countries as other types of data. Recognizing this, R categorizes this dataset as both sf and data.frame.
We can create sf objects with categorical and numeric attributes using the function st_sf(). Let’s show this by creating a custom dataset for three cities in Portugal.
# create a data.frame object containing the cities and their populationscity_attr <-data.frame( name =c("Lisboa", "Coimbra", "Porto"),population =c(545796, 106655, 231800) )# specify points as c(lon, lat) coordinates of citiesl_coord <-st_point(c(-9.1, 38.7)) c_coord <-st_point(c(-8.4, 40.2))p_coord <-st_point(c(-8.6, 41.1))# create a column with the specified CRScoordinates <-st_sfc(l_coord, c_coord, p_coord, crs ="EPSG:4326") # create `sf` object with the data.frame and the sfccity_data <-st_sf(city_attr, geometry = coordinates) city_data
Simple feature collection with 3 features and 2 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -9.1 ymin: 38.7 xmax: -8.4 ymax: 41.1
Geodetic CRS: WGS 84
name population geometry
1 Lisboa 545796 POINT (-9.1 38.7)
2 Coimbra 106655 POINT (-8.4 40.2)
3 Porto 231800 POINT (-8.6 41.1)
class(city_data)
[1] "sf" "data.frame"
The image below illustrates the process of creating sf objects from sfc and data.frame objects, which we did above.
Lovelace, Robin and Nowosad, Jakub and Muenchow, Jannes, (2019). Geocomputation with R.
Now let’s suppose we want to merge city_data with data from the major rivers in Portugal that pass through those cities.
Note: These are not the actual coordinates of the rivers. The points are simulated and made so that they pass through the major cities.
# creating river data# data.frame objectriver_names =data.frame( river_name =c("Tejo", "Mondego", "Douro"))# specify points as c(lon, lat) coordinates of riverst_coord <-st_linestring(rbind(c(-10, 37), c(-9.1, 38.7), c(-9.1, 39.3))) m_coord <-st_linestring(rbind(c(-8, 38.5), c(-8.4, 40.2), c(-7.8, 41.9)))d_coord <-st_linestring(rbind(c(-8, 42),c(-8.6, 41.1), c(-8.8, 40)))# create a column with the specified CRSrivers <-st_sfc(t_coord, m_coord, d_coord, crs ="EPSG:4326") # create `sf` object with the data.frame and the sfcriver_data <-st_sf(river_names, river_geometry = rivers) plot(river_data)
One way of joining the data is to simply attach columns with cbind().
Simple feature collection with 3 features and 3 fields
Active geometry column: geometry
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -9.1 ymin: 38.7 xmax: -8.4 ymax: 41.1
Geodetic CRS: WGS 84
name population river_name geometry
1 Lisboa 545796 Tejo POINT (-9.1 38.7)
2 Coimbra 106655 Mondego POINT (-8.4 40.2)
3 Porto 231800 Douro POINT (-8.6 41.1)
river_geometry
1 LINESTRING (-10 37, -9.1 38...
2 LINESTRING (-8 38.5, -8.4 4...
3 LINESTRING (-8 42, -8.6 41....
Another way to join datasets is through geospatial operations. These operations allow us to filter, merge, or subset dataframes based on whether points, lines, and polygons touch, intersect, or contain other geospatial objects.
For example, let’s say we don’t know which river borders which city and we want to use geospatial operations to find out. We can use st_join() to merge both datasets based on intersections. st_join() is basically a left join in which the matches are determined by the relationships between the spatial objects.
# merge city_data with river_data pt_data <-st_join(city_data, river_data["river_name"]) # specify "river_name" to keep it in the merged datasetpt_data
Simple feature collection with 3 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -9.1 ymin: 38.7 xmax: -8.4 ymax: 41.1
Geodetic CRS: WGS 84
name population river_name geometry
1 Lisboa 545796 Tejo POINT (-9.1 38.7)
2 Coimbra 106655 Mondego POINT (-8.4 40.2)
3 Porto 231800 Douro POINT (-8.6 41.1)
st_join() assigns the names of the rivers to the observations in which the point (i.e., city) intersects with the line (i.e., river).
Note: st_join() drops the geometry of the second object.
Similar to st_join(), we can filter, merge, or subset sf objects with other geospatial functions. The most common types of geospatial operations are:
Think Deeper: once you get to the correct answer of question 2, inspect the output of answer_2. What does it mean? How is it different than the intersection of two objects?
Which of the following functions should be used to create an object with a simple feature column and other attributes?
as_data_frame()
st_sfg()
st_sfc()
st_sf()
# Replace ... by your answer of "A", "B", "C" or "D"answer_3 <-"..."test_3()
What are the common coordinate inputs for geographic CRS?
c(lat, lon)
c(lon, lat)
c(x, y)
c(y, x)
# Replace ... by your answer of "A", "B", "C" or "D"answer_4 <-"..."test_4()
Part 2: Hedonic Pricing with Data from Athenian Properties
In this section, we’re going to estimate the value of apartment characteristics with a hedonic pricing function for apartments located in Athens, Greece. We’ll use the geospatial datasets depmunic and properties from the package spData.
Theory
Typically, economists rely on market prices to estimate consumers’ marginal willingness to pay for a good. When the good in question is not traded in markets, economists must resort to other strategies. One of those strategies is the hedonic pricing method.
The main assumption of hedonic pricing is that the total price of a good equals the sum of the prices of its components. Using this assumption, we can deconstruct the price of a good traded in markets and attribute part of the total price to each of its non-traded component parts. For example, we can model the price of an apartment as a function of its size, view, age, location, number of bathrooms, proximity to public service, etc. If we take two apartments that are exactly equal but only differ on their view (e.g., one might be on the 2nd floor and the other on the 30th floor of the same building), their difference in price must be the added value of the view. If we have data on the prices and attributes of several apartments, we could leverage hedonic pricing to estimate how much consumers value a nice view.
Think Deeper: what are the other underlying assumptions of hedonic pricing? When might it not be appropriate to use this method to value non-traded goods?
A model for the hedonic price of an apartment could be estimated with the following regression:
Classes 'sf' and 'data.frame': 7 obs. of 8 variables:
$ num_dep : int 1 2 3 4 5 6 7
$ airbnb : num 2171 721 524 144 231 ...
$ museums : num 17 1 0 0 0 0 0
$ population: num 72962 102439 45168 84544 98283 ...
$ pop_rest : num 8202 5009 2735 4167 5099 ...
$ greensp : num 433582 478951 43312 40656 41369 ...
$ area : num 7.21 4.84 5.62 4.51 4.02 ...
$ geometry :sfc_POLYGON of length 7; first list element: List of 1
..$ : num [1:319, 1:2] 476648 476721 476725 476808 476822 ...
..- attr(*, "class")= chr [1:3] "XY" "POLYGON" "sfg"
- attr(*, "sf_column")= chr "geometry"
- attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA
..- attr(*, "names")= chr [1:7] NA NA NA NA ...
We can see that the dataset has attributes of the districts: airbnb, museums, population, …
We can also see that the CRS used for this dataset is a Greek CRS: Projected CRS: GGRS87 / Greek Grid. This tells us that we’re dealing with a projected CRS tailored to Greece. A quick search will find that the units of reference are in meters.
Note: a crucial part of geospatial data exploration is learning about the CRS of the dataset. This not only affects the relationships within your own dataset but also determines whether you can merge or filter geospatial objects based on other datasets.
Let’s plot the depmunic data. We specify the attribute greenspace ["greensp"] to plot a chart of the green space exclusively.
plot(depmunic["greensp"])
The plot shows that the district in yellow (district 2) is the one with largest greenspace, almost 10 times as much as the district in dark blue (district 4).
Now let’s take a look at our properties data.
head(properties)
Simple feature collection with 6 features and 6 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 475649.5 ymin: 4201115 xmax: 477290.7 ymax: 4207786
Projected CRS: GGRS87 / Greek Grid
id size price prpsqm age dist_metro geometry
7836 7836 74 95000 1283.7838 47 623.874638 POINT (477144.1 4207786)
7238 7238 77 165000 2142.8571 10 509.471981 POINT (475649.5 4207224)
368 368 85 22000 258.8235 42 862.132569 POINT (477269.9 4201115)
938 938 85 225000 2647.0588 51 5.461505 POINT (476099.5 4202088)
6693 6693 102 50000 490.1961 37 312.350026 POINT (476260.6 4206307)
1953 1953 150 500000 3333.3333 37 485.829813 POINT (477290.7 4203248)
str(properties)
Classes 'sf' and 'data.frame': 1000 obs. of 7 variables:
$ id : chr "7836" "7238" "368" "938" ...
$ size : int 74 77 85 85 102 150 70 35 100 71 ...
$ price : int 95000 165000 22000 225000 50000 500000 105000 8500 160000 31000 ...
$ prpsqm : num 1284 2143 259 2647 490 ...
$ age : num 47 10 42 51 37 37 1 47 11 57 ...
$ dist_metro: num 623.87 509.47 862.13 5.46 312.35 ...
$ geometry :sfc_POINT of length 1000; first list element: 'XY' num 477144 4207786
- attr(*, "sf_column")= chr "geometry"
- attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA
..- attr(*, "names")= chr [1:6] "id" "size" "price" "prpsqm" ...
We can see that the dataset has attributes of the properties: size, price, age, …
We can see that the CRS used for this dataset is also the Greek CRS. This is good news: both our datasets are referenced to the same grid.
Let’s plot our properties dataset.
plot(properties["prpsqm"])
The plot shows the distribution of properties with their color representing the price per square meter prpsqm. It’s hard to see a pattern from this chart, but it seems that most of the expensive properties are located towards the south.
Let’s plot our property locations on top of our district boundaries.
plot(depmunic["num_dep"], reset =FALSE)plot(properties["price"], add =TRUE, col ="darkgreen")
Empirical Methods
Let’s try to put together a model to explain the price of properties in Athens using the hedonic pricing method. Our proposed specification will be: \[
P_{i} = \beta_0 + \beta_1 Size_{i} + \beta_2 Age_{i} + \beta_{3} Dist_{i} + \sum_{j=2}^{7} \gamma_{j} D_{ij} + \epsilon
\]
where
\(P_{i}\) is the price (in euros) per square meter of apartment \(i\),
\(Size_{i}\) is the size in square meters of apartment \(i\),
\(Age_{i}\) is the age in years of apartment \(i\),
\(Dist_{i}\) is the distance between apartment \(i\) and the closest metro in meters,
\(D_{ij}\) are dummy variables indexing the districts \(j\) for apartment \(i\).
The dummy variables are included to account for district-specific characteristics, such as the number of museums, parks, area of greenspace, and other district specific factors that we don’t have data on.
Looking at our properties dataset, we see that we already have data on all of our regressors except district location of the apartments. We need to use what we learned about geospatial operations to assign a categorical variable indicating district location to each property.
Below we use the function st_join() to merge both datasets based on the intersection of the spatial elements, and pass the attribute num_dep (the district number) to the merged dataset.
# find location districts for each propertymerged_data <-st_join(properties, depmunic["num_dep"])merged_data
Simple feature collection with 1000 features and 7 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 473914.2 ymin: 4200169 xmax: 480442.4 ymax: 4208576
Projected CRS: GGRS87 / Greek Grid
First 10 features:
id size price prpsqm age dist_metro num_dep geometry
7836 7836 74 95000 1283.7838 47 623.874638 5 POINT (477144.1 4207786)
7238 7238 77 165000 2142.8571 10 509.471981 5 POINT (475649.5 4207224)
368 368 85 22000 258.8235 42 862.132569 2 POINT (477269.9 4201115)
938 938 85 225000 2647.0588 51 5.461505 1 POINT (476099.5 4202088)
6693 6693 102 50000 490.1961 37 312.350026 6 POINT (476260.6 4206307)
1953 1953 150 500000 3333.3333 37 485.829813 1 POINT (477290.7 4203248)
4950 4950 70 105000 1500.0000 1 892.577424 7 POINT (478399.6 4205376)
5037 5037 35 8500 242.8571 47 966.710340 6 POINT (476882.3 4205444)
7673 7673 100 160000 1600.0000 11 737.181066 5 POINT (476985.3 4207504)
6681 6681 71 31000 436.6197 57 296.557268 4 POINT (475656.2 4206286)
Notice that st_join() dropped the district boundaries. That is fine, since num_dep already indicates where each property is located. As a matter of fact, we don’t even need the points anymore. We only needed them to connect the apartments to their district! Let’s clean our data before running our model.
Note: we should drop spatial objects with the function st_drop_geometry(), as opposed to select().
model_data <- merged_data %>%select(size, prpsqm, age, dist_metro, num_dep)%>%mutate(num_dep =as_factor(num_dep))%>%# so that R understand that this is a categorical variablest_drop_geometry()str(model_data)
The most significant factor contributing to the price of an apartment in Athens appears to be district location. Although age and distance to metro are statistically significant, their effects are quite small (as long as the property is not super old or super far from the closest metro). However, we do see a wide range of effects for district locations. Once we control for size, age, and distance from metro, the mere fact of being in district 4 decreases the price per square meter by over 1,000 euros! Maybe it’s the lack of greenspace that we have noted earlier…
Our model breaks down the price of an apartment as follows:
District 1 is the base case (i.e., when all dummies equal zero), and is the most accretive location to the price of a property. That makes sense, since District 1 is the historic center of Athens.
Now, let’s look at how well our model actually predicts the price per square meter of properties. This can shed light on the appropriateness of using hedonic pricing strategies for properties in Athens.
summary(model1)$adj.r.squared
[1] 0.3568844
Our adjusted \(R^2\) says that our model only explains ~35% of the variation in the price of properties in Athens. The cause of low predictive power could be an error of methodology (prices of properties in Athens might not be the sum of their component parts!), or a missing variable problem.
Think Deeper: what component parts of properties might we be missing in our hedonic pricing model?