Lifted from Stanford Data Challenge Lab material by Bill Behrman and Hadley Wickham

Spatial Data in R

Spatial packages

In R, there are two main sets of tools for dealing with spatial data: sp and sf.

  • sp has been around for a while (the first release was in 2005), and it has a rich ecosystem of tools built on top of it. However, it uses a rather complex data structure, which can make it challenging to use.

  • sf is newer (first released in October 2016!). It’s much easier to use and fits in very naturally with the tidyverse.

In this class, we’re going to use sf, so start by installing it:

#install.packages("sf")

Loading data

A shapefile is the way spatial data is most commonly stored. To read spatial data into R from an outside source, use read_sf(). Despite the name, a shapefile isn’t just one file, but is a collection of files that have the same name but different extensions. Typically you’ll have four files:

  • .shp contains the geometry, and .shx contains an index into that geometry.

  • .dbf contains metadata about each geometry (the other columns in the data frame).

  • .prf contains the coordinate system and projection information. You’ll learn more about coordinate reference systems and projections later.

read_sf() can read in the majority of spatial file formats, so don’t worry if your data isn’t in a shapefile; the chances are read_sf() will still be able to read it.

The following example reads an example dataset built into the sf package.

library(sf, warn.conflicts = F)

# The counties of North Carolina
nc <- read_sf(system.file("shape/nc.shp", package = "sf"))

You can also use packages, like USAboundaries, whose functions (like us_cities(), us_counties(), and us_states()) return sf objects.

Here is an example of getting an object that contains the (same) shapes of all North Carolina counties. Try running the code, you will notice that because it’s coming from a different package, some of the additional columns in this data set are different than those in the data set read in above.

library(USAboundaries)
nc2 <- us_counties(states = "NC")

If you get a spatial object created by another package, use st_as_sf() to convert it to sf. For example, you can take data from the maps package (included in base R) and convert it to sf:

You don’t need to run this code

library(maps)

nz_map <- map("nz", plot = FALSE)
class(nz_map)
## [1] "map"
nz_sf <- st_as_sf(nz_map)
class(nz_sf)
## [1] "sf"         "data.frame"

Data structure

nc is an ordinary data frame (or tibble), with one exception: the geometry column. The geometry column is a list column and can contain additional lists. It contains the simple features, a standard way of representing two dimesional geometries like points, lines, polygons, multilines, and multipolygons. Multilines and multipolygons are nededed to represent geographic phenomena like a river with multiple branches, or a state made up of multiple islands.

Run the code below and take some time to explore this data set and look at the structure of the geometry column especially.

glimpse(nc)
nc
nc$geometry 

## note the numbers for each MULTIPOLYGON are defining xy coordinate pairs. The type of geometry tells R whether those coordinate pairs should be connected with a line (or not, or something else). 

str(nc$geometry[[1]])

Use plot() to show the geometry. You’ll learn how to use ggplot2 for more complex data visualisations later.

plot(nc$geometry) #This plots all the polygons in the dataset

plot(nc$geometry[[1]]) #This just plots the polygon(s) in the first item in the list column

Note the use of [[ to extract a single element, here, the first polygon.

The geometry column is a list-column. In brief, they’re the richest and most complex type of column because a list can contain any other data structure, including other lists.

The geometry column is a list of lists of matrices:

  • The top-level list has one element for each “land mass” in the county. However, we can find a more interesting case. Below we find a county made up of three non-contiguous pieces.
    n <- nc$geometry %>% map_int(length)
    table(n) #This will show that 94 rows of the geometry column are lists with just one element, 4 rows have lists with 2 elements, and 2 rows have lists with 3 elements (i.e. nested lists)
## n
##  1  2  3 
## 94  4  2
    interesting <- nc$geometry[n == 3][[1]] #filter the column to only include the list with 3 elements
    plot(interesting)

The second-level list is not used in this dataset, but is needed when you have a landmass that contains a lake. (Or a landmass that contains a lake which has an island which has a pond).

Each row of the matrix gives the location of a point on the boundary of the polygon.

It is recommended to read the information at this link: https://r-spatial.github.io/sf/articles/sf1.html

Read at least through the “sfc: simple feature geometry list column” section.

Manipulating with dplyr

Since an sf object is just a data frame, you can manipulate it with dplyr. The following example gives you a taste:

nc %>%
  mutate(area = as.numeric(st_area(geometry))) %>%
  select(NAME, FIPS, geometry, area)

st_area() returns the land area for the geometries in that row. It has units in (i.e. m2), which is precise, but a little annoying to work with. I used as.numeric() to convert to a regular numeric vector.

There are many functions from the sf package that can be performed on the geometry column. Use st_ and the autofill to learn more about them, or check out the sf cheatsheet. We will not be using much/any of these functions in our class.

Spatial Visualization using ggplot

The easiest way to get started is to supply an sf object to geom_sf():

ggplot() +
  geom_sf(data = nc)

Notice that ggplot2 takes care of setting the aspect ratio correctly.

You can supply other aesthetics: for polygons, fill is most useful:

ggplot() +
  geom_sf(aes(fill = AREA), data = nc, colour = "white")

You can add layers using other data just as you would normally with a ggplot command.

## Gets the geometry for all the contiguous 48 states ##
state_shape <- us_states() %>% 
  filter(jurisdiction_type == "state", 
         state_abbr != "AK", 
         state_abbr != "HI")  

## Plots the two datasets ##
ggplot() +
  geom_sf(data = state_shape) + 
  geom_sf(data = nc)

You can combine geom_sf() with other geoms. In this case, x and y positions are typically assumed to be longitude and latitude.

ggplot() +
  geom_sf(data = nc) +
  annotate("point", x = -80, y = 35, colour = "red", size = 4)

The above example plots just one point using annotate, but if you have an sf object with point geometries you can layer those on the plot just as we combined layers of US states shapes with North Carolina county shapes.

An example from beginning to end

This video contains a full example of creating a map using sf objects. https://www.loom.com/share/4ca3d1009d904a0faf4a3628824b9bb0?sid=47d26603-900f-473f-99c5-273714d12eb4