Lifted from Stanford Data Challenge Lab material by Bill Behrman and Hadley Wickham
In R, there are two main sets of tools for dealing with spatial data: sp and sf.
sp has been around for a while (the first release was in 2005), and it has a rich ecosystem of tools built on top of it. However, it uses a rather complex data structure, which can make it challenging to use.
sf is newer (first released in October 2016!). It’s much easier to use and fits in very naturally with the tidyverse.
In this class, we’re going to use sf, so start by installing it:
#install.packages("sf")
A shapefile is the way spatial data is most commonly
stored. To read spatial data into R from an outside source, use
read_sf()
. Despite the name, a shapefile isn’t just one
file, but is a collection of files that have the same name but different
extensions. Typically you’ll have four files:
.shp
contains the geometry, and .shx
contains an index into that geometry.
.dbf
contains metadata about each geometry (the
other columns in the data frame).
.prf
contains the coordinate system and projection
information. You’ll learn more about coordinate reference systems and
projections later.
read_sf()
can read in the majority of spatial file
formats, so don’t worry if your data isn’t in a shapefile; the chances
are read_sf()
will still be able to read it.
The following example reads an example dataset built into the sf package.
library(sf, warn.conflicts = F)
# The counties of North Carolina
nc <- read_sf(system.file("shape/nc.shp", package = "sf"))
You can also use packages, like USAboundaries, whose functions (like
us_cities()
, us_counties()
, and
us_states()
) return sf objects.
Here is an example of getting an object that contains the (same) shapes of all North Carolina counties. Try running the code, you will notice that because it’s coming from a different package, some of the additional columns in this data set are different than those in the data set read in above.
library(USAboundaries)
nc2 <- us_counties(states = "NC")
If you get a spatial object created by another package, use
st_as_sf()
to convert it to sf. For example, you can take
data from the maps package (included in base R) and convert it to
sf:
You don’t need to run this code
library(maps)
nz_map <- map("nz", plot = FALSE)
class(nz_map)
## [1] "map"
nz_sf <- st_as_sf(nz_map)
class(nz_sf)
## [1] "sf" "data.frame"
nc
is an ordinary data frame (or tibble), with one
exception: the geometry column. The
geometry column is a list column and can contain
additional lists. It contains the simple features, a
standard way of representing two dimesional geometries like points,
lines, polygons, multilines, and multipolygons. Multilines and
multipolygons are nededed to represent geographic phenomena like a river
with multiple branches, or a state made up of multiple islands.
Run the code below and take some time to explore this data set and look at the structure of the geometry column especially.
glimpse(nc)
nc
nc$geometry
## note the numbers for each MULTIPOLYGON are defining xy coordinate pairs. The type of geometry tells R whether those coordinate pairs should be connected with a line (or not, or something else).
str(nc$geometry[[1]])
Use plot()
to show the geometry. You’ll learn how to use
ggplot2 for more complex data visualisations later.
plot(nc$geometry) #This plots all the polygons in the dataset
plot(nc$geometry[[1]]) #This just plots the polygon(s) in the first item in the list column
Note the use of [[
to extract a single element, here,
the first polygon.
The geometry column is a list-column. In brief, they’re the richest and most complex type of column because a list can contain any other data structure, including other lists.
The geometry column is a list of lists of matrices:
n <- nc$geometry %>% map_int(length)
table(n) #This will show that 94 rows of the geometry column are lists with just one element, 4 rows have lists with 2 elements, and 2 rows have lists with 3 elements (i.e. nested lists)
## n
## 1 2 3
## 94 4 2
interesting <- nc$geometry[n == 3][[1]] #filter the column to only include the list with 3 elements
plot(interesting)
The second-level list is not used in this dataset, but is needed when you have a landmass that contains a lake. (Or a landmass that contains a lake which has an island which has a pond).
Each row of the matrix gives the location of a point on the boundary of the polygon.
It is recommended to read the information at this link: https://r-spatial.github.io/sf/articles/sf1.html
Read at least through the “sfc: simple feature geometry list column” section.
Since an sf object is just a data frame, you can manipulate it with dplyr. The following example gives you a taste:
nc %>%
mutate(area = as.numeric(st_area(geometry))) %>%
select(NAME, FIPS, geometry, area)
st_area()
returns the land area for the geometries in
that row. It has units in (i.e. m2), which is
precise, but a little annoying to work with. I used
as.numeric()
to convert to a regular numeric vector.
There are many functions from the sf package that can be performed on the geometry column. Use st_ and the autofill to learn more about them, or check out the sf cheatsheet. We will not be using much/any of these functions in our class.
The easiest way to get started is to supply an sf object to
geom_sf()
:
ggplot() +
geom_sf(data = nc)
Notice that ggplot2 takes care of setting the aspect ratio correctly.
You can supply other aesthetics: for polygons, fill
is
most useful:
ggplot() +
geom_sf(aes(fill = AREA), data = nc, colour = "white")
You can add layers using other data just as you would normally with a ggplot command.
## Gets the geometry for all the contiguous 48 states ##
state_shape <- us_states() %>%
filter(jurisdiction_type == "state",
state_abbr != "AK",
state_abbr != "HI")
## Plots the two datasets ##
ggplot() +
geom_sf(data = state_shape) +
geom_sf(data = nc)
You can combine geom_sf()
with other geoms. In this
case, x
and y
positions are typically assumed
to be longitude and latitude.
ggplot() +
geom_sf(data = nc) +
annotate("point", x = -80, y = 35, colour = "red", size = 4)
The above example plots just one point using annotate, but if you have an sf object with point geometries you can layer those on the plot just as we combined layers of US states shapes with North Carolina county shapes.
This video contains a full example of creating a map using sf objects. https://www.loom.com/share/4ca3d1009d904a0faf4a3628824b9bb0?sid=47d26603-900f-473f-99c5-273714d12eb4