# 19 Spatial vector analysis using `sf`

## 19.1 Learning Objectives

In this lesson, you will learn:

• How to use the `sf` package to analyze geospatial data
• Static mapping with ggplot
• interactive mapping with `leaflet`

## 19.2 Introduction

From the sf vignette:

Simple features or simple feature access refers to a formal standard (ISO 19125-1:2004) that describes how objects in the real world can be represented in computers, with emphasis on the spatial geometry of these objects. It also describes how such objects can be stored in and retrieved from databases, and which geometrical operations should be defined for them.

The sf package is an R implementation of Simple Features. This package incorporates:

• a new spatial data class system in R
• functions for reading and writing data
• tools for spatial operations on vectors

Most of the functions in this package starts with prefix `st_` which stands for spatial and temporal.

In this tutorial, our goal is to use a shapefile of Alaska regions and data on population in Alaska by community to create a map that looks like this: All of the data used in this tutorial is available at https://github.com/NCEAS/arctic-data-training, in the `materials/arctic-data-center-training/data/shapefiles` directory.

Let’s read in the region shapefile data. In this tutorial, we use a simplified version of this dataset: Jared Kibele and Jeanette Clark. 2018. State of Alaska’s Salmon and People Regional Boundaries. Knowledge Network for Biocomplexity. doi:10.5063/F1125QWP.

This simplified version is used in this tutorial to speed processing and plotting, and contains topological errors. See the original to obtain the topologically correct version.

``````## Read in shapefile using sf

``````## Simple feature collection with 6 features and 3 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -179.2296 ymin: 51.15702 xmax: 179.8567 ymax: 71.43957
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 6 x 4
##   region_id region     mgmt_area                                   geometry
##       <int> <chr>          <dbl>                         <MULTIPOLYGON [°]>
## 1         1 Aleutian …         3 (((-171.1345 52.44974, -171.1686 52.41744…
## 2         2 Arctic             4 (((-139.9552 68.70597, -139.9893 68.70516…
## 3         3 Bristol B…         3 (((-159.8745 58.62778, -159.8654 58.61376…
## 4         4 Chignik            3 (((-155.8282 55.84638, -155.8049 55.86557…
## 5         5 Copper Ri…         2 (((-143.8874 59.93931, -143.9165 59.94034…
## 6         6 Kodiak             3 (((-151.9997 58.83077, -152.0358 58.82714…``````

sf objects usually have two types - `sf` and `data.frame`. Two main differences comparing to a regular `data.frame` object are spatial metadata (`geometry type`, `dimension`, `bbox`, `epsg (SRID)`, `proj4string`) and additional column - typically named `geom` or `geometry`.

``class(ak_shp_sf)``
``##  "sf"         "tbl_df"     "tbl"        "data.frame"``

### 19.3.1 Coordinate Reference System

Every `sf` object needs a coordinate reference system (or `crs`) defined in order to work with it correctly. A coordinate reference system contains both a datum and a projection. The datum is how you georeference your points (in 3 dimensions!) onto a spheroid. The projection is how these points are mathematically transformed to represent the georeferenced point on a flat piece of paper. All coordinate reference systems require a datum. However, some coordinate reference systems are “unprojected” (also called geographic coordinate systems). Coordinates in latitude/longitude use a geographic (unprojected) coordinate system. One of the most commonly used geographic coordinate systems is WGS 1984.

You can view what `crs` is set by using the function `st_crs`

``st_crs(ak_shp_sf)``
``````## Coordinate Reference System:
##   EPSG: 4326
##   proj4string: "+proj=longlat +datum=WGS84 +no_defs"``````

This is pretty confusing looking. Without getting into the details, that long string says that this data has a greographic coordinate system (WGS84) with no projection. A convenient way to reference `crs` quickly is by using the EPSG code, a number that represents a standard projection and datum. You can check out a list of (lots!) of EPSG codes here.

You will often need to transform your geospatial data from one coordinate system to another. The `st_transform` function does this quickly for us. You may have noticed the maps above looked wonky because of the dateline. We might want to set a different projection for this data so it plots nicer. A good one for Alaska is called the Alaska Albers projection, with an EPSG code of 3338.

``````ak_shp_sf <- ak_shp_sf %>%
st_transform(crs = 3338)

st_crs(ak_shp_sf)``````
``````## Coordinate Reference System:
##   EPSG: 3338
##   proj4string: "+proj=aea +lat_1=55 +lat_2=65 +lat_0=50 +lon_0=-154 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"``````
``plot(ak_shp_sf)`` Much better!

### 19.3.2 Attributes

sf objects can be used as a regular `data.frame` object in many operations

``ak_shp_sf``
``````## Simple feature collection with 13 features and 3 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -2175328 ymin: 405653.9 xmax: 1579226 ymax: 2383770
## epsg (SRID):    3338
## proj4string:    +proj=aea +lat_1=55 +lat_2=65 +lat_0=50 +lon_0=-154 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## # A tibble: 13 x 4
##    region_id region      mgmt_area                                 geometry
##        <int> <chr>           <dbl>                       <MULTIPOLYGON [m]>
##  1         1 Aleutian I…         3 (((-1156666 420855.1, -1159837 417990.3…
##  2         2 Arctic              4 (((571289.9 2143072, 569941.5 2142691, …
##  3         3 Bristol Bay         3 (((-339688.6 973904.9, -339302 972297.3…
##  4         4 Chignik             3 (((-114381.9 649966.8, -112866.8 652065…
##  5         5 Copper Riv…         2 (((561012 1148301, 559393.7 1148169, 55…
##  6         6 Kodiak              3 (((115112.5 983293, 113051.3 982825.9, …
##  7         7 Kotzebue            4 (((-678815.3 1819519, -677555.2 1820698…
##  8         8 Kuskokwim           4 (((-1030125 1281198, -1029858 1282333, …
##  9         9 Cook Inlet          2 (((35214.98 1002457, 36660.3 1002038, 3…
## 10        10 Norton Sou…         4 (((-848357 1636692, -846510 1635203, -8…
## 11        11 Prince Wil…         2 (((426007.1 1087250, 426562.5 1088591, …
## 12        12 Southeast           1 (((1287777 744574.1, 1290183 745970.8, …
## 13        13 Yukon               4 (((-375318 1473998, -373723.9 1473487, …``````
``nrow(ak_shp_sf)``
``##  13``
``ncol(ak_shp_sf)``
``##  4``

## 19.4`sf` & the Tidyverse

Since `sf` objects are dataframes, they play nicely with packages in the tidyverse. Here are a couple of simple examples:

`select()`

``````ak_shp_sf %>%
select(region)``````
``````## Simple feature collection with 13 features and 1 field
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -2175328 ymin: 405653.9 xmax: 1579226 ymax: 2383770
## epsg (SRID):    3338
## proj4string:    +proj=aea +lat_1=55 +lat_2=65 +lat_0=50 +lon_0=-154 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## # A tibble: 13 x 2
##    region                                                          geometry
##    <chr>                                                 <MULTIPOLYGON [m]>
##  1 Aleutian Islands (((-1156666 420855.1, -1159837 417990.3, -1161898 4169…
##  2 Arctic           (((571289.9 2143072, 569941.5 2142691, 569158.2 214214…
##  3 Bristol Bay      (((-339688.6 973904.9, -339302 972297.3, -339229.2 971…
##  4 Chignik          (((-114381.9 649966.8, -112866.8 652065.8, -108836.8 6…
##  5 Copper River     (((561012 1148301, 559393.7 1148169, 557797.7 1148492,…
##  6 Kodiak           (((115112.5 983293, 113051.3 982825.9, 110801.3 983211…
##  7 Kotzebue         (((-678815.3 1819519, -677555.2 1820698, -675557.8 182…
##  8 Kuskokwim        (((-1030125 1281198, -1029858 1282333, -1028980 128403…
##  9 Cook Inlet       (((35214.98 1002457, 36660.3 1002038, 36953.11 1001186…
## 10 Norton Sound     (((-848357 1636692, -846510 1635203, -840513.7 1632225…
## 11 Prince William … (((426007.1 1087250, 426562.5 1088591, 427711.6 108999…
## 12 Southeast        (((1287777 744574.1, 1290183 745970.8, 1292940 746262.…
## 13 Yukon            (((-375318 1473998, -373723.9 1473487, -373064.8 14739…``````

Note the sticky geometry column! The geometry column will stay with your `sf` object even if it is not called explicitly.

`filter()`

``````ak_shp_sf %>%
filter(region == "Southeast")``````
``````## Simple feature collection with 1 feature and 3 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 559475.7 ymin: 722450 xmax: 1579226 ymax: 1410576
## epsg (SRID):    3338
## proj4string:    +proj=aea +lat_1=55 +lat_2=65 +lat_0=50 +lon_0=-154 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## # A tibble: 1 x 4
##   region_id region   mgmt_area                                     geometry
##       <int> <chr>        <dbl>                           <MULTIPOLYGON [m]>
## 1        12 Southea…         1 (((1287777 744574.1, 1290183 745970.8, 1292…``````

### 19.4.1 Joins

You can also use the `sf` package to create spatial joins, useful for when you want to utilize two datasets together. As an example, let’s ask a question: how many people live in each of these Alaska regions?

We have some population data, but it gives the number of people by city, not by region. To determine the number of people per region we will need to:

• read in the city data from a csv and turn it into an `sf` object
• use a spatial join (`st_join`) to assign each city to a region
• use `group_by` and `summarize` to calculate the total population by region

First, read in the population data as a regular `data.frame`. This data is derived from: Jeanette Clark, Sharis Ochs, Derek Strong, and National Historic Geographic Information System. 2018. Languages used in Alaskan households, 1990-2015. Knowledge Network for Biocomplexity. doi:10.5063/F11G0JHX. Unnecessary columns were removed and the most recent year of data was selected.

``pop <- read.csv("data/shapefiles/alaska_population.csv")``

The `st_join` function is a spatial left join. The arguments for both the left and right tables are objects of class `sf` which means we will first need to turn our population `data.frame` with latitude and longitude coordinates into an `sf` object.

We can do this easily using the `st_as_sf` function, which takes as arguments the coordinates and the `crs`. The `remove = F` specification here ensures that when we create our `geometry` column, we retain our original `lat` `lng` columns, which we will need later for plotting. Although it isn’t said anywhere explicitly in the file, let’s assume that the coordinate system used to reference the latitude longitude coordinates is WGS84, which has a `crs` number of 4236.

``````pop_sf <- st_as_sf(pop,
coords = c('lng', 'lat'),
crs = 4326,
remove = F)

``````## Simple feature collection with 6 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -176.6581 ymin: 51.88 xmax: -154.1703 ymax: 62.68889
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##   year     city      lat       lng population                   geometry
## 1 2015     Adak 51.88000 -176.6581        122    POINT (-176.6581 51.88)
## 2 2015   Akhiok 56.94556 -154.1703         84 POINT (-154.1703 56.94556)
## 3 2015 Akiachak 60.90944 -161.4314        562 POINT (-161.4314 60.90944)
## 4 2015    Akiak 60.91222 -161.2139        399 POINT (-161.2139 60.91222)
## 5 2015   Akutan 54.13556 -165.7731        899 POINT (-165.7731 54.13556)
## 6 2015 Alakanuk 62.68889 -164.6153        777 POINT (-164.6153 62.68889)``````

Now we can do our spatial join! You can specify what geometry function the join uses (`st_intersects`, `st_within`, `st_crosses`, `st_is_within_distance`, …) in the `join` argument. The geometry function you use will depend on what kind of operation you want to do, and the geometries of your shapefiles.

In this case, we want to find what region each city falls within, so we will use `st_within`.

``pop_joined_sf <- st_join(pop_sf, ak_shp_sf, join = st_within)``

This gives an error!

``Error: st_crs(x) == st_crs(y) is not TRUE``

Turns out, this won’t work right now because our coordinate reference systems are not the same. Luckily, this is easily resolved using `st_transform`, and projecting our population object into Alaska Albers.

``pop_sf <- st_transform(pop_sf, crs = 3338)``
``````pop_joined_sf <- st_join(pop_sf, ak_shp_sf, join = st_within)

plot(pop_joined_sf["region"])`````` ### 19.4.2 Group and summarize

Next we compute the total population for each region. In this case, we want to do a `group_by` and `summarise` as this were a regular `data.frame` - otherwise all of our point geometries would be aggregated by region which is not what we want. We remove the sticky geometry using `as.data.frame`, on the advice of the `sf::tidyverse` help page.

``````pop_region <- pop_joined_sf %>%
as.data.frame() %>%
group_by(region) %>%
summarise(total_pop = sum(population))

``````## # A tibble: 6 x 2
##   region           total_pop
##   <chr>                <int>
## 1 Aleutian Islands      8840
## 2 Arctic                8419
## 3 Bristol Bay           6947
## 4 Chignik                311
## 5 Cook Inlet          408254
## 6 Copper River          2294``````

And use a regular `left_join` to get the information back to the Alaska region shapefile. Note that we need this step in order to retain our region geometries so that we can make some maps.

``ak_pop_sf <- left_join(ak_shp_sf, pop_region)``
``## Joining, by = "region"``
``````#plot to check
plot(ak_pop_sf["total_pop"])`````` So far, we have learned how to use `sf` and `dplyr` to use a spatial join on two datasets and calculate a summary metric from the result of that join.

The `group_by` and `summarize` functions can also be used on `sf` objects to summarize within a dataset and combine geometries. Many of the `tidyverse` functions have methods specific for `sf` objects, some of which have additional arguments that wouldn’t be relevant to the `data.frame` methods. You can run `?sf::tidyverse` to get documentation on the `tidyverse` `sf` methods.

Let’s try some out. Say we want to calculate the population by Alaska management area, as opposed to region.

``````ak_mgmt <- ak_pop_sf %>%
group_by(mgmt_area) %>%
summarize(total_pop = sum(total_pop))

plot(ak_mgmt["total_pop"])`````` Notice that the region geometries were combined into a single polygon for each management area.

If we don’t want to combine geometries, we can specifcy `do_union = F` as an argument.

``````ak_mgmt <- ak_pop_sf %>%
group_by(mgmt_area) %>%
summarize(total_pop = sum(total_pop), do_union = F)

plot(ak_mgmt["total_pop"])`````` ### 19.4.3 Save

Save the spatial object to disk using `write_sf()` and specifying the filename. Writing your file with the extension .shp will assume an ESRI driver driver, but there are many other format options available.

``write_sf(ak_pop_sf, "shapefiles/ak_regions_population.shp", delete_layer = TRUE)``

## 19.5 Visualize with ggplot

`ggplot2` now has integrated functionality to plot sf objects using `geom_sf()`.

We can plot `sf` objects just like regular data.frames using `geom_sf`.

``````ggplot(ak_pop_sf) +
geom_sf(aes(fill = total_pop)) +
theme_bw() +
labs(fill = "Total Population") +
scale_fill_continuous(low = "khaki", high =  "firebrick", labels = comma)`````` We can also plot multiple shapefiles in the same plot. Say if we want to visualize rivers in Alaska, in addition to the location of communities, since many communities in Alaska are on rivers. We can read in a rivers shapefile, doublecheck the `crs` to make sure it is what we need, and then plot all three shapefiles.

The rivers shapefile is a simplified version of Jared Kibele and Jeanette Clark. Rivers of Alaska grouped by SASAP region, 2018. Knowledge Network for Biocomplexity. doi:10.5063/F1SJ1HVW.

``````rivers <- read_sf("data/shapefiles/ak_rivers_simp.shp")
st_crs(rivers)``````
``````## Coordinate Reference System:
##   No EPSG code
##   proj4string: "+proj=aea +lat_1=55 +lat_2=65 +lat_0=50 +lon_0=-154 +x_0=0 +y_0=0 +ellps=GRS80 +units=m +no_defs"``````
``````ggplot() +
geom_sf(data = ak_pop_sf, aes(fill = total_pop)) +
geom_sf(data = rivers, aes(size = StrOrder), color = "black") +
geom_sf(data = pop_sf, aes(), size = .5) +
scale_size(range = c(0.01, 0.2), guide = F) +
theme_bw() +
labs(fill = "Total Population") +
scale_fill_continuous(low = "khaki", high =  "firebrick", labels = comma)`````` ## 19.6 Visualize with leaflet

We can also make an interactive map using `leaflet`.

Leaflet (unlike ggplot) will project data for you. The catch is that you have to give it both a projection (like Alaska Albers), and that your shapefile must use a geographic coordinate system. This means that we need to use our shapefile with the 4326 EPSG code. Remember you can always check what `crs` you have set using `st_crs`.

Here we define a leaflet projection for Alaska Albers, and save it as a variable to use later.

``````epsg3338 <- leaflet::leafletCRS(
crsClass = "L.Proj.CRS",
code = "EPSG:3338",
proj4def =  "+proj=aea +lat_1=55 +lat_2=65 +lat_0=50 +lon_0=-154 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs",
resolutions = 2^(16:7))``````

You might notice that this looks familiar! The syntax is a bit different, but most of this information is also contained within the `crs` of our shapefile:

``st_crs(ak_pop_sf)``
``````## Coordinate Reference System:
##   EPSG: 3338
##   proj4string: "+proj=aea +lat_1=55 +lat_2=65 +lat_0=50 +lon_0=-154 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"``````

Since `leaflet` requires that we use an unprojected coordinate system, let’s use `st_transform` yet again to get back to WGS84.

``ak_pop_crs <- ak_pop_sf %>% st_transform(crs = 4326)``
``````m <- leaflet(options = leafletOptions(crs = epsg3338)) %>%