17 Open Data in R and ROpenSci
17.1 Learning Objectives
In this lesson, you will:
- Understand what open data is and why/how its useful
- Be aware of the open data ecosystem in R
- Become familiar with a few packages rOpenSci provides
Getting data is a critical step in most research, yet it can sometimes be one of the most difficult and time-consuming steps. This is especially true in synthesis research, which may incorporate hundreds of thousands of datasets in the analysis.
The first report of the Open Research Data Task Force has found that two of the greatest challenges to effectively using open research data are that: even when it is notionally accessible researchers often simply cannot find that data, and if they do find it they cannot use it because of frustrating format variabilities and other compatibility issues.
17.3 Open data
Data can come from many sources. On a continuum from least good to most good, we might have:
- Data on a researcher’s hard drive
- Data on institution website or FTP server
- Data on some sort of portal behind a wall of some sort (e.g., accounts)
- Data in an open repository (no API)
- Data in an open repository (w/ public API)
A really great list of R packages for getting at open data can be found here:
So what is open data? Open data is data that are:
- Properly licensed for re-use
- Accessible w/o gates (e.g., paywall, login)
- Use open formats (formats you can work with)
17.4 What is rOpenSci?
At rOpenSci we are creating packages that allow access to data repositories through the R statistical programming environment that is already a familiar part of the workflow of many scientists.
- Data Publication
- Data Access
- Scalable & Reproducible Computing
- Data Vizualization
- Image Processing
- Data Tools
- HTTP tools
- Data Analysis
Full list of packages: https://ropensci.org/packages/
Many of these are on CRAN and can be installed via
install.packages() but some are not.
rOpenSci addresses the issues raised in that top quote.
17.5 Overview of some of the interetsing packages rOpenSci provides
Let’s go through a couple of packages sponsored by rOpenSci to demonstrate the power of open data + APIs + R.
rnoaa: R interface to many NOAA data APIs
Access data like:
- Air temps
- Sea ice extent
- Buoy data
- Tons more!
install.packages("rnoaa") # run once install.packages("ncdf4") # run once
Go here: http://www.ndbc.noaa.gov/ Find a station ID, like https://www.ndbc.noaa.gov/station_page.php?station=46080
bd <- buoy(dataset = 'cwind', buoyid = 46080, year = 2018, datatype = "c") wind_speed <- data.frame(time = as.POSIXct(bd$data$time), speed = bd$data$wind_spd, stringsAsFactors = F)
ggplot(wind_speed, aes(x = time, y = speed)) + geom_point() + xlab("") + ylab("Wind Speed (m/s)") + ggtitle("2018 NOAA buoy observations near Kodiak, AK") + theme_bw()
mapr: Mapping Species Occurrence Data
install.packages("mapr", dependencies = TRUE) # run once
Plot the locations of GBIF species occurrence data for grizzly and polar bears, with different colors for each species.
spp <- c('Ursus arctos', 'Ursus maritimus') dat <- occ(query = spp, from = 'gbif', has_coords = TRUE, limit = 500)
map_leaflet(dat, size = 1, color = c("brown", "gray"), alpha = 1)
- Open data greatly assist in the data acquisition step in research
- Finding open data is still hard
- R, via rOpenSci, has a lot of packages for accessing open data already available to you