18 Open Data in R and rOpenSci

This lesson was created by Bryce Mecum and Matt Jones.

18.1 Learning Objectives

In this lesson, you will:

  • Understand what open data is, and why and how it's useful
  • Be aware of the open data ecosystem in R
  • Become familiar with a few packages rOpenSci provides

18.2 Introduction

Getting data is a critical step in most research, yet it can sometimes be one of the most difficult and time-consuming steps. This is especially true in synthesis research, which may incorporate hundreds of thousands of datasets in the analysis.

The first report of the Open Research Data Task Force has found that two of the greatest challenges to effectively using open research data are that: even when it is notionally accessible researchers often simply cannot find that data, and if they do find it they cannot use it because of frustrating format variabilities and other compatibility issues.

From: http://www2.warwick.ac.uk/newsandevents/pressreleases/task_force_finds/

18.3 Open data

Data can come from many sources. On a continuum from worst to best, we might have:

  • Data on a researcher’s hard drive
  • Data on an institution's website or FTP server
  • Data in a portal behind an access wall (e.g., one that requires an account)
  • Data in an open repository (no API)
  • Data in an open repository (with a public API)

A really great list of R packages for getting at open data can be found here:

So what is open data? Open data are data that:

  • Are properly licensed for re-use
  • Are accessible without gates (e.g., paywall, login)
  • Use open formats (formats you can work with)

18.4 What is rOpenSci?

From https://ropensci.org/:

At rOpenSci we are creating packages that allow access to data repositories through the R statistical programming environment that is already a familiar part of the workflow of many scientists.

Package categories:

  • Data Publication
  • Data Access
  • Literature
  • Altmetrics
  • Scalable & Reproducible Computing
  • Databases
  • Data Visualization
  • Image Processing
  • Data Tools
  • Taxonomy
  • HTTP tools
  • Geospatial
  • Data Analysis

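Full list of packages: https://ropensci.org/packages/

Many of these are on CRAN and can be installed via install.packages(), but some are not. rOpenSci aims to address both issues raised in the Open Research Data Task Force quote above: finding the data, and dealing with incompatible formats once you have it.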
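Because some packages are on CRAN and some are not, it can help to install only what is missing and to be explicit about where a package comes from. The sketch below is one way to do that, not an official rOpenSci workflow. The helper name install_if_missing is hypothetical, and the R-universe URL in the comments is an assumption about where a given package is published; check the package's own README before relying on it.

```r
# Hypothetical helper: install a package only if it is not already available.
# Extra arguments (such as repos) are passed through to install.packages().
install_if_missing <- function(pkg, ...) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, ...)
  }
  # Report (invisibly) whether the package can now be loaded
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so nothing is downloaded here
ok <- install_if_missing("stats")

# Not run: a package on CRAN
# install_if_missing("rfishbase")

# Not run: a package from rOpenSci's R-universe (assumed location),
# keeping a CRAN mirror in the list so dependencies can still be found
# install_if_missing(
#   "rfishbase",
#   repos = c("https://ropensci.r-universe.dev", "https://cloud.r-project.org")
# )
```

Packages that live only on GitHub are commonly installed with remotes::install_github("owner/repo") instead.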

18.5 Overview of some of the interesting packages rOpenSci provides

Let’s go through a couple of packages sponsored by rOpenSci to demonstrate the power of open data + APIs + R.

18.5.1 rfishbase: R interface to the fishbase.org database

install.packages("fansi") # string formatting package
install.packages("rfishbase")
library(rfishbase)

# Start from a vector of scientific names
fish <- c("Oreochromis niloticus", "Salmo trutta")
# Check those names against FishBase
fish <- validate_names(fish)
# Or list every species in a genus (this overwrites the previous value of fish)
fish <- species_list(Genus = "Labroides")
fish
## [1] "Labroides dimidiatus"    "Labroides bicolor"      
## [3] "Labroides pectoralis"    "Labroides phthirophagus"
## [5] "Labroides rubrolabiatus"
# Look up scientific names whose common name matches "trout"
fish_common <- common_to_sci("trout")
fish_common
## # A tibble: 279 x 4
##    Species                   ComName              Language SpecCode
##    <chr>                     <chr>                <chr>       <int>
##  1 Salmo obtusirostris       Adriatic trout       English      6210
##  2 Schizothorax richardsonii Alawan snowtrout     English      8705
##  3 Schizopyge niger          Alghad snowtrout     English     24454
##  4 Salvelinus fontinalis     American brook trout English       246
##  5 Salmo trutta              Amu-Darya trout      English       238
##  6 Oncorhynchus apache       Apache Trout         English      2687
##  7 Oncorhynchus apache       Apache trout         English      2687
##  8 Plectropomus areolatus    Apricot trout        English      6082
##  9 Salmo trutta              Aral Sea Trout       English       238
## 10 Salmo trutta              Aral trout           English       238
## # ... with 269 more rows
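
The output above shows that several common names can map to the same species (Salmo trutta appears three times in the first ten rows). A minimal sketch of collapsing such a result to one row per species follows. To keep it runnable without a network connection, it uses a small data frame copied by hand from the rows printed above rather than a live query; the object names trout, trout_species, and names_per_species are illustrative only.

```r
# A few rows copied by hand from the printed output above (not a live query)
trout <- data.frame(
  Species = c("Salmo obtusirostris", "Salvelinus fontinalis", "Salmo trutta",
              "Oncorhynchus apache", "Oncorhynchus apache", "Salmo trutta"),
  ComName = c("Adriatic trout", "American brook trout", "Amu-Darya trout",
              "Apache Trout", "Apache trout", "Aral Sea Trout"),
  SpecCode = c(6210L, 246L, 238L, 2687L, 2687L, 238L),
  stringsAsFactors = FALSE
)

# Drop the common-name column and keep one row per species
trout_species <- unique(trout[, c("Species", "SpecCode")])

# Count how many common names were recorded for each species
names_per_species <- table(trout$Species)

print(trout_species)
print(names_per_species)
```

The same two steps should work on fish_common itself, since it has the same Species, ComName, and SpecCode columns.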

18.5.2 rnoaa: R interface to many NOAA data APIs

Access data like:

  • Air temps
  • Sea ice extent
  • Buoy data
  • Tons more!

The source for the package's functions is at https://github.com/ropensci/rnoaa/tree/master/R

install.packages("rnoaa")
install.packages("ncdf4") # reads the NetCDF files that buoy() downloads
library(rnoaa)

# Browse stations at http://www.ndbc.noaa.gov/
# Find a station ID, like https://www.ndbc.noaa.gov/station_page.php?station=46080
# Download the 2016 continuous wind data for that buoy once and keep the result
bd <- buoy(dataset = 'cwind', buoyid = 46080, year = 2016, datatype = "c")
bd
# Plot wind speed in observation order
plot(bd$data$wind_spd)
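
Plotting every observation by index can be hard to read, so a common next step is to summarize by day. The sketch below shows one way to do that in base R. It runs on a tiny stand-in data frame with made-up values so that it works offline. The wind_spd column comes from the code above, but the time column and its ISO 8601 text format are assumptions about what bd$data contains, so inspect names(bd$data) before adapting it.

```r
# Stand-in for bd$data with made-up values; the time column is an assumption
wind <- data.frame(
  time = c("2016-01-01T00:00:00Z", "2016-01-01T12:00:00Z",
           "2016-01-02T00:00:00Z", "2016-01-02T12:00:00Z"),
  wind_spd = c(4, 6, 10, NA),
  stringsAsFactors = FALSE
)

# Keep only the calendar date (first 10 characters of the timestamp)
wind$date <- as.Date(substr(wind$time, 1, 10))

# Mean wind speed per day; the formula interface drops rows with NA by default
daily <- aggregate(wind_spd ~ date, data = wind, FUN = mean)

print(daily)
```

With real data, plot(daily$date, daily$wind_spd, type = "l") would then draw one point per day against a date axis.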

18.6 Summary

  • Open data greatly assist in the data acquisition step in research
  • Finding open data is still hard
  • R, via rOpenSci, has a lot of packages for accessing open data already available to you