18 Open Data in R and ROpenSci
This lesson was created by Bryce Mecum and Matt Jones.
18.1 Learning Objectives
In this lesson, you will:
- Understand what open data is and why/how its useful
- Be aware of the open data ecosystem in R
- Become familiar with a few packages rOpenSci provides
18.2 Introduction
Getting data is a critical step in most research, yet it can sometimes be one of the most difficult and time-consuming steps. This is especially true in synthesis research, which may incorporate hundreds of thousands of datasets in the analysis.
The first report of the Open Research Data Task Force has found that two of the greatest challenges to effectively using open research data are that: even when it is notionally accessible researchers often simply cannot find that data, and if they do find it they cannot use it because of frustrating format variabilities and other compatibility issues.
From: http://www2.warwick.ac.uk/newsandevents/pressreleases/task_force_finds/
18.3 Open data
Data can come from many sources. On a continuum from least good to most good, we might have:
- Data on a researcher’s hard drive
- Data on institution website or FTP server
- Data on some sort of portal behind a wall of some sort (e.g., accounts)
- Data in an open repository (no API)
- Data in an open repository (w/ public API)
A really great list of R packages for getting at open data can be found here:
So what is open data? Open data is data that are:
- Properly licensed for re-use
- Accessible w/o gates (e.g., paywall, login)
- Use open formats (formats you can work with)
18.4 What is rOpenSci?
From https://ropensci.org/:
At rOpenSci we are creating packages that allow access to data repositories through the R statistical programming environment that is already a familiar part of the workflow of many scientists.
Package categories:
- Data Publication
- Data Access
- Literature
- Altmetrics
- Scalable & Reproducible Computing
- Databases
- Data Vizualization
- Image Processing
- Data Tools
- Taxonomy
- HTTP tools
- Geospatial
- Data Analysis
Full list of packages: https://ropensci.org/packages/ Many of these are on CRAN and can be installed via install.packages()
but some are not. rOpenSci addresses the issues raised in that top quote.
18.5 Overview of some of the interetsing packages rOpenSci provides
Let’s go through a couple of packages sponsored by rOpenSci to demonstrate the power of open data + APIs + R.
18.5.1 rfishbase
: R interface to the fishbase.org database
- Website: http://www.fishbase.org/search.php
- Package: https://github.com/ropensci/rfishbase
- The New RFishbase: https://cran.r-project.org/web/packages/rfishbase/vignettes/tutorial.html
install.packages("fansi") #string formatting package
## Error in install.packages : Updating loaded packages
install.packages("rfishbase")
##
## The downloaded binary packages are in
## /var/folders/xp/vcwlmnrj5gg646dbp2c1fkh00000gq/T//Rtmp0YSLih/downloaded_packages
library(rfishbase)
fish <- c("Oreochromis niloticus", "Salmo trutta")
fish <- validate_names(c("Oreochromis niloticus", "Salmo trutta"))
fish <- species_list(Genus = "Labroides")
fish
## [1] "Labroides dimidiatus" "Labroides bicolor"
## [3] "Labroides pectoralis" "Labroides phthirophagus"
## [5] "Labroides rubrolabiatus"
fish_common <- common_to_sci("trout")
fish_common
## # A tibble: 279 x 4
## Species ComName Language SpecCode
## <chr> <chr> <chr> <int>
## 1 Salmo obtusirostris Adriatic trout English 6210
## 2 Schizothorax richardsonii Alawan snowtrout English 8705
## 3 Schizopyge niger Alghad snowtrout English 24454
## 4 Salvelinus fontinalis American brook trout English 246
## 5 Salmo trutta Amu-Darya trout English 238
## 6 Oncorhynchus apache Apache Trout English 2687
## 7 Oncorhynchus apache Apache trout English 2687
## 8 Plectropomus areolatus Apricot trout English 6082
## 9 Salmo trutta Aral Sea Trout English 238
## 10 Salmo trutta Aral trout English 238
## # ... with 269 more rows
18.5.2 rnoaa
: R interface to many NOAA data APIs
Access data like:
- Air temps
- Sea ice extent
- Buoy data
- Tons more!
https://github.com/ropensci/rnoaa/tree/master/R
install.packages("rnoaa")
install.packages("ncdf4")
library(rnoaa)
# Go here: http://www.ndbc.noaa.gov/
# Find a station ID, like https://www.ndbc.noaa.gov/station_page.php?station=46080
buoy(dataset = 'cwind', buoyid = 46080, year = 2016, datatype = "c")
bd <- buoy(dataset = 'cwind', buoyid = 46080, year = 2016, datatype = "c")
plot(bd$data$wind_spd)
18.6 Summary
- Open data greatly assist in the data aquisition step in research
- Finding open data is still hard
- R, via rOpenSci, has a lot of packages for accessing open data already available to you