class: center, middle, inverse, title-slide # Navigating R with the tidyverse and Rmarkdown ### Sean Hardison and Dr. Max Castorani
Advanced Ecological Data Analysis
August 27th, 2019 --- # Where it all begins: your data **Help your future self and collaborators succeed with simple guidelines for putting data into spreadsheets<sup>1</sup>** **Be consistent** * "Whatever you do, do it consistently" * Use consistent sheet layouts, file names, variable names, missing values, and note phrases * Allows for efficient analyses, batch processing, and reproducibility .footnote[ Broman, Karl W., and Kara H. Woo. "Data organization in spreadsheets." The American Statistician 72.1 (2018): 2-10. ] -- **Choose good names** * Use hyphens or underscores for spaces * Avoid special characters ($, @, %, &, *, (, ), !, /,., etc) * Keep names concise and meaningful * e.g. A cell with "BotSal (PSU)" could be split into "depth", "variable", and "units" columns --- # Where it all begins: your data **Use a single datetime format** * Typically ISO 8601 standard is easiest: "YYYY-MM-DD" * Pick one and stick with it .footnote[ Broman, Karl W., and Kara H. Woo. "Data organization in spreadsheets." The American Statistician 72.1 (2018): 2-10. ] -- **No empty cells** * Empty cells should either contain an indicator of missing values (e.g. `NA`) * You may think empty cells contain implied information, but this will not be obvious when you open the file in two years. * Let values repeat in spreadsheets: Excel ink is cheap -- **Data in spreadsheets should be rectangular** * **Do not** use more than one row for variable names * Avoid more than one "rectangle" of data per file. Create multiple files or sheets instead. --- # Where it all begins: your data **Spreadsheets hold your raw data** * Keep only raw or minimally processed data in spreadsheets * Data manipulations and analyses can be followed (and reproduced) with code ;) .footnote[ Broman, Karl W., and Kara H. Woo. "Data organization in spreadsheets." The American Statistician 72.1 (2018): 2-10. ] -- **Don't use cell highlighting to convey information** * Instead, add a column. For example, a column called "outlier" with values `TRUE` or `FALSE` to flag data -- **Make backups** * Google drive * Version control software like Git for small data sets -- **Save spreadsheets as CSV files** * Nonproprietary, lightweight, easy to handle with programming languages --- # One more thing: reproducibility * With minimal effort anyone should be able to run your R script and reproduce your results -- #### Good practices for reproducibility * Each data analysis project should have a unique directory structure * Keep it simple: `/R`, `/data`, `/images`, ... * R project files (`.Rproj`) + the `here` R package * Use Rmarkdown (we'll get to it) * Use version control software (we'll get to it) --- # The Tidyverse >"*An opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.*" * Includes `dplyr`, `ggplot2`, `stringr`, `lubridate`... * Developed by Hadley Wickham <br /> -- ## Why use the tidyverse? * Unified framework for data cleaning, analysis, and visualization * A more intuitive R syntax that "reads" * Helps others understand your work * Helps **you** understand your work Work smarter, not harder! .footnote[ [https://r4ds.had.co.nz/](R for Data Science) </br> [https://github.com/tidyverse](Explore all tidyverse packages) ] --- # Set up for today (I) ## Get the Virginia Coast Reserve End of Year Biomass data 1. Download and unzip the repository here: [https://github.com/EVSC-5040/lecture]() <img src="images/get_data.png" width="80%" style="display: block; margin: auto;" /> .footnote[ [Learn more about the VCR End of Year Biomass Data Set](http://www.vcrlter.virginia.edu/cgi-bin/showDataset.cgi?docid=knb-lter-vcr.167) ] --- # Set up for today (II) 1. Open the file `evsc5040.Rproj` (defaults to RStudio) 2. Install the `here` R package `install.packages('here')` 3. Open a new .R script and load the data set ```r # load packages library(tidyverse) library(here) # import data (using the readr and here packages) eoyb <- read_csv(here("data","EOYB_data.csv")) ``` --- # Examples ## Filtering with `dplyr::filter()` `filter(data.frame, row_condition1, row_condition2,...)` Find observations where dry mass (`liveMass`) is greater than 6 g/0.625 m<sup>2</sup> quadrat ```r dplyr::filter(eoyb, liveMass > 6) ``` Return *Spartina alterniflora* observations **and** where `liveMass > 6` ```r dplyr::filter(eoyb, liveMass > 6, speciesName == "Spartina alterniflora") ``` Return observations of either *S. alterniflora* **or** *Distichlis spicata* ```r # Using the %in% operator dplyr::filter(eoyb, speciesName %in% c("Spartina alterniflora", "Distichlis spicata")) # Using the | operator ('or') dplyr::filter(eoyb, (speciesName == "Distichlis spicata") | (speciesName == "Spartina alterniflora")) ``` --- # Examples ## Selecting columns with `dplyr::select()` Select a single column ```r dplyr::select(eoyb, speciesName) ``` Select multiple columns ```r dplyr::select(eoyb, speciesName, liveMass) ``` Select a range of columns ```r dplyr::select(eoyb, EOYBYear:liveMass) ``` Drop a column ```r dplyr::select(eoyb, -collectDate) ``` --- # Examples ## Linking your workflows with pipe operators (` %>% `) Filter on condition and then select a range of columns in one step using piping ```r eoyb_spartina <- eoyb %>% dplyr::filter(speciesName == "Spartina alterniflora") %>% dplyr::select(-collectDate, -sortDate) ``` ## Use `dplyr::group_by()` with `dplyr::summarise()` to find summary statistics In 2007, what was the mean amount of live *S. alterniflora* biomass within each location (`locationName`)? ```r eoyb_spartina %>% filter(EOYBYear == 2007) %>% group_by(locationName) %>% dplyr::summarise(mean_biomass = mean(liveMass, na.rm = T)) ``` What about across years and locations? ```r eoyb_spartina %>% group_by(EOYBYear, locationName) %>% dplyr::summarise(mean_biomass = mean(liveMass, na.rm = T)) ``` --- # Examples ## Adding columns with `dplyr::mutate()` Create a new column showing the absolute difference of `liveMass` and `deadMass` ```r eoyb_spartina %>% mutate(mass_dif = abs(liveMass - deadMass)) ``` Create a new column where row values are conditional (one) ```r eoyb_spartina %>% mutate(live_below_dead = if_else(liveMass < deadMass, "More dead" , "More live" )) ``` Create or manipulate existing row values ```r # unique(eoyb$transect) eoyb %>% mutate(Transect = if_else(Transect %in% c("Hog Island North","Hog Island South"), "Hog Island" ,Transect)) ``` --- # A brief introduction to `stringr` * All functions start with `str_` ## Examples Detect a pattern in a string ```r string <- "Welcome to EVSC5040! Make your life easier with the tidyverse" str_detect(string, "tidyverse") ``` ``` ## [1] TRUE ``` Replace a specific pattern ```r #replace "life" with "experience" str_replace(string, "life", "experience") ``` ``` ## [1] "Welcome to EVSC5040! Make your experience easier with the tidyverse" ``` .footnote[ [Read more about stringr here](https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html) </br> [Troubleshoot your regex here](regex101.com) ] --- # More `stringr` Extract a string based on a regular expression ```r # Match all characters (.) 0 to an unlimited number of times (*) str_extract(string, ".*") ``` ``` ## [1] "Welcome to EVSC5040! Make your life easier with the tidyverse" ``` ```r # Match digits (\\d) 1 to any number of times (+) # \\d, \\w, \\b, ... represent specific character strings (e.g. a digit, a word, and space enclosing a word) str_extract(string, "\\d+") ``` ``` ## [1] "5040" ``` Data cleaning in `eoyb` with `mutate()` and `stringr` ```r eoyb_clean <- eoyb %>% mutate(speciesName = str_replace(speciesName, "^[A-Z].*\\s", paste0(str_sub(speciesName, start = 0, end = 1),". ") ) ) ``` .footnote[ [Read more about stringr here](https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html) </br> [Troubleshoot your regex here](regex101.com) ] --- # `ggplot2`: The Grammar of Graphics Inputs: * A ggplot function (`ggplot()`) containing a `data.frame` with aesthetics (`aes()`) mapped to variables (e.g. `x`, `y`, `color`,...) * A visual layer (`geom_line()`, `geom_point()`, ...) ```r library(ggplot2) ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) + geom_point() ``` <img src="20190827-intro-to-tidy_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" /> ```r # is the same as # ggplot() + # geom_point(data = mpg, aes(x = displ, y = hwy, color = class)) #is the same as # mpg %>% # ggplot() + # geom_point(aes(x = displ, y = hwy, color = class)) ``` --- # More `ggplot2` ```r hog_spartina <- eoyb_spartina %>% filter(marshName == "Hog Island South", EOYBYear == 2015) bxplt <- ggplot(data = hog_spartina, aes(x = locationName, y = liveMass)) + geom_boxplot() bxplt_pts <- ggplot(data = hog_spartina, aes(x = locationName, y = liveMass)) + geom_boxplot() + geom_point() bxplt_color <- ggplot(data = hog_spartina, aes(x = locationName, y = liveMass, color = locationName)) + geom_boxplot() bxplt_fill <- ggplot(data = hog_spartina, aes(x = locationName, y = liveMass, fill = locationName)) + geom_boxplot() ``` --- # More `ggplot2`: Facets ```r eoyb_spartina %>% group_by(EOYBYear, marshName, locationName) %>% dplyr::summarise(mean_live_mass = mean(liveMass, na.rm = T)) %>% ggplot() + geom_line(aes(x = EOYBYear, y = mean_live_mass, color = locationName, group = locationName)) + facet_wrap(.~marshName, ncol = 3) ``` --- # More `ggplot2`: Facets <img src="20190827-intro-to-tidy_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />