tidyr and dplyr for data cleaning

Part of being a scientist in a quantitative field is growing your ‘toolbox.’ In a broad sense, this covers two types of tools:

mathematical and statistical methods – the ways in which you approach the problem
software and programming languages – the tools you use to implement your methods

Over the next few posts, I want to talk about the software and tools that I use in my day-to-day workflow.

To clean data:

I think the ‘dplyr’ and ‘tidyr’ packages in R have started to infuse some joy into the data cleaning process for me. The simple use of piping, direct calls of variable names, easy execution of functions across ‘grouping variables’, and intuitive function names make these tools a “must-try” in my opinion.
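To make the piping point concrete, here is a small sketch comparing a nested call with its piped equivalent. It uses R's built-in mtcars dataset, which is my choice for illustration only (it ships with R, so the snippet runs as-is) and has nothing to do with the flu example below.

```r
library(dplyr)

# Nested form: has to be read from the inside out
nested <- summarise(
  group_by(filter(mtcars, mpg > 20), cyl),
  mean_hp = mean(hp)
)

# Piped form: reads top to bottom, one step per line,
# and columns are referred to by bare name (mpg, cyl, hp)
piped <- mtcars %>%
  filter(mpg > 20) %>%
  group_by(cyl) %>%
  summarise(mean_hp = mean(hp))
```

Both versions should produce the same table; the piped one is simply easier to read and to extend with another step.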

Say you have monthly flu case data from 2010 to 2012, where rows indicate months and you have one column per year.

month	year_2010	year_2011	year_2012
January	34	42	65
…

What if you want to convert your data into the long format?

month	year	cases
January	2010	34
January	2011	42
January	2012	65

This can be done in a single line of code: use gather to stack the three year columns into key-value pairs, then use substring to pull the four-digit year out of the old column names and convert it to an integer.

data.long <- data %>% gather("year", "cases", 2:4) %>% mutate(year = as.integer(substring(year, 6, 9)))
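If you want to run the reshape end to end, here is a minimal sketch with a toy data frame. The January row comes from the table above; the February row is invented purely so there is more than one month to work with. The sketch also shows pivot_longer, which newer versions of tidyr (1.1.0 or later for the names_transform argument) offer as the successor to gather; I am assuming you have such a version installed.

```r
library(dplyr)
library(tidyr)

# Toy wide-format data. January matches the example table;
# February is made up for illustration only.
data <- data.frame(
  month = c("January", "February"),
  year_2010 = c(34, 30),
  year_2011 = c(42, 38),
  year_2012 = c(65, 51),
  stringsAsFactors = FALSE
)

# The gather() approach: stack columns 2 to 4, then strip the
# "year_" prefix by taking characters 6 through 9 of the old names
data.long <- data %>%
  gather("year", "cases", 2:4) %>%
  mutate(year = as.integer(substring(year, 6, 9)))

# An equivalent reshape with pivot_longer(), which handles the
# prefix removal and integer conversion in the same call
data.long2 <- data %>%
  pivot_longer(
    cols = starts_with("year_"),
    names_to = "year",
    names_prefix = "year_",
    names_transform = list(year = as.integer),
    values_to = "cases"
  )
```

The two results contain the same rows but in a different order: gather stacks one year column at a time, while pivot_longer keeps each month's rows together, as in the long table shown earlier.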

What if you wanted to sum cases by month across all three years? Again, this can be done in a single line of code!

case.sum <- data.long %>% group_by(month) %>% summarise(cases = sum(cases))
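As a self-contained sketch of the grouped summary, the snippet below builds a small long-format data frame directly and then sums it two ways. As before, the January values come from the example table and the February values are invented for illustration; year.sum is a hypothetical extra name I added to show that switching the grouping variable is a one-word change.

```r
library(dplyr)

# Toy long-format data. January matches the example table;
# February is made up for illustration only.
data.long <- data.frame(
  month = rep(c("January", "February"), each = 3),
  year = rep(2010:2012, times = 2),
  cases = c(34, 42, 65, 30, 38, 51),
  stringsAsFactors = FALSE
)

# Total cases per month across all three years
case.sum <- data.long %>%
  group_by(month) %>%
  summarise(cases = sum(cases))

# Total cases per year across all months: only the grouping variable changes
year.sum <- data.long %>%
  group_by(year) %>%
  summarise(cases = sum(cases))
```

For the January row, the total is 34 + 42 + 65 = 141.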

These are just a few example cases to start. There are a number of resources for learning these tools:

RStudio data wrangling “cheat sheet”: I keep a printed copy of this on my desk for ready reference. Warning: the packages have been updated slightly since this was put out, and it’s not comprehensive, but it’s a wonderful example of a visual learning aid.
RStudio introduction: This vignette discusses several key functions through the use of examples.

More to come!

Published by eclee25
