Part of being a scientist in a quantitative field is growing your ‘toolbox.’ In a broad sense, this covers two types of tools:
- mathematical and statistical methods – the ways in which you approach the problem
- software and programming languages – the tools you use to implement your methods
Over the next few posts, I want to spend a little time talking about the software and tools that I use in my regular day-to-day workflow.
To clean data:
I think the ‘dplyr’ and ‘tidyr’ packages in R have started to infuse some joy into the data cleaning process for me. The simple use of piping, direct calls of variable names, easy execution of functions across ‘grouping variables’, and intuitive function names make these tools a “must-try” in my opinion.
Say you have monthly flu case data from 2010 to 2012, where rows indicate months and you have one column per year.
| month | year_2010 | year_2011 | year_2012 |
| January | 34 | 42 | 65 |
| … |
What if you want to convert your data into the long format?
| month | year | cases |
| January | 2010 | 34 |
| January | 2011 | 42 |
| January | 2012 | 65 |
This can be done in a single line of code: gather the three year columns into a ‘year’ column and a ‘cases’ column, then use substring to pull the four-digit year out of the old column names and convert it to an integer.
data.long <- data %>% gather("year", "cases", 2:4) %>% mutate(year = as.integer(substring(year, 6, 9)))
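If you want to try the reshape yourself, here is a minimal, self-contained sketch. It assumes a recent version of tidyr, where pivot_longer() is the recommended successor to gather(). The January row comes from the table above; the February row is invented purely so there is more than one month to look at.

```r
library(dplyr)
library(tidyr)

# Toy wide-format data. January is taken from the post;
# February is made up for illustration only.
data <- data.frame(
  month     = c("January", "February"),
  year_2010 = c(34, 28),
  year_2011 = c(42, 39),
  year_2012 = c(65, 51)
)

# Stack the three year_* columns into year/cases pairs.
# names_prefix strips the leading "year_" so only the digits remain,
# which are then converted from character to integer.
data.long <- data %>%
  pivot_longer(
    cols         = starts_with("year_"),
    names_to     = "year",
    names_prefix = "year_",
    values_to    = "cases"
  ) %>%
  mutate(year = as.integer(year))

print(data.long)
```

Selecting the columns with starts_with("year_") rather than by position (2:4) is just one possible design choice; it keeps working if the column order changes.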
What if you wanted to sum cases by month across all three years? Again, this can be done in a single line of code!
case.sum <- data.long %>% group_by(month) %>% summarise(cases = sum(cases))
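To check the grouped sum on something concrete, here is a small self-contained sketch. The long-format data frame is built by hand: the January values are from the table above, and the February values are made up for illustration.

```r
library(dplyr)

# Long-format toy data. January values are from the post;
# February values are invented for illustration only.
data.long <- data.frame(
  month = rep(c("January", "February"), each = 3),
  year  = rep(2010:2012, times = 2),
  cases = c(34, 42, 65, 28, 39, 51)
)

# One row per month, with cases summed across all three years.
case.sum <- data.long %>%
  group_by(month) %>%
  summarise(cases = sum(cases))

print(case.sum)
```

For January, the result should equal 34 + 42 + 65 = 141 cases.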
These are just a few example cases to start. There are a number of resources for learning these tools:
- RStudio data wrangling “cheat sheet”: I have a printed copy of this on my desk for ready reference. Warning: the packages have been updated slightly since this was put out, and it’s not comprehensive, but it’s a wonderful example of a visual learning aid.
- RStudio introduction: This vignette discusses several key functions through the use of examples.
More to come!