Posts

Code

tidyr and dplyr for data cleaning

Part of being a scientist in a quantitative field is growing your ‘toolbox.’ In a broad sense, this covers two types of tools:

  • mathematical and statistical methods – the ways in which you approach the problem
  • software and programming languages – the tools you use to implement your methods

I want to spend a little bit of time talking about the software and tools that I use in my regular day-to-day workflow over the next few posts.

To clean data:

I think the ‘dplyr’ and ‘tidyr’ packages in R have started to infuse some joy into the data cleaning process for me. The simple use of piping, direct calls of variable names, easy execution of functions across ‘grouping variables’, and intuitive function names make these tools a “must-try” in my opinion.

Say you have monthly flu case data from 2010 to 2012, where rows indicate months and you have one column per year.

month year_2010 year_2011 year_2012
January 34 42 65
 …

What if you want to convert your data into the long format?

month year cases
January 2010 34
January 2011 42
January 2012 65

This can be done in a single line of code: gather the three year columns into a “year” column and a “cases” column, then extract the year as an integer from the old column names.

data.long <- data %>% gather("year", "cases", 2:4) %>% mutate(year = as.integer(substring(year, 6, 9)))
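For readers on tidyr 1.0.0 or later, here is a minimal sketch of the same reshaping with pivot_longer(), which the tidyr documentation now recommends over gather() for new code. The January row comes from the table above; the February row is a made-up placeholder added only so the example has more than one month.

```r
library(dplyr)
library(tidyr)

# Wide-format toy data. The January values are from the post;
# the February values are hypothetical placeholders.
data <- data.frame(
  month = c("January", "February"),
  year_2010 = c(34, 20),
  year_2011 = c(42, 25),
  year_2012 = c(65, 30)
)

data.long <- data %>%
  pivot_longer(
    cols = starts_with("year_"),  # the three year columns
    names_to = "year",            # old column names go here
    names_prefix = "year_",       # strip the prefix, leaving "2010" etc.
    values_to = "cases"           # cell values go here
  ) %>%
  mutate(year = as.integer(year))

print(data.long)
```

Stripping the prefix with names_prefix replaces the substring() call, so the code no longer depends on the year starting at character 6 of the column name.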

What if you wanted to sum cases by month across all three years? Again, this can be done in a single line of code!

case.sum <- data.long %>% group_by(month) %>% summarise(cases = sum(cases))
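To try the grouped summary without any other setup, here is a small self-contained sketch. It builds the long-format data by hand; the January values are the ones shown above, and the February values are made-up placeholders. The year.sum object is an extra illustration of grouping by a different variable and is not part of the original example.

```r
library(dplyr)

# Long-format toy data. The January values are from the post;
# the February values are hypothetical placeholders.
data.long <- data.frame(
  month = rep(c("January", "February"), each = 3),
  year  = rep(2010:2012, times = 2),
  cases = c(34, 42, 65, 20, 25, 30)
)

# Total cases per month, summed across the three years.
case.sum <- data.long %>%
  group_by(month) %>%
  summarise(cases = sum(cases))

# Same pattern with a different grouping variable: total cases per year.
year.sum <- data.long %>%
  group_by(year) %>%
  summarise(cases = sum(cases))

print(case.sum)
print(year.sum)
```

With these numbers, the January total is 34 + 42 + 65 = 141. Only the argument to group_by() changes between the two summaries, which is what makes this pattern convenient for exploring data along several dimensions.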


These are just a few example cases to start. There are a number of resources for learning these tools:

  • RStudio data wrangling “cheat sheet”: I have a printed copy of this on my desk for ready reference. Warning: the packages have been updated slightly since this was put out, and it’s not comprehensive, but it’s a wonderful example of a visual learning aid.
  • RStudio introduction: This vignette discusses several key functions through the use of examples.

More to come!

Current Events

The role of scientists

Today in lab meeting, we took some time to talk about the ongoing Ebola virus disease (EVD) outbreak in West Africa and U.S. media coverage of the EVD cases in the United States. One issue that we discussed was the role of the scientific community in communicating accurate knowledge about the disease. I’ve seen news articles that spread misinformation about the following two questions (my responses follow each question):

  • Can EVD be transmitted like an airborne respiratory virus (e.g., flu)? No, not that we’re aware of. Contact with the bodily fluids of an infected patient is necessary.
  • Can EVD be transmitted by an individual who is not yet symptomatic (i.e., already infected but not yet showing symptoms of that infection)? No, not that we’re aware of. We do not believe that they are shedding virus at this point in the course of disease.

I have a few questions of my own:

  • Why does EVD capture the imagination of the American news media in the manner of a panic-induced frenzy?
  • How can scientists inject nuance into the popular discourse about this virus and this disease?

I want to be cautious and responsible when I respond to questions about the way EVD is transmitted, but what if caution (see above: “No, not that we’re aware of”) incites unnecessary panic? Is it better to appear authoritative and neglect to include caveats to a statement? (“You can only catch the disease if you’re touching bodily fluids!”) I’m not sure which solution improves the situation. Is it the role of scientists to state facts (“We don’t know everything about EVD, but we believe it may only be transmitted through direct contact with the bodily fluids of an infected individual.”) or to divert the capricious attention of individuals to problems that are more likely to impact them (“You’re more likely to get flu than Ebola. Get your flu shot!”)? Another way to consider the role of scientists: Why is it important to inform the public, especially if their risk is low? Shouldn’t we spend more time educating health-care professionals, who do have higher risks of contracting the virus?

I don’t know the best way to consider these questions, but I think that one role for scientists is to distill information from the original scientific literature. Studies are good at reporting their findings and obscuring their limitations. For those of us in the infectious disease field, we can proactively contribute to public knowledge about EVD through open discussion of existing knowledge. We can voice our opinions about a study’s validity and limitations while highlighting the utility of its findings.

For now, I’ll leave with a short compilation of references and links about the current outbreak, in no particular order.

  • Bellan, et al. (2014). Ebola control: effect of asymptomatic infection and acquired immunity. Lancet, Early Online Publication, October 16. doi: 10.1016/S0140-6736(14)61839-0
  • WHO Ebola Response Team (2014). Ebola Virus Disease in West Africa — The First 9 Months of the Epidemic and Forward Projections. New England Journal of Medicine 371, 1481-1495. doi: 10.1056/NEJMoa1411100
  • Gire, et al. (2014). Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science 345 (6202), 1369-1372. doi: 10.1126/science.1259657
  • Towers, et al. (2014). Temporal Variations in the Effective Reproduction Number of the 2014 West Africa Ebola Outbreak. PLOS Currents Outbreaks, Sep 18. Edition 1. doi: 10.1371/currents.outbreaks.9e4c4294ec8ce1adad283172b16bc908
  • Fisman, et al. (2014). Early Epidemic Dynamics of the West African 2014 Ebola Outbreak: Estimates Derived with a Simple Two-Parameter Model. PLOS Currents Outbreaks, Sep 8. Edition 1. doi: 10.1371/currents.outbreaks.89c0d3783f36958d96ebbae97348d571
  • Althaus CL. (2014). Estimating the Reproduction Number of Ebola Virus (EBOV) During the 2014 Outbreak in West Africa. PLOS Currents Outbreaks, Sep 2. Edition 1. doi: 10.1371/currents.outbreaks.91afb5e0f279e7f29e7056095255b288

Surveillance data and projections:

Thoughts my own.

News

echo ‘Hello world!’

I am a graduate student in a program called ‘Global Infectious Disease,’ but what does that really mean? Some of my colleagues work in wet labs studying the pathogenesis of, and host immune response to, disease-causing viruses, bacteria, and parasites in humans. Others, like me, study the epidemiology and public health implications of different diseases through statistical and population-based approaches. All of us have an interest in interdisciplinary applications of our research.

Right now, I’m interested in examining the disease dynamics of influenza, a common and seasonal disease with far-reaching consequences. While I enjoy delving into the data, I don’t want to lose sight of the bigger picture — that is, the reason I care about infectious diseases in the first place. I want to inform public health and policy decision makers about the important infectious disease issues. I want to develop expertise in both mathematical biology for my research and science communication for the public. I want to be a filter that distinguishes scientific fact from hearsay, that can explain not only ‘what’, but ‘why’ and ‘how’ when it’s needed.

This blog is a first step. I want this blog to become a space for open discussion on news and issues related to infectious diseases and the use of quantitative methods in disease ecology. Along the way, I hope to use this as a means to develop and refine my scientific and non-scientific writing voice.

Join me on this journey!