I came across this R package on GitHub, and it made me so excited that I decided to write a post about it. It’s a compilation by Karl Broman of various R functions that he’s found helpful to write throughout the years.
Wouldn’t it be great if incoming graduate students in Biostatistics/Statistics were taught to create a personal repository of functions like this? Not only is it a great way to learn how to write an R package, but it also encourages good coding techniques for newer students (since it encourages them to write separate functions with documentation). It also allows for easy reprodicibility and collaboration both within the school and with the broader community. Case in point — I wanted to use one of Karl’s functions (which I found via his blog… which I found via Twitter), and all I had to do was run:
(Note that install_github is a function in the devtools package. I would link to the GitHub page for that package but somehow that seems circular…)
For whatever reason, when I think of R packages, I think of big, unified projects with a specified scientific aim. This was a great reminder that R packages exist solely for making it easier to distribute code for any purpose. Distributing tips and tricks is certainly a worthy purpose!
The advantage about writing a blog post about the tools you wish that you’d used throughout grad school is that, well, it makes you check them out. I went through the ProjectTemplate tutorial, and I’m hooked. Here’s the advantages as I see them:
- Routine is your friend. This could really go for everything in your life. Small decisions contribue to decision fatigue, even if it’s something as simple as where to put a file. By automating as much as possible, you’re allowing yourself to save your finite willpower for real work instead of grunt work.
- It’s easier to start somewhere and then customize, rather than start from the ground up. After four years in grad school, I have a system that I’ve hacked together for how to organize my analyses, but I would have rather not put the energy into creating the system in the first place. Designing a good system takes up a surprising amount of brain space, whereas modifying one takes much less. And since the author of ProjectTemplate seems to know what he’s doing, I doubt I’ll modify much.
- Reproducibility should be as easy as possible. The way it works, ProjectTemplate makes it very easy to include (but not re-run) the code that you have for preprocessing the data, or other steps that you might only perform once during an analysis. And since reproducibility is such an important aspect of the scientific process, it should be as easy as possible.
- Finding things should also be as easy as possible. This is quite similar to reproducibility, but on the individual level. I go back to old analyses all the time to borrow code, which can be extremely frustrating to me since some of my older analyses aren’t well organized (see #2). So it’s nice that you’ll know exactly where you placed something, because you have a uniform system in place.
Just as an aside, I get the impression from the computer scientists I’ve talked to that they don’t necessarily learn “how to code” in coursework, either, but are also expected to develop a system on their own. This I don’t understand, and perhaps when schooling catches up to the computer era we’ll see a change. For example, in high school I learned the five-paragraph format for writing essays, even though very few professional essayists use the format in publications. But it’s still a solid foundation for expression, which you can stray from as you become more confident in your abilities and command of the process. I suppose this argument requires that coding be taught in high school, but that’s another thing I’d love to see. One day!