Writing an R package from scratch

As I have worked on various projects at Etsy, I have accumulated a suite of functions that help me quickly produce tables and charts that I find useful. Because of the nature of iterative development, it often happens that I reuse the functions many times, mostly through the shameful method of copying the functions into the project directory. I have been a fan of the idea of personal R packages for a while, but it always seemed like A Project That I Should Do Someday and someday never came. Until…

Etsy has an amazing week called “hack week” where we all get the opportunity to work on fun projects instead of our regular jobs. I sat down yesterday as part of Etsy’s hack week and decided “I am finally going to make that package I keep saying I am going to make.” It took me such little time that I was hit with that familiar feeling of the joy of optimization combined with the regret of past inefficiencies (joygret?). I wish I could go back in time and create the package the first moment I thought about it, and then use all the saved time to watch cat videos because that really would have been more productive.

This tutorial is not about making a beautiful, perfect R package. This tutorial is about creating a bare-minimum R package so that you don’t have to keep thinking to yourself, “I really should just make an R package with these functions so I don’t have to keep copy/pasting them like a goddamn luddite.” Seriously, it doesn’t have to be about sharing your code (although that is an added benefit!). It is about saving yourself time. (n.b. this is my attitude about all reproducibility.)

(For more details, I recommend this chapter in Hadley Wickham’s Advanced R Programming book.)

Step 0: Packages you will need
The packages you will need to create a package are devtools and roxygen2. I am having you download the development version of the roxygen2 package.

install.packages("devtools")
library("devtools")
devtools::install_github("klutometis/roxygen")
library(roxygen2)

Step 1: Create your package directory
You are going to create a directory with the bare minimum folders of R packages. I am going to make a cat-themed package as an illustration.

setwd("parent_directory")
create("cats")

If you look in your parent directory, you will now have a folder called cats, and in it you will have two folders and one file called DESCRIPTION.

You should edit the DESCRIPTION file to include all of your contact information, etc.

Step 2: Add functions
If you’re reading this, you probably have functions that you’ve been meaning to create a package for. Copy those into your R folder. If you don’t, may I suggest something along the lines of:

cat_function <- function(love=TRUE){
    if(love==TRUE){
        print("I love cats!")
    }
    else {
        print("I am not a cool person.")
    }
}

Save this as a cat_function.R to your R directory.

(cats-package.r is auto-generated when you create the package.)

Step 3: Add documentation
This always seemed like the most intimidating step to me. I’m here to tell you — it’s super quick. The package roxygen2 that makes everything amazing and simple. The way it works is that you add special comments to the beginning of each function, that will later be compiled into the correct format for package documentation. The details can be found in the roxygen2 documentation — I will just provide an example for our cat function.

The comments you need to add at the beginning of the cat function are, for example, as follows:

#' A Cat Function
#'
#' This function allows you to express your love of cats.
#' @param love Do you love cats? Defaults to TRUE.
#' @keywords cats
#' @export
#' @examples
#' cat_function()

cat_function <- function(love=TRUE){
    if(love==TRUE){
        print("I love cats!")
    }
    else {
        print("I am not a cool person.")
    }
}

I’m personally a fan of creating a new file for each function, but if you’d rather you can simply create new functions sequentially in one file — just make sure to add the documentation comments before each function.

Step 4: Process your documentation
Now you need to create the documentation from your annotations earlier. You’ve already done the “hard” work in Step 3. Step 4 is as easy doing this:

setwd("./cats")
document()

This automatically adds in the .Rd files to the man directory, and adds a NAMESPACE file to the main directory. You can read up more about these, but in terms of steps you need to take, you really don’t have to do anything further.

(Yes I know my icons are inconsistent. Yes I tried to fix that.)

Step 5: Install!
Now it is as simple as installing the package! You need to run this from the parent working directory that contains the cats folder.

setwd("..")
install("cats")

Now you have a real, live, functioning R package. For example, try typing ?cat_function. You should see the standard help page pop up!

(Bonus) Step 6: Make the package a GitHub repo
This isn’t a post about learning to use git and GitHub — for that I recommend Karl Broman’s Git/GitHub Guide. The benefit, however, to putting your package onto GitHub is that you can use the devtools install_github() function to install your new package directly from the GitHub page.

install_github('cats','github_username')

Step 7-infinity: Iterate
This is where the benefit of having the package pulled together really helps. You can flesh out the documentation as you use and share the package. You can add new functions the moment you write them, rather than waiting to see if you’ll reuse them. You can divide up the functions into new packages. The possibilities are endless!

Additional pontifications: If I have learned anything from my (amazing and eye-opening) first year at Etsy, it’s that the best products are built in small steps, not by waiting for a perfect final product to be created. This concept is called the minimum viable product — it’s best to get a project started and improve it through iteration. R packages can seem like a big, intimidating feat, and they really shouldn’t be. The minimum viable R package is a package with just one function!

Additional side-notes: I learned basically all of these tricks at the rOpenSci hackathon. My academic sister Alyssa wrote a blog post describing how great it was. Hadley Wickham gets full credit for envisioning that R packages should be the easiest way to share code, and making functions/resources that make it so easy to do so.

Personal R Packages

I came across this R package on GitHub, and it made me so excited that I decided to write a post about it. It’s a compilation by Karl Broman of various R functions that he’s found helpful to write throughout the years.

Wouldn’t it be great if incoming graduate students in Biostatistics/Statistics were taught to create a personal repository of functions like this? Not only is it a great way to learn how to write an R package, but it also encourages good coding techniques for newer students (since it encourages them to write separate functions with documentation). It also allows for easy reprodicibility and collaboration both within the school and with the broader community. Case in point — I wanted to use one of Karl’s functions (which I found via his blog… which I found via Twitter), and all I had to do was run:

install_github('broman','kbroman')
library('broman')

(Note that install_github is a function in the devtools package. I would link to the GitHub page for that package but somehow that seems circular…)

For whatever reason, when I think of R packages, I think of big, unified projects with a specified scientific aim. This was a great reminder that R packages exist solely for making it easier to distribute code for any purpose. Distributing tips and tricks is certainly a worthy purpose!

About that p-value article…

Last night on twitter there was a bit of a firestorm over this New York Times snippet about p-values (here is my favorite twitter-snark response). While the article has a surprising number of controversial sentences for only 180 words, the most offending sentence is:

By convention, a p-value higher than 0.05 usually indicates that the results of the study, however good or bad, were probably due only to chance.

One problem with this sentence is that it commits a statistical cardinal sin by stating the equivalent of “the null hypothesis is probably true.” The correct interpretation for a p-value greater than 0.05 is “we cannot reject the null hypothesis” which can mean many things (for example, we did not collect enough data).

Another problem I have with the sentence is that the phrase “however good or bad” is incredibly misleading — it’s like saying that even if you see a fantastically big result with low variance, you still might call it “due to chance.” The idea of a p-value is that it’s a way of defining good or bad. Even if there’s a “big” increase or decrease in an outcome, is it really meaningful if there’s bigger variance in that change? (No.)

I’d hate to be accused of being an armchair critic, so here is my attempt to convey the meaning/importance of p-values to a non-stat, non-math audience. I think the key to having a good discussion of this is to provide more philosophical background (and I think that’s why these discussion are always so difficult). Yes this is over three times the word count of the original NYTimes article, but fortunately the internet has no word limits.

(n.b. I don’t actually hate to be accused of being an armchair critic. Being an armchair critic is the best part of tweeting, by far.)

(Also fair warning — I am totally not opening the “should we even be using the p-value” can of worms.)

Putting a Value to ‘Real’ in Medical Research

by Hilary Parker

In the US judicial system, a person on trial is “innocent until proven guilty.” Similarly, in a clinical trial for a new drug, the drug must be assumed ineffective until “proven” effective. And just like in the courts, the definition of “proven” is fuzzy. Jurors must believe “beyond a reasonable doubt” that someone is guilty to convict. In medicine, the p-value is one attempt at summarizing whether or not a drug has been shown to be effective “beyond a reasonable doubt.” Medicine is set up like the court system for good reason — we want to avoid claiming that ineffective drugs are effective, just like we want to avoid locking up innocent people.

To understand whether or not a drug has been shown to be effective “beyond a reasonable doubt” we must understand the purpose of a clinical trial. A clinical trial provides an understanding of the possible improvements that a drug can cause in two ways: (1) it shows the average improvement that the drug gives, and (2) it accounts for the variance in that average improvement (which is determined by the number of patients in the trial as well as the true variation in the improvements that the drug causes).

Why are both (1) and (2) necessary? Think about it this way — if I were to tell you that the price between two books was $10, would you think that’s a big price difference, or a small one? If it were the difference in price between two paperback books, it’d be a pretty big difference since paperback books are usually under $20. However if it were the difference in price between two hardcover books, it’d be a smaller difference in prices, since hardcover books vary more in price, in part because they are more expensive. And if it were the difference in price between two ebook versions of printed books, that’d be a huge difference, since ebook prices (for printed books) are quite stable at the $9.99 mark.

We intuitively understand what a big difference is for different types of books because we’ve contextualized them — we’ve been looking at book prices for years, and understand the variation in the prices. With the effectiveness of a new drug in a clinical trial, however, we don’t have that context. Instead of looking at the price of a single book, we’re looking at the average improvement that a drug causes — but either way, the number is meaningless without knowing the variance. Therefore, we have to determine (1) the average improvement, and (2) the variance of the average improvement, and use both of these quantities to determine whether the drug causes a “good” enough average improvement in the trial to call the drug effective. The p-value is simply a way of summarizing this conclusion quickly.

So, to put things back in terms of the p-value: let’s say that someone reports that NewDrug is shown to increase healthiness points by 10 points on average, with a p-value of 0.01. The p-value provides some context for whether or not an average increase of 10 points in this trial is “good” enough for the drug to be called effective, and is calculated by looking at both the average improvement, and the variance in improvements for different patients in the trial (while also controlling for the number of people in the trial). The correct way to interpret a p-value of 0.01 is: “If in reality NewDrug is NOT effective, then the probability of seeing an average increase in healthiness points at least this big if we repeated this trial is 1% (0.01*100).” The convention is that that’s a low enough probability to say that NewDrug has been shown to be effective “beyond a reasonable doubt” since it is less than 5%. (Many, many statisticians loathe this cut-off, but it is the standard for now.)

A key thing to understand about the p-value in a clinical trial is that a p-value greater than 0.05 doesn’t “prove” that NewDrug is ineffective — it just means that NewDrug wasn’t shown to be effective in the clinical trial. We can all think of examples of court cases where the person was probably guilty, but was not convicted (OJ Simpson, anyone?). Similarly with the p-value, if a trial reports a p-value greater than 0.05, it doesn’t “prove” that the drug is ineffective. It just means that researchers failed to show that the drug was effective “beyond a reasonable doubt” in the trial. Perhaps the drug really is ineffective, or perhaps the researchers simply did not gather enough samples to make a convincing case.

edit: In case you’re doubting my credentials regarding p-values, I’ll just leave this right here…

More derby wife gushing

My derby wife Logistic Aggression (aka Hanna Wallach) posted a fantastic essay (via her blog) about things she’s learned from roller derby. Some points are pretty derby specific, and some are more broadly applicable. Points 4 and 5 – about “growth mindset” and “grit” – are especially worth a look for non-roller-girls, and apply directly to challenges in grad school. Kate Clancy, an Anthropology professor and roller girl in Illinois, has also written a series of awesome, insightful essays about the connection between derby and academics. I’ve always had a soft spot in my heart for discussions about the parallels between sports and real life. But there’s been something about doing such an intense organized sport outside of the traditional time to do so that has made it a much more enlightening experience.

Basically if you want to figure out who the really cool, insightful professors are, look for the roller girls! (No offense to the male professors of course – but there is male roller derby so no excuses!)

edit 3/13/13: Logistic Aggression’s essay has been featured in DerbyLife!

Editing papers using text-to-speech software

I spent a large chunk of today editing a paper that I’m submitting for publication this week, which has inspired me to share one of my favorite tricks for editing in general. Inspired by this ProfHacker article last year, I started using a text-to-speech tool to “read” my papers to me for the final look-through.

This has the huge advantage of being completely objective, slowly-paced, and able to pick up things like doubled words or missing modifiers. There are a ton of tools out there, but I am partial to the Announcify Chrome extension, which has a nice pause button and is easy to use. Plus I like cloud software way better than desktop software, and I sometimes use it to read online articles well.

My (extremely hacky) workflow for this is as follows. Note this works especially nicely if you have two screens.

Copy LaTeX text into a Google Doc, and publish the document to the web (under File).
Read the published document using Announcify, skipping through tables and long equations.
Edit document in Texmaker, pausing Announcify if necessary.

There was also an article about how you can use the built-in software in a Kindle to edit long documents, presumably while lounging on your couch sipping hot cocoa.

GradHacker

Via twitter I’ve found a new blog, GradHacker, that I’m already in love with. The archives are full of great articles that are refreshingly honest about the challenges many face in grad school, which is something that can be difficult to discuss in person since just about everyone is dealing with imposter syndrome. And there’s also a lot of good practical advice about productivity and tools.

This is of course another ___Hacker blog added to my RSS reader in addition to my beloved ProfHacker and LifeHacker.

How to be a successful PhD student

My derby wife ~~Logistic Aggression~~ Hanna Wallach, a Computer Science professor at UMass Amherst, published this guide on how to be a successful graduate student. It’s written for computer science students, but most of the advice applies to anyone. I think it’s a great compilation of many things that I either learned formally or informally during my PhD.

Love for ProjectTemplate

The advantage about writing a blog post about the tools you wish that you’d used throughout grad school is that, well, it makes you check them out. I went through the ProjectTemplate tutorial, and I’m hooked. Here’s the advantages as I see them:

Routine is your friend. This could really go for everything in your life. Small decisions contribue to decision fatigue, even if it’s something as simple as where to put a file. By automating as much as possible, you’re allowing yourself to save your finite willpower for real work instead of grunt work.
It’s easier to start somewhere and then customize, rather than start from the ground up. After four years in grad school, I have a system that I’ve hacked together for how to organize my analyses, but I would have rather not put the energy into creating the system in the first place. Designing a good system takes up a surprising amount of brain space, whereas modifying one takes much less. And since the author of ProjectTemplate seems to know what he’s doing, I doubt I’ll modify much.
Reproducibility should be as easy as possible. The way it works, ProjectTemplate makes it very easy to include (but not re-run) the code that you have for preprocessing the data, or other steps that you might only perform once during an analysis. And since reproducibility is such an important aspect of the scientific process, it should be as easy as possible.
Finding things should also be as easy as possible. This is quite similar to reproducibility, but on the individual level. I go back to old analyses all the time to borrow code, which can be extremely frustrating to me since some of my older analyses aren’t well organized (see #2). So it’s nice that you’ll know exactly where you placed something, because you have a uniform system in place.

Just as an aside, I get the impression from the computer scientists I’ve talked to that they don’t necessarily learn “how to code” in coursework, either, but are also expected to develop a system on their own. This I don’t understand, and perhaps when schooling catches up to the computer era we’ll see a change. For example, in high school I learned the five-paragraph format for writing essays, even though very few professional essayists use the format in publications. But it’s still a solid foundation for expression, which you can stray from as you become more confident in your abilities and command of the process. I suppose this argument requires that coding be taught in high school, but that’s another thing I’d love to see. One day!

The Setup (Part 1)

One of the more challenging things about beginning graduate school was learning what tools and software I needed in order to work efficiently. Unlike college where software requirements were laid out in front of me and everyone seemed to use the same tools, in graduate school the obscurity of the tools, as well as the number of options, seemed to multiply. Additionally since, like most Biostat students in my department, I came from a math background instead of a computing background, I had to drastically increase my computer literacy in a very short period of time. To help new students, a few of us more senior Biostat students have taken to presenting the tools we use in our departmental “Computing Club,” but a blog post really makes more sense. So without further ado, and in the style of one of my favorite blogs, I present…

The Setup (Part 1):

What hardware do you use?
I’m one of the last people in my department clinging to my PC. It’s absolutely true that Macs make it easier to get up and running with the common tools Biostatisticians need, but it’s also true that you can get all of those tools on Windows with a bit of initial work (and in my opinion you can even get more functionality and customization). I have a Lenovo ThinkPad X201 Tablet. This is my second tablet – I got the first because I wanted to take digital notes, which I did for all of grad school with some success. I ended up printing all of my notes out for studying, so I’m not sure it was more convenient than, say, scanning in my notes afterwards (a mammoth project I’m currently undertaking with all of my college coursework). However the tablet has been absolutely essential for teaching, and is very nice for annotating papers. I once bought a Mac that was not a tablet but then panicked and got this computer instead. It’s hard to take away functionality! I hook it up daily to an external monitor and use that huge Microsoft ergonomic keyboard and mouse. I’m actually not sure how so many people do not use external keyboards when they use computers so heavily. Perhaps I just have sensitive wrists.

And what software?
I do all of my research in R, which is really the academic norm for research statistics (and certainly the norm at Hopkins). I do all of my scripting in my beloved Notepad++, which becomes infinitely more awesome by using the little script NppToR. An amazing resource that Hopkins provides for it’s community is a high-powered, well backed-up computing cluster (so essentially all of my research is done on an extremely local cloud). To use the cluster you need to have an ssh client and an SCP protocol. Drastically oversimplified, what this means for me is 1) I have to have something I can open R with or run batch jobs on (the ssh client), and 2) I have to have somewhere you can drag my files. I use PuTTY for the ssh client and WinSCP for the SCP protocol. My workflow is that I open PuTTY to log onto the cluster and then log into R. Then I open WinSCP and open up my R scripts using Notepad++, and send the code to R by hitting F9 (yay NppToR!). If I’m running a batch job I open a second PuTTY window to submit those to. One last detail is that you must install and run Xming in order for graphs in R on the cluster to work. I think most students do more locally than I do, but I prefer doing as much on the cluster as possible. They always have the latest version of R and do a great job of backing things up. I find it’s less hassle once you get a good system down, and I like living on the cloud enough that that’s worth it to me. It also makes me less aware that I’m running Windows.

For writing I use LaTeX, again the academic norm. TeX is confusing the first time around, so here’s the crash course: TeX is a typesetting system, and LaTeX is the markup language. What this means is that you have to install TeX onto your computer, and then install a LaTeX editor. You “code” documents in a LaTeX editor and then compile them into PDFs, where they look pretty and professional and mathy. I use MiKTeX to install TeX, LEd as my LaTeX editor, and SumatraPDF as my PDF viewer. My workflow is that I open up LEd, then open up the PDF (using SumatraPDF) of the document I’m creating. Then whenever I compile my document, it automatically refreshes to the new version in Sumatra. You can also set it up so that if you double click a word in Sumatra, it automatically highlights it in LEd (tutorial). If you use Adobe this system won’t work because Adobe won’t refresh the document (instead you’ll get an error when you try to compile the code). Don’t use Adobe, basically.

edit: Might be converting to Texmaker as my LaTeX editor. It has spell-check and a built-in PDF Viewer, and took all of two minutes to set up. But I’d still recommend downloading Sumatra.

For presentations I still use PowerPoint (I hear you hatin’). That’s more the norm in genomics than in other biostatistics fields. We like it because we can easily share slides and put in pictures/graphs, and keep out too many equations.

I use Mendeley to organize my downloaded PDF papers. I looked for a PDF organizing solution for years before having this recommended to me, and it’s perfect. It syncs online, is cross-platform (with an iPad app), and most importantly it auto-generates bibtex files which are needed to create bibliographies within LaTeX documents. This means that creating bibtex files (complete with automatically generated citation key) is a drag-and-drop process, rather than a pasting-google-scholar-bibtex-results-into-text-editor process. I see this as a huge improvement.

On the extremely unlikely event that you use a tablet PC, I really like PDF Annotator, which you can get for free if you’re a Hopkins student. I use this software for teaching, grading and annotating papers. If I’m taking a bunch of notes the native Windows Journal software is quite nice and keeps file sizes much smaller than PDF Annotator.

What’s your dream setup? Or rather, what I wish I’d done differently…
There are a few places where I know I’m not efficient in my computing that I’d like to change, so I’ll list them here. You youngins can learn from my mistakes!

You might notice that I do my R coding on the cloud, but do my writing locally. This means that I can’t easily use Sweave, which is a really cool tool that allows you to LaTeX and run R code simultaneously (and is much applauded because it allows for easy reproducibility). Instead I have to import my data and graphs into my papers manually (the xtable package in R is essential to my life). This isn’t ideal and I’d like to change it, but the analyses I run usually take days to run, so Sweave loses a lot of it’s appeal (or at least the way I envisioned using Sweave). To ensure reproducibility I always publish my code on github. In fact I’m going to start migrating all of my research information (including project descriptions, links to papers, etc.) onto github.

I wish earlier in my career I had established 1) version control (using git), and 2) uniform project architecture (see for example ProjectTemplate). These are things that are discussed in the Hopkins computing coursework, but it’s unclear how many academics really use them. Github makes version control extremely accessible, but the disadvantage is that everything must be public unless you pay.

I wish I had discovered OneNote while taking courses because I’m sure I would have used and enjoyed it.

I also think standing desks are neat. One day…

Why “Part 1”?
Because Part 2 will be all of my non-academic software and hardware!

edit: I love The Setup and The Setup loves me! They featured me in their community section of the blog!

edit: I’ve compiled all of the tools I mention in this blog into a bitly bundle for your clicking and sharing convenience.