Sunsets in Google Calendar using R

I live (and work!) near one of the most beautiful vantage points for sunsets in possibly the entire US.

Sunset from the Brooklyn Heights Promenade

However almost every beautiful sunset I have seen from there has come from either 1) me walking out of work and noticing that the sky is bright pink, or 2) seeing someone post a sunset photo on Twitter (I know). Either way it ends with me practically sprinting to the Brooklyn Heights Promenade for the tail end of the sunset.

So, I decided to create some Google Calendar “appointments” for myself, using R and specifically the sunrise.set() function (thanks Carlos!). Because I’m a big user of Google Now, this means that–based on my location and travel time–my phone will buzz at me and tell me to leave for the Promenade in time to see the sunset.

Assuming you don’t live near me, you might want to customize your calendar to include the address of your own favorite vantage point for sunsets. So, I used this opportunity to create my personal R package and put this in as the first function. You can input your own address, timezone, etc. into the create_sunset_cal() function, and it will output a .CSV that meets Google’s requirements for importing a calendar. To get the function, just run the following:

library(devtools)
install_github(repo = 'hilaryparker/hilary')
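Once it's installed and loaded, a call might look something like the following. Note that I'm guessing at the argument names here purely for illustration, so check ?create_sunset_cal for the real interface:

```r
library(hilary)

# argument names below are hypothetical, for illustration only
create_sunset_cal(address  = "Brooklyn Heights Promenade, Brooklyn, NY",
                  timezone = "America/New_York")
```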

You can upload the .CSV directly into your Google calendar (just be careful as it will import a different event for every day, so if you do it mistakenly it will be a pain to remove!). I’ll give instructions for creating a new calendar just for the sunsets, so you can remove it whenever you want if your calendar looks too cluttered.

  1. Create a new calendar, called “Sunset”. If you want to share the calendar, make it Public.
  2. Under the “Other calendars” heading, click on “Import calendar”.
  3. Select the .CSV you created using the create_sunset_cal() function, making sure you select your newly-created “Sunset” calendar.
  4. Henceforth be notified about the travel time to the sunset!
  5. Enjoy the fruits of your labor.


Writing an R package from scratch

As I have worked on various projects at Etsy, I have accumulated a suite of functions that help me quickly produce tables and charts that I find useful. Because of the nature of iterative development, it often happens that I reuse the functions many times, mostly through the shameful method of copying the functions into the project directory. I have been a fan of the idea of personal R packages for a while, but it always seemed like A Project That I Should Do Someday and someday never came. Until…

Etsy has an amazing week called “hack week” where we all get the opportunity to work on fun projects instead of our regular jobs. I sat down yesterday as part of Etsy’s hack week and decided “I am finally going to make that package I keep saying I am going to make.” It took me such little time that I was hit with that familiar feeling of the joy of optimization combined with the regret of past inefficiencies (joygret?). I wish I could go back in time and create the package the first moment I thought about it, and then use all the saved time to watch cat videos because that really would have been more productive.

This tutorial is not about making a beautiful, perfect R package. This tutorial is about creating a bare-minimum R package so that you don’t have to keep thinking to yourself, “I really should just make an R package with these functions so I don’t have to keep copy/pasting them like a goddamn luddite.” Seriously, it doesn’t have to be about sharing your code (although that is an added benefit!). It is about saving yourself time. (n.b. this is my attitude about all reproducibility.)

(For more details, I recommend this chapter in Hadley Wickham’s Advanced R Programming book.)

Step 0: Packages you will need
The packages you will need to create a package are devtools and roxygen2. I am having you download the development version of the roxygen2 package.
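At the time, that setup looked something like this (the repo location for the development version of roxygen2 is my best recollection, so treat it as an assumption):

```r
install.packages("devtools")
library(devtools)

# development version of roxygen2 (repo location assumed)
install_github("klutometis/roxygen")
library(roxygen2)
```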


Step 1: Create your package directory
You are going to create a directory with the bare minimum folders of R packages. I am going to make a cat-themed package as an illustration.
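With devtools loaded, creating the skeleton is one call. A sketch, run from wherever you want the cats folder to land:

```r
library(devtools)
library(roxygen2)

setwd("parent_directory")  # wherever you want the package to live
create("cats")             # scaffolds the bare-minimum package structure
```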


If you look in your parent directory, you will now have a folder called cats, and in it you will have two folders and one file called DESCRIPTION.


You should edit the DESCRIPTION file to include all of your contact information, etc.

Step 2: Add functions
If you’re reading this, you probably have functions that you’ve been meaning to create a package for. Copy those into your R folder. If you don’t, may I suggest something along the lines of:

cat_function <- function(love=TRUE){
    if(love==TRUE){
        print("I love cats!")
    }
    else {
        print("I am not a cool person.")
    }
}
Save this as cat_function.R in your R directory.


(cats-package.r is auto-generated when you create the package.)

Step 3: Add documentation
This always seemed like the most intimidating step to me. I’m here to tell you — it’s super quick. The package roxygen2 makes everything amazing and simple. The way it works is that you add special comments to the beginning of each function, which will later be compiled into the correct format for package documentation. The details can be found in the roxygen2 documentation — I will just provide an example for our cat function.

The comments you need to add at the beginning of the cat function are, for example, as follows:

#' A Cat Function
#'
#' This function allows you to express your love of cats.
#' @param love Do you love cats? Defaults to TRUE.
#' @keywords cats
#' @export
#' @examples
#' cat_function()

cat_function <- function(love=TRUE){
    if(love==TRUE){
        print("I love cats!")
    }
    else {
        print("I am not a cool person.")
    }
}
I’m personally a fan of creating a new file for each function, but if you’d rather you can simply create new functions sequentially in one file — just make sure to add the documentation comments before each function.

Step 4: Process your documentation
Now you need to create the documentation from your annotations earlier. You’ve already done the “hard” work in Step 3. Step 4 is as easy as doing this:
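Concretely, the step is a sketch like this (using devtools, which wraps roxygen2):

```r
library(devtools)

setwd("./cats")  # move into the package directory
document()       # compiles the roxygen comments into man/*.Rd and NAMESPACE
```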


This automatically adds in the .Rd files to the man directory, and adds a NAMESPACE file to the main directory. You can read up more about these, but in terms of steps you need to take, you really don’t have to do anything further.



Step 5: Install!
Now it is as simple as installing the package! You need to run this from the parent working directory that contains the cats folder.
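As a sketch:

```r
library(devtools)

setwd("..")      # back up to the parent directory containing cats/
install("cats")  # installs the package into your library
```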


Now you have a real, live, functioning R package. For example, try typing ?cat_function. You should see the standard help page pop up!


(Bonus) Step 6: Make the package a GitHub repo
This isn’t a post about learning to use git and GitHub — for that I recommend Karl Broman’s Git/GitHub Guide. The benefit, however, to putting your package onto GitHub is that you can use the devtools install_github() function to install your new package directly from the GitHub page.
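The install-from-GitHub call looked something like this at the time (the two-argument form was the devtools convention then; newer versions take a single "username/repo" string):

```r
library(devtools)

# "github_username" is a placeholder for your own GitHub username
install_github("cats", "github_username")
```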


Step 7-infinity: Iterate
This is where the benefit of having the package pulled together really helps. You can flesh out the documentation as you use and share the package. You can add new functions the moment you write them, rather than waiting to see if you’ll reuse them. You can divide up the functions into new packages. The possibilities are endless!

Additional pontifications: If I have learned anything from my (amazing and eye-opening) first year at Etsy, it’s that the best products are built in small steps, not by waiting for a perfect final product to be created. This concept is called the minimum viable product — it’s best to get a project started and improve it through iteration. R packages can seem like a big, intimidating feat, and they really shouldn’t be. The minimum viable R package is a package with just one function!

Additional side-notes: I learned basically all of these tricks at the rOpenSci hackathon. My academic sister Alyssa wrote a blog post describing how great it was. Hadley Wickham gets full credit for envisioning that R packages should be the easiest way to share code, and making functions/resources that make it so easy to do so.

Personal R Packages

I came across this R package on GitHub, and it made me so excited that I decided to write a post about it. It’s a compilation by Karl Broman of various R functions that he’s found helpful to write throughout the years.

Wouldn’t it be great if incoming graduate students in Biostatistics/Statistics were taught to create a personal repository of functions like this? Not only is it a great way to learn how to write an R package, but it also encourages good coding techniques for newer students (since it encourages them to write separate functions with documentation). It also allows for easy reproducibility and collaboration both within the school and with the broader community. Case in point — I wanted to use one of Karl’s functions (which I found via his blog… which I found via Twitter), and all I had to do was run:
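What I ran was along these lines (the repo name is my understanding of where Karl's personal package lives, so treat it as an assumption):

```r
library(devtools)

# Karl Broman's personal package (repo name assumed)
install_github("broman", "kbroman")
library(broman)
```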


(Note that install_github is a function in the devtools package. I would link to the GitHub page for that package but somehow that seems circular…)

For whatever reason, when I think of R packages, I think of big, unified projects with a specified scientific aim. This was a great reminder that R packages exist solely for making it easier to distribute code for any purpose. Distributing tips and tricks is certainly a worthy purpose!

Hilary: the most poisoned baby name in US history

I’ve always had a special fondness for my name, which — according to Ryan Gosling in “Lars and the Real Girl” — is a scientific fact for most people (Ryan Gosling constitutes scientific proof in my book). Plus, the root word for Hilary is the Latin word “hilarius” meaning cheerful and merry, which is the same root word for “hilarious” and “exhilarating.” It’s a great name.

Several years ago I came across this blog post, which provides a cursory analysis for why “Hillary” is the most poisoned name of all time. The author is careful not to comment on the details of why “Hillary” may have been poisoned right around 1992, but I’ll go ahead and make the bold causal conclusion that it’s because that was the year that Bill Clinton was elected, and thus the year Hillary Clinton entered the public sphere and was generally reviled for not wanting to bake cookies or something like that. Note that this all happened when I was 7 years old, so I spent the formative years of 7-15 being called “Hillary Clinton” whenever I introduced myself. Luckily, I was a feisty feminist from a young age and rejoiced in the comparison (and life is not about being popular).

In the original post the author bemoans the lack of research assistants to perform his data extraction for a more complete analysis. Fortunately, in this era we have replaced human jobs with computers, and the data can be easily extracted using programming. This weekend I took the opportunity to learn how to scrape the social security data myself and do a more complete analysis of all of the names on record.

Is Hilary/Hillary really the most rapidly poisoned name in recorded American history? An analysis.

I will follow up this post with more details on how to perform web-scraping with R (for this I am infinitely indebted to my friend Mark — check out his storyboard project and be amazed!). For now, suffice it to say that I was able to collect from the social security website the data for every year between 1880 and 2011 for the 1000 most popular baby names. For each of the 1000 names in a given year, I collected the raw number of babies given that name, as well as the percentage of babies given that name, and the rank of that name. For girls, this resulted in 4110 total names.

In the original analysis, the author looked at the changed rank of “Hillary.” The ranks are interesting, but we have more finely-tuned data than that available from the SSA. The raw numbers of babies given a certain name are likewise interesting, but do not normalize for the population. Thus the percentage of babies given a certain name is the best measurement.

Looking at the absolute change in percentages is interesting, but would not tell the full story. A change of, say, 15% to 14% would be quite different and less drastic than a change from 2% to 1%, but the absolute change in percentage would measure those two things equally. Thus, I need a measure of the relative change in the percentages — that is, the percent change in percentages (confusing, I know). Fortunately the public health field has dealt with this problem for a long time, and has a measurement called the relative risk, where “risk” refers to the proportion of babies given a certain name. For example, let’s say the percentage of babies named “Jane” is 1% of the population in 1990, and 1.2% of the population in 1991. The relative risk of being named “Jane” in 1991 versus 1990 is 1.2 (that is, it’s (1.2/1)=1.2 times as probable, or (1.2-1)*100=20% more likely). In this case, however, I’m interested in instances where the percentage of children with a certain name decreases. The most sensible statistic in this case is the relative risk again, but thought of as a decrease. That is, if “Jane” was at 1.5% in 1990 and 1.3% in 1991, then the relative risk of being named “Jane” in 1991 compared to 1990 is (1.3/1.5)=0.87. That is, it is (1-0.87)*100=13% less likely that a baby will be named “Jane” in 1991 compared to 1990.
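The arithmetic from the “Jane” example can be sketched in a couple of lines of R:

```r
# relative risk: ratio of the naming percentage in one year to the year before
relative_risk <- function(pct_now, pct_before) pct_now / pct_before

relative_risk(1.2, 1.0)             # 1.2, i.e. 20% more likely
rr_down <- relative_risk(1.3, 1.5)  # about 0.87
(1 - rr_down) * 100                 # about 13% less likely
```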

(Note that I’m not doing any model fitting here because I’m not interested in any parameter estimates — I have my entire population! I’m just summarizing the data in a way that makes sense.)

So, for each of the 4110 names that I collected, I calculated the relative risk going from one year to the next, all the way from 1880 to 2011. I then pulled out the names with the biggest percent drops from one year to the next.


#6?? I’m sorry, but if I’m going to have one of the most rapidly poisoned names in US history, it best be #1. I didn’t come here to make friends, I came here to win. Furthermore, the names on this list seemed… peculiar to say the least. I decided to plot out the percentage of babies named each of the names to get a better idea of what was going on. (Click through to see the full-sized plot. Note that the y-axis is Percent, so 0.20 means 0.20%.)


These plots looked quite curious to me. While the names had very steep drop-offs, they also had very steep drop-ins as well.

This is where this project got deliriously fun. For each of the names that “dropped in” I did a little research on the name and the year. “Dewey” popped up in 1898 because of the Spanish-American War — people named their daughters after George Dewey. “Deneen” was one name of a duo with a one-hit wonder in 1968. “Katina” and “Catina” were wildly popular because in 1972, in the soap opera Where the Heart Is, a character named Katina is born. “Farrah” became popular in 1976 when Charlie’s Angels, starring Farrah Fawcett, debuted (notice that the name becomes popular again in 2009 when Farrah Fawcett died). “Renata” was hard to pin down — perhaps it was popular because of this opera singer who seemed to be on TV a lot in the late 1970s. “Infant” became a popular baby name in the late 1980s for reasons that completely defy my comprehension, and that are utterly un-Google-able. (Edit: someone pointed out on Facebook that it’s possible this is due to a change in coding conventions for unnamed babies. This would make more sense, but would also make me sad. Edit 2: See the comments for an explanation!)

I think we all know why “Iesha” became popular in 1989:

“Khadijah” was a character played by Queen Latifah in Living Single, and “Ashanti” was popular because of Ashanti, of course.

“Hilary”, though, was clearly different than these flash-in-the-pan names. The name was growing in popularity (albeit not monotonically) for years. So to remove all of the fad names from the list, I chose only the names that were in the top 1000 for over 20 years, and updated the graph (note that I changed the range on the y-axis).
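The fad filter can be sketched like this, using a toy stand-in for the scraped SSA data (the real data frame and its column names are assumptions here):

```r
# toy stand-in for the scraped SSA data: one row per name per year it
# appeared in the top 1000
ssa <- data.frame(
  name = rep(c("Hilary", "Deneen"), c(25, 2)),  # 25 years vs. a 2-year fad
  year = c(1970:1994, 1968:1969)
)

# count how many years each name appeared in the top 1000
years_in_top1000 <- table(ssa$name)

# keep only names that were in the top 1000 for more than 20 years
stable_names <- names(years_in_top1000)[years_in_top1000 > 20]
ssa_stable <- ssa[ssa$name %in% stable_names, ]
```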


I think it’s pretty safe to say that, among the names that were once stable and then had a sudden drop, “Hilary” is clearly the most poisoned. I am not paying too much attention to the names that had sharp drops in the late 1800s because the population was so much smaller then, and thus it was easier to drop percentage points without a large drop in raw numbers. I also did a parallel analysis for boys, and aside from fluctuations in the late 1890s/early 1900s, the only name that comes close to this rate of poisoning is Nakia, which became popular because of a short-lived TV show in the 1970s.

At this point you’re probably wondering where “Hillary” is. As it turns out, “Hillary” took two years to descend from the top, thus diluting out the relative risk for any one year (its highest single-year drop was 61% in 1994). If I examine slightly more names (now the top 39 most poisoned) and again filter for fad names, both “Hilary” and “Hillary” are on the plot, and clearly the most poisoned.


(The crazy line is for “Marian” and the spike is due to the fact that 1954 was a Catholic Marian year — if it weren’t an already popular name, it would have been filtered as a fad. And the “Christin” spike might very well be due to a computer glitch that truncated the name “Christina”! Amazing!!)

So, I can confidently say that, defining “poisoning” as the relative loss of popularity in a single year and controlling for fad names, “Hilary” is absolutely the most poisoned woman’s name in recorded history in the US.

Code for this project is available on GitHub.

(Personal aside: I will get sentimental for a moment, and mention that my mother was at Wellesley at the same time as Hillary Rodham. While she already knew that she wanted to name her future daughter “Hilary” at that point, when she saw Hillary speak at a student event, she thought, “THAT is what I want my daughter to be like.” Which was empirically the polar opposite of what the nation felt in 1992. But my mom was right and way ahead of her time.)

Update: This seems to be an analysis everyone is interested in. For perhaps the first time in internet history, Godwin’s Law is wholly appropriate.


Creating a random calendar with R

Sometimes this is my life. But it’s so satisfying when you write a program that saves you time! Here is an example.

The Problem: 

For several years at Hopkins I have been involved in teaching a large (500+ person) introductory Biostatistics class. This class usually has a team of 12-15 teaching assistants, who together staff the twice-daily office hours. TAs are generally assigned to “random” office-hours, with the intention that students in the class get a variety of view-points if, for example, they can only come to Tuesday afternoon office hours.

When I started the course, the random office-hours were assigned by our very dedicated administrative coordinator, who sent them out via email on a spreadsheet. Naturally, there were several individual TA conflicts, which would result in more emails, which would lead to version control problems and — ultimately — angry students at empty office hours. Additionally, assigning TAs to random time-slots for an 8-week class was a nightmare for the administrative assistant.

The Solution: 

I really wanted us to start using Google Calendar, so that students could easily load the calendars into their own personal calendars, and so that changes on the master calendar would automatically be pushed to students’ calendars. So, in order to achieve this goal, I wrote a function in R that would automatically generate a random calendar for the TAs and write these calendars to CSV documents that I could then upload to a master Google Calendar. It has worked wonderfully!

My general philosophy with this project was that it is much easier to create a draft and then modify it, than it is to create something from scratch that fits every single parameter. So, I controlled for as many factors as I reasonably could (holidays, weekends, days when we need double the TAs, etc.). Then, as TAs have conflicts or other problems arise, I can just go back and modify the calendar to fit individual needs. You’d be amazed — even with each TA having several constraints on his or her time, I generally only need to make a few modifications to the calendar before the final draft. Randomness works!

There are a few tricks that made this function work seamlessly. My code below depends on the chron package.  To assign TAs, it’s best to think about the dates for your class as a sequence of slots that need to be filled, omitting weekend days, holidays and other days off, and accommodating for days when you might need twice as many TAs (for example right before exams when office hours are flooded).

library(chron)
# sequence of candidate dates: drop weekends, holidays, and other no-TA days
# (start_date, end_date, no_TA_dates, and double_days are user-supplied)
dts <- seq.dates(start_date, end_date)
weekdts <- weekdays(dts)
dates <- dts[weekdts!="Sat" & weekdts!="Sun" & !as.character(dts)%in%no_TA_dates]
# add extra slots for double-TA days, then duplicate for the two daily time-slots
dates <- c(dates, double_days)
dates <- sort(c(dates, dates))

Now that you have the sequence of dates, you need to randomly fill them. I accomplish this by dividing the sequence of dates into bins that are equal to the number of TAs, and then randomly sampling the TA names without replacement for each of the bins. This might leave a remainder of un-filled days at the end, so you have to make up for that as well.

len_dates <- length(dates)
len_tas <- length(ta_names)
mult <- floor(len_dates/len_tas)
temp <- rep(NA, len_tas)
ta_sched <- 0
# fill the schedule bin by bin: each bin is one random permutation of the TAs
for(i in 1:mult){
    temp <- sample(ta_names, len_tas, replace=FALSE)
    ta_sched <- c(ta_sched, temp)
}
ta_sched <- ta_sched[-1]
# fill any remaining un-filled slots with a random sample of the TAs
rem <- length(dates) - length(ta_sched)
temp <- sample(ta_names, rem, replace=FALSE)
ta_sched <- c(ta_sched, temp)
This way, even though the TAs are randomly assigned, I avoid assigning one person to time-slots only during the last week of classes, for example.
The final step is writing these schedules out into a CSV format that can be uploaded to Google Calendar (here’s the format needed). I create a separate CSV for each of the TAs, so that each has his or her own calendar that can be imported into a personal calendar.
nms <- c('Subject','Start Date','Start Time','End Date','End Time','All Day Event','Description','Location','Private')
len_names <- length(nms)
mat <- matrix(nrow=len_dates, ncol=len_names)
mat <- data.frame(mat)
names(mat) <- nms  # so the CSV headers match Google's required format
mat$Subject <- ta_sched
mat$"Start Date" <- dates
mat$"End Date" <- dates
mat$"All Day Event" <- "False"
mat$Description <- "Biostat 621 TA Schedule"
mat$Private <- "False"
# each date appears twice, so the two daily time-slots recycle cleanly
start_times <- c("12:15:00 PM","2:30:00 PM")
end_times <- c("1:15:00 PM","3:20:00 PM")
mat$"Start Time" <- start_times
mat$"End Time" <- end_times
mat$Location <- ta_location
# write one CSV per TA, containing only that TA's events
# (the file-naming scheme here is illustrative)
for(i in 1:len_tas){
    temp <- mat[mat$Subject==ta_names[i], ]
    write.csv(temp, file=paste0(ta_names[i], "_schedule.csv"), row.names=FALSE)
}
This saves a HUGE amount of hassle!

The Workflow:

Here’s the workflow for creating the calendars. Of course, this is very specific to the class I helped teach, but the general workflow might be helpful if you’re facing a similar scheduling nightmare!
  1. Using this function (github — use the example template which calls the gen_cal.R function), create a random calendar for each of the TAs. These calendars will be saved as CSV documents which can then be uploaded to Google Calendar. There are options for the starting day of class, ending day of class, days off from class, days when you want twice as many TAs (for example, right before an exam), and some extra info like the location of the office hours.
  2. On Google Calendar, create a calendar for each of the TAs (I might name mine “Hilary Biostat 621” for example). Make the calendar public (this helps with sharing it to TAs and posting it online).
  3. For each TA, import the CSV document created in step 1 by clicking “import calendar” under the “other calendars” tab in Google Calendar. For the options, select the CSV document for a specific TA, and for the “calendar” option, select the calendar you created in step 2 for that specific TA. (Life will be easiest if you only perform this step once, because if you try to import a calendar twice, you’ll get error messages that you’ve already imported these events before.)
  4. Bonus step: Once all the calendars are imported, embed the calendars onto a website (keep it secret from students in the class if you don’t want them to know when certain TAs are working!). This serves two purposes. First, it serves as the master calendar for all of the TAs and instructors. Second, by distributing it to TAs, they can add their calendar to their personal Google Calendar by clicking the plus icon on the bottom right corner. Your final beautiful product will look something like this! (Note that I also created a calendar that has important class dates like the quizzes and exams.)

The Result:

Time spent developing and implementing: ~3 hours. Time spent each semester re-creating for new TAs: ~30 minutes. Time saved by myself, the administrative coordinator, and all of the TAs — and instead spent feeling awesome: infinity.

Love for ProjectTemplate

The advantage of writing a blog post about the tools you wish you’d used throughout grad school is that, well, it makes you check them out. I went through the ProjectTemplate tutorial, and I’m hooked. Here are the advantages as I see them:

  1. Routine is your friend. This could really go for everything in your life. Small decisions contribute to decision fatigue, even if it’s something as simple as where to put a file. By automating as much as possible, you’re allowing yourself to save your finite willpower for real work instead of grunt work.
  2. It’s easier to start somewhere and then customize, rather than start from the ground up.  After four years in grad school, I have a system that I’ve hacked together for how to organize my analyses, but I would have rather not put the energy into creating the system in the first place. Designing a good system takes up a surprising amount of brain space, whereas modifying one takes much less.  And since the author of ProjectTemplate seems to know what he’s doing, I doubt I’ll modify much.
  3. Reproducibility should be as easy as possible. The way it works, ProjectTemplate makes it very easy to include (but not re-run) the code that you have for preprocessing the data, or other steps that you might only perform once during an analysis. And since reproducibility is such an important aspect of the scientific process, it should be as easy as possible.
  4. Finding things should also be as easy as possible. This is quite similar to reproducibility, but on the individual level. I go back to old analyses all the time to borrow code, which can be extremely frustrating to me since some of my older analyses aren’t well organized (see #2). So it’s nice that you’ll know exactly where you placed something, because you have a uniform system in place.
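For the curious, getting started with ProjectTemplate is only a couple of calls (a sketch based on the tutorial):

```r
library(ProjectTemplate)

create.project("my-analysis")  # scaffolds the standard directory structure

# then every analysis session starts from inside the project:
setwd("my-analysis")
load.project()  # reads the config, loads libraries, and loads/caches data
```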

Just as an aside, I get the impression from the computer scientists I’ve talked to that they don’t necessarily learn “how to code” in coursework, either, but are also expected to develop a system on their own. This I don’t understand, and perhaps when schooling catches up to the computer era we’ll see a change. For example, in high school I learned the five-paragraph format for writing essays, even though very few professional essayists use the format in publications.  But it’s still a solid foundation for expression, which you can stray from as you become more confident in your abilities and command of the process. I suppose this argument requires that coding be taught in high school, but that’s another thing I’d love to see. One day!

The Setup (Part 1)

One of the more challenging things about beginning graduate school was learning what tools and software I needed in order to work efficiently. Unlike college where software requirements were laid out in front of me and everyone seemed to use the same tools, in graduate school the obscurity of the tools, as well as the number of options, seemed to multiply. Additionally since, like most Biostat students in my department, I came from a math background instead of a computing background, I had to drastically increase my computer literacy in a very short period of time. To help new students, a few of us more senior Biostat students have taken to presenting the tools we use in our departmental “Computing Club,” but a blog post really makes more sense. So without further ado, and in the style of one of my favorite blogs, I present…

The Setup (Part 1):

What hardware do you use?

I’m one of the last people in my department clinging to my PC. It’s absolutely true that Macs make it easier to get up and running with the common tools Biostatisticians need, but it’s also true that you can get all of those tools on Windows with a bit of initial work (and in my opinion you can even get more functionality and customization). I have a Lenovo ThinkPad X201 Tablet. This is my second tablet – I got the first because I wanted to take digital notes, which I did for all of grad school with some success. I ended up printing all of my notes out for studying, so I’m not sure it was more convenient than, say, scanning in my notes afterwards (a mammoth project I’m currently undertaking with all of my college coursework). However the tablet has been absolutely essential for teaching, and is very nice for annotating papers. I once bought a Mac that was not a tablet but then panicked and got this computer instead. It’s hard to take away functionality! I hook it up daily to an external monitor and use that huge Microsoft ergonomic keyboard and mouse. I’m actually not sure how so many people do not use external keyboards when they use computers so heavily. Perhaps I just have sensitive wrists.

And what software?

I do all of my research in R, which is really the academic norm for research statistics (and certainly the norm at Hopkins). I do all of my scripting in my beloved Notepad++, which becomes infinitely more awesome by using the little script NppToR. An amazing resource that Hopkins provides for its community is a high-powered, well backed-up computing cluster (so essentially all of my research is done on an extremely local cloud). To use the cluster you need an SSH client and an SCP client. Drastically oversimplified, what this means for me is 1) I have to have something I can open R with or run batch jobs on (the SSH client), and 2) I have to have somewhere I can drag my files (the SCP client). I use PuTTY for the SSH client and WinSCP for the SCP client. My workflow is that I open PuTTY to log onto the cluster and then log into R. Then I open WinSCP and open up my R scripts using Notepad++, and send the code to R by hitting F9 (yay NppToR!). If I’m running a batch job I open a second PuTTY window to submit those to. One last detail is that you must install and run Xming in order for graphs in R on the cluster to work. I think most students do more locally than I do, but I prefer doing as much on the cluster as possible. They always have the latest version of R and do a great job of backing things up. I find it’s less hassle once you get a good system down, and I like living on the cloud enough that that’s worth it to me. It also makes me less aware that I’m running Windows.

For writing I use LaTeX, again the academic norm. TeX is confusing the first time around, so here’s the crash course: TeX is a typesetting system, and LaTeX is the markup language. What this means is that you have to install TeX onto your computer, and then install a LaTeX editor. You “code” documents in a LaTeX editor and then compile them into PDFs, where they look pretty and professional and mathy. I use MiKTeX to install TeX, LEd as my LaTeX editor, and SumatraPDF as my PDF viewer. My workflow is that I open up LEd, then open up the PDF (using SumatraPDF) of the document I’m creating. Then whenever I compile my document, it automatically refreshes to the new version in Sumatra. You can also set it up so that if you double click a word in Sumatra, it automatically highlights it in LEd (tutorial). If you use Adobe this system won’t work because Adobe won’t refresh the document (instead you’ll get an error when you try to compile the code). Don’t use Adobe, basically.

edit: Might be converting to Texmaker as my LaTeX editor. It has spell-check and a built-in PDF Viewer, and took all of two minutes to set up. But I’d still recommend downloading Sumatra.

For presentations I still use PowerPoint (I hear you hatin’). That’s more the norm in genomics than in other biostatistics fields. We like it because we can easily share slides and put in pictures/graphs, and keep out too many equations.

I use Mendeley to organize my downloaded PDF papers. I looked for a PDF organizing solution for years before having this recommended to me, and it’s perfect. It syncs online, is cross-platform (with an iPad app), and most importantly it auto-generates bibtex files which are needed to create bibliographies within LaTeX documents. This means that creating bibtex files (complete with automatically generated citation key) is a drag-and-drop process, rather than a pasting-google-scholar-bibtex-results-into-text-editor process. I see this as a huge improvement.

In the extremely unlikely event that you use a tablet PC, I really like PDF Annotator, which you can get for free if you’re a Hopkins student. I use this software for teaching, grading and annotating papers. If I’m taking a bunch of notes, the native Windows Journal software is quite nice and keeps file sizes much smaller than PDF Annotator.

What’s your dream setup? Or rather, what I wish I’d done differently…
There are a few places where I know I’m not efficient in my computing that I’d like to change, so I’ll list them here. You youngins can learn from my mistakes!

You might notice that I do my R coding on the cloud, but do my writing locally. This means that I can’t easily use Sweave, which is a really cool tool that allows you to write LaTeX and run R code simultaneously (and is much applauded because it allows for easy reproducibility). Instead I have to import my data and graphs into my papers manually (the xtable package in R is essential to my life). This isn’t ideal and I’d like to change it, but the analyses I run usually take days to run, so Sweave loses a lot of its appeal (or at least the way I envisioned using Sweave). To ensure reproducibility I always publish my code on github. In fact I’m going to start migrating all of my research information (including project descriptions, links to papers, etc.) onto github.
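For anyone curious, the xtable workflow I mean is roughly this (a sketch; the file name is illustrative):

```r
library(xtable)

# turn a data frame of results into a LaTeX table
tab <- xtable(head(mtcars), caption = "Example results table")
print(tab, file = "results-table.tex")  # then \input{results-table.tex} in the paper
```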

I wish earlier in my career I had established 1) version control (using git), and 2) uniform project architecture (see for example ProjectTemplate). These are things that are discussed in the Hopkins computing coursework, but it’s unclear how many academics really use them. Github makes version control extremely accessible, but the disadvantage is that everything must be public unless you pay.

I wish I had discovered OneNote while taking courses because I’m sure I would have used and enjoyed it.

I also think standing desks are neat. One day…

Why “Part 1”?
Because Part 2 will be all of my non-academic software and hardware!

edit: I love The Setup and The Setup loves me! They featured me in their community section of the blog!

edit: I’ve compiled all of the tools I mention in this blog into a bitly bundle for your clicking and sharing convenience.