R

Posts about working with the R statistical programming language.

Reshaping Data with R

I was scrolling through my Facebook feed yesterday afternoon and one post stood out among the endless stream of memes. A friend had a data transformation question! She said:

Data handling question! I fairly frequently receive “tall” client data that needs to be flattened to be functional. So for example, Jane Doe may have twelve rows of data, each for a different month of performance, and I need to compress this so that her monthly performance displays on one row only, with columns for each month. Anyone know of a relatively painless way to do this quickly? I’ve been sending files like this to our data guy to flatten, but I’d love to be able to do it myself.

I immediately thought: “Hey! This sounds like a job for R!” Unfortunately, my R knowledge is limited to knowing that it’s a statistical programming language so in the interests of both helping out my friend and getting a little experience with something I’ve wanted to play around with, I set out to see how easy – or difficult – this task would be in R. It turned out to be quite simple.

Naturally I didn’t even have the R environment installed on my laptop so I obtained the installer from the R project site. Once it installed, I was off.

I started by creating a CSV that would simulate the scenario she described. My highly contrived data looked like this:

Name Month Value
Jane Doe January 10
Jane Doe February 11
Jane Doe March 5
Jane Doe April 8
Jane Doe May 10
Jane Doe June 15
Jane Doe July 12
Jane Doe August 8
Jane Doe September 11
Jane Doe October 7
Jane Doe November 9
Jane Doe December 13
John Doe January 5
John Doe February 3
John Doe March 8
John Doe April 7
John Doe May 6
John Doe June 5
John Doe July 5
John Doe August 8
John Doe September 4
John Doe October 2
John Doe November 5
John Doe December 9

With some source data ready it was time to move onto reading the data. It turns out that R has some nice capabilities for reading CSV files so I loaded it as follows:

df <- read.csv(file="C:\\Dev\\Data.csv", header = TRUE)

Here we’re reading the CSV and binding the resulting data frame to df.

Next, I needed to transform the data. After a bit of research I found there are a few ways to do this but using the reshape2 package seemed like the easiest approach. I installed the package and referenced it in my script.

library(reshape2)

Reshape2 is designed to transform data between wide and tall formats. Its functionality is built around two complementary functions; melt and cast. The melt function handles transforming wide data to a tall format whereas cast transforms tall data into a wide format. Given that my friend was trying to convert from tall to wide, cast was clearly the one I needed.

The cast function comes in two flavors: dcast and acast which result in either a new data frame or a vector/matrix/array respectively. To (hopefully) keep things simple I opted for the data frame approach and used dcast for this transformation.

To transform the data I simply needed to apply the dcast function to the CSV data. That is done like this:

pivot = dcast(df, Name~Month)

Here, the first argument df is our CSV data. The second argument specifies the column names that will serve as the axis for the transformation.

Finally I needed to persist the data. Just as with loading, R can easily write a CSV file.

write.csv(pivot, file="C:\\Dev\\Pivot.csv")

Finally, I executed the script and here’s the result:

Name April August December February January July June March May November October September
1 Jane Doe 8 8 13 11 10 12 15 5 10 9 7 11
2 John Doe 7 8 9 3 5 5 5 8 6 5 2 4


Although the data was transformed correctly it wasn’t quite what I expected to see. There are two problems here:

  1. The row numbers are included
  2. The months are alphabetized

It turns out that both of these problems are easy to solve. Let’s first handle eliminating the row numbers.

To exclude the row numbers from the resulting CSV we simply need to set the optional row.names parameter to FALSE in the write.csv function call.

write.csv(pivot, file="C:\\Dev\\Pivot.csv", row.names=FALSE)

Fixing the column order isn’t quite as simple but it’s really not bad either; we simply need to manually specify the column order by treating the frame as a vector and using the concatenation function to specify the new order.

In the interest of space I used the index approach but it’s also acceptable to use the column names instead.

pivot = dcast(df, Name~Month)[,c(1,6,5,9,2,10,8,7,3,13,12,11,4)]

Putting this all together, the final script now looks like this:

library(reshape2)

df <- read.csv(file="C:\\Dev\\Data.csv", header = TRUE)

pivot = dcast(df, Name~Month)[,c(1,6,5,9,2,10,8,7,3,13,12,11,4)]

write.csv(pivot, file="C:\\Dev\\Pivot.csv", row.names=FALSE)

Run this version and you’ll get the following, cleaned up result:

Name January February March April May June July August September October November December
Jane Doe 10 11 5 8 10 15 12 8 11 7 9 13
John Doe 5 3 8 7 6 5 5 8 4 2 5 9

So I ask this of my friends and readers that are more versed in R than I. Is this the preferred way to handle such transformations? If not, how would you approach it?