We share R tutorials from scientists at academic and scientific institutions with a goal to give everyone in the world access to a free knowledge. Our tutorials cover different topics including statistics, data manipulation and visualization!
Data Management

# Data Manipulation with reshape2

In this article, I will show you how you can use the reshape2 package to convert data from wide to long format and vice versa. It was written and is maintained by Hadley Wickham.

## Long format vs Wide format

In wide format data, each column represents a different variable. For example, the mtcars dataset from the datasets package can be represented in wide format as:

# Wide format
mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

In long format data, one column contains all the possible variables, and then another column contains their respective values. The above data can be represented in long format as:

# Long format
variable value
1      mpg  21.0
2      mpg  21.0
3      mpg  22.8
4      mpg  21.4
5      mpg  18.7
6      mpg  18.1
...
variable value
347     carb     2
348     carb     2
349     carb     4
350     carb     6
351     carb     8
352     carb     2

Long format data can have more than two columns, especially when id variables are provided, as demonstrated below.

In practical applications, while wide format is more readable, long format is easier to analyze. Therefore, it is useful to know how to convert between the two.

There are two main functions in reshape2:

• melt – convert data from wide format to long format
• cast – convert data from long format to wide format

## melt

The melt function is used to convert data from wide format to long format. Let us work with the mtcars dataset from the datasets package. The data is originally in the wide format and looks as shown above. We can melt it to long format as follows:

mtcars$car <- rownames(mtcars) mtcarsMelt <- melt(mtcars) head(mtcarsMelt) cyl car variable value 1 6 Mazda RX4 mpg 21.0 2 6 Mazda RX4 Wag mpg 21.0 3 4 Datsun 710 mpg 22.8 4 6 Hornet 4 Drive mpg 21.4 5 8 Hornet Sportabout mpg 18.7 6 6 Valiant mpg 18.1 Here, melt automatically selected car and cyl as id variables. If you want to use a different set of id variables, they can be specified using id.vars. We can also rename the variable and value columns using the variable.name and value.name parameters. For example, if we want to classify all the cars by the number of cylinders and gears, we can do as follows: mtcarsMelt <- melt(mtcars, id.vars = c('cyl', 'gear'), variable.name = 'carVariable', value.name = 'carValue') head(mtcarsMelt) cyl gear carVariable carValue 1 6 4 mpg 21 2 6 4 mpg 21 3 4 4 mpg 22.8 4 6 3 mpg 21.4 5 8 3 mpg 18.7 6 6 3 mpg 18.1 tail(mtcarsMelt) cyl gear carVariable carValue 315 4 5 car Porsche 914-2 316 4 5 car Lotus Europa 317 8 5 car Ford Pantera L 318 6 5 car Ferrari Dino 319 8 5 car Maserati Bora 320 4 4 car Volvo 142E Generally, it is a good idea to use a combination of variables which can uniquely identify each data point as id variables. Hence, it is not a good idea to use cyl and gear since there are multiple data points which have the same value for the two. This will lead to problems when trying to convert the data back to wide format. ## cast The cast function is used to convert data from long format to wide format. There are two main types of the cast function: • dcast – returns a dataframe as the output • acast – returns a vector/matrix/array as the output Since dataframe objects are the most common, I will demonstrate how to use dcast. We can convert data back to wide format from long format as follows: mtcarsMelt <- melt(mtcars) mtcarsCast <- dcast(mtcarsMelt, car + cyl ~ variable) head(mtcarsCast) car cyl mpg disp hp drat wt qsec vs am gear carb 1 AMC Javelin 8 15.2 304 150 3.15 3.435 17.30 0 0 3 2 2 Cadillac Fleetwood 8 10.4 472 205 2.93 5.250 17.98 0 0 3 4 3 Camaro Z28 8 13.3 350 245 3.73 3.840 15.41 0 0 3 4 4 Chrysler Imperial 8 14.7 440 230 3.23 5.345 17.42 0 0 3 4 5 Datsun 710 4 22.8 108 93 3.85 2.320 18.61 1 1 4 1 6 Dodge Challenger 8 15.5 318 150 2.76 3.520 16.87 0 0 3 2 The dcast function uses a formula to convert the data to wide format. Here, the formula is given by car + cyl ~ variable where car and cyl are the id variables, and variable is the name of the variable column. Let us see what would happen if we had used cyl and gear as id variables instead. mtcarsCast <- dcast(mtcarsMelt, cyl + gear ~ variable) head(mtcarsCast) cyl gear mpg disp hp drat wt qsec vs am carb car 1 4 3 1 1 1 1 1 1 1 1 1 1 2 4 4 8 8 8 8 8 8 8 8 8 8 3 4 5 2 2 2 2 2 2 2 2 2 2 4 6 3 2 2 2 2 2 2 2 2 2 2 5 6 4 4 4 4 4 4 4 4 4 4 4 6 6 5 1 1 1 1 1 1 1 1 1 1 We get a warning that says: Aggregation function missing: defaulting to length. The dataset just shows the total number of observations for each combination of cyl and gear. This is because the dcast function is unable to uniquely identify each data point. However, it still has some uses. For example, we can find the mean of all variables for each combination of cyl and gear by using the fun.aggregate parameter as follows: mtcars$car <- NULL
mtcarsMelt <- melt(mtcars, id.vars = c('cyl', 'gear'))
mtcarsCast <- dcast(mtcarsMelt, cyl + gear ~ variable, fun.aggregate = mean)
6   6    5 19.700 145.000 175.0 3.62 2.770000 15.5000 0.0 1.00  6.0
Here, we removed the car column because we do not want any non-numeric values in the value column of mtcarsMelt, or else it will cause errors when trying to calculate the mean.