In this article, I will show you how you can use the reshape2
package to convert data from wide to long format and vice versa. It was written and is maintained by Hadley Wickham.
Long format vs Wide format
In wide format data, each column represents a different variable. For example, the mtcars
dataset from the datasets
package can be represented in wide format as:
# Wide format mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
In long format data, one column contains all the possible variables, and then another column contains their respective values. The above data can be represented in long format as:
# Long format variable value 1 mpg 21.0 2 mpg 21.0 3 mpg 22.8 4 mpg 21.4 5 mpg 18.7 6 mpg 18.1 ... variable value 347 carb 2 348 carb 2 349 carb 4 350 carb 6 351 carb 8 352 carb 2
Long format data can have more than two columns, especially when id variables are provided, as demonstrated below.
In practical applications, while wide format is more readable, long format is easier to analyze. Therefore, it is useful to know how to convert between the two.
There are two main functions in
reshape2
:
- melt – convert data from wide format to long format
- cast – convert data from long format to wide format
melt
The melt
function is used to convert data from wide format to long format. Let us work with the mtcars
dataset from the datasets
package. The data is originally in the wide format and looks as shown above. We can melt it to long format as follows:
mtcars$car <- rownames(mtcars) mtcarsMelt <- melt(mtcars) head(mtcarsMelt) cyl car variable value 1 6 Mazda RX4 mpg 21.0 2 6 Mazda RX4 Wag mpg 21.0 3 4 Datsun 710 mpg 22.8 4 6 Hornet 4 Drive mpg 21.4 5 8 Hornet Sportabout mpg 18.7 6 6 Valiant mpg 18.1
Here, melt
automatically selected car
and cyl
as id variables. If you want to use a different set of id variables, they can be specified using id.vars
. We can also rename the variable
and value
columns using the variable.name
and value.name
parameters. For example, if we want to classify all the cars by the number of cylinders and gears, we can do as follows:
mtcarsMelt <- melt(mtcars, id.vars = c('cyl', 'gear'), variable.name = 'carVariable', value.name = 'carValue') head(mtcarsMelt) tail(mtcarsMelt) cyl gear carVariable carValue 1 6 4 mpg 21 2 6 4 mpg 21 3 4 4 mpg 22.8 4 6 3 mpg 21.4 5 8 3 mpg 18.7 6 6 3 mpg 18.1 cyl gear carVariable carValue 315 4 5 car Porsche 914-2 316 4 5 car Lotus Europa 317 8 5 car Ford Pantera L 318 6 5 car Ferrari Dino 319 8 5 car Maserati Bora 320 4 4 car Volvo 142E
Generally, it is a good idea to use a combination of variables which can uniquely identify each data point as id variables. Hence, it is not a good idea to use cyl
and gear
since there are multiple data points which have the same value for the two. This will lead to problems when trying to convert the data back to wide format.
cast
The cast
function is used to convert data from long format to wide format. There are two main types of the cast
function:
- dcast – returns a dataframe as the output
- acast – returns a vector/matrix/array as the output
Since dataframe objects are the most common, I will demonstrate how to use dcast
. We can convert data back to wide format from long format as follows:
mtcarsMelt <- melt(mtcars) mtcarsCast <- dcast(mtcarsMelt, car + cyl ~ variable) head(mtcarsCast) car cyl mpg disp hp drat wt qsec vs am gear carb 1 AMC Javelin 8 15.2 304 150 3.15 3.435 17.30 0 0 3 2 2 Cadillac Fleetwood 8 10.4 472 205 2.93 5.250 17.98 0 0 3 4 3 Camaro Z28 8 13.3 350 245 3.73 3.840 15.41 0 0 3 4 4 Chrysler Imperial 8 14.7 440 230 3.23 5.345 17.42 0 0 3 4 5 Datsun 710 4 22.8 108 93 3.85 2.320 18.61 1 1 4 1 6 Dodge Challenger 8 15.5 318 150 2.76 3.520 16.87 0 0 3 2
The dcast
function uses a formula to convert the data to wide format. Here, the formula is given by car + cyl ~ variable
where car
and cyl
are the id variables, and variable
is the name of the variable column.
Let us see what would happen if we had used cyl
and gear
as id variables instead.
mtcarsCast <- dcast(mtcarsMelt, cyl + gear ~ variable) head(mtcarsCast) cyl gear mpg disp hp drat wt qsec vs am carb car 1 4 3 1 1 1 1 1 1 1 1 1 1 2 4 4 8 8 8 8 8 8 8 8 8 8 3 4 5 2 2 2 2 2 2 2 2 2 2 4 6 3 2 2 2 2 2 2 2 2 2 2 5 6 4 4 4 4 4 4 4 4 4 4 4 6 6 5 1 1 1 1 1 1 1 1 1 1
We get a warning that says: Aggregation function missing: defaulting to length. The dataset just shows the total number of observations for each combination of cyl
and gear
. This is because the dcast
function is unable to uniquely identify each data point. However, it still has some uses. For example, we can find the mean of all variables for each combination of cyl
and gear
by using the fun.aggregate
parameter as follows:
mtcars$car <- NULL mtcarsMelt <- melt(mtcars, id.vars = c('cyl', 'gear')) mtcarsCast <- dcast(mtcarsMelt, cyl + gear ~ variable, fun.aggregate = mean) head(mtcarsCast) cyl gear mpg disp hp drat wt qsec vs am carb 1 4 3 21.500 120.100 97.0 3.70 2.465000 20.0100 1.0 0.00 1.0 2 4 4 26.925 102.625 76.0 4.11 2.378125 19.6125 1.0 0.75 1.5 3 4 5 28.200 107.700 102.0 4.10 1.826500 16.8000 0.5 1.00 2.0 4 6 3 19.750 241.500 107.5 2.92 3.337500 19.8300 1.0 0.00 1.0 5 6 4 19.750 163.800 116.5 3.91 3.093750 17.6700 0.5 0.50 4.0 6 6 5 19.700 145.000 175.0 3.62 2.770000 15.5000 0.0 1.00 6.0
Here, we removed the car
column because we do not want any non-numeric values in the value
column of mtcarsMelt
, or else it will cause errors when trying to calculate the mean.
If you have any questions, feel free to leave a comment or reach out to me on Twitter.