DataScience+ We publish R tutorials from scientists at academic and scientific institutions with a goal to give everyone in the world access to a free knowledge. Our tutorials cover different topics including statistics, data manipulation and visualization!
Data Management

Data Manipulation with reshape2

In this article, I will show you how you can use the reshape2 package to convert data from wide to long format and vice versa. It was written and is maintained by Hadley Wickham.

Long format vs Wide format

In wide format data, each column represents a different variable. For example, the mtcars dataset from the datasets package can be represented in wide format as:

# Wide format
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

In long format data, one column contains all the possible variables, and then another column contains their respective values. The above data can be represented in long format as:

# Long format
  variable value
1      mpg  21.0
2      mpg  21.0
3      mpg  22.8
4      mpg  21.4
5      mpg  18.7
6      mpg  18.1
...
    variable value
347     carb     2
348     carb     2
349     carb     4
350     carb     6
351     carb     8
352     carb     2

Long format data can have more than two columns, especially when id variables are provided, as demonstrated below.

In practical applications, while wide format is more readable, long format is easier to analyze. Therefore, it is useful to know how to convert between the two.

There are two main functions in reshape2:

  • melt – convert data from wide format to long format
  • cast – convert data from long format to wide format

melt

The melt function is used to convert data from wide format to long format. Let us work with the mtcars dataset from the datasets package. The data is originally in the wide format and looks as shown above. We can melt it to long format as follows:

mtcars$car <- rownames(mtcars)
mtcarsMelt <- melt(mtcars)
head(mtcarsMelt)

  cyl               car variable value
1   6         Mazda RX4      mpg  21.0
2   6     Mazda RX4 Wag      mpg  21.0
3   4        Datsun 710      mpg  22.8
4   6    Hornet 4 Drive      mpg  21.4
5   8 Hornet Sportabout      mpg  18.7
6   6           Valiant      mpg  18.1

Here, melt automatically selected car and cyl as id variables. If you want to use a different set of id variables, they can be specified using id.vars. We can also rename the variable and value columns using the variable.name and value.name parameters. For example, if we want to classify all the cars by the number of cylinders and gears, we can do as follows:

mtcarsMelt <- melt(mtcars, id.vars = c('cyl', 'gear'), variable.name = 'carVariable', value.name = 'carValue')
head(mtcarsMelt)

  cyl gear carVariable carValue
1   6    4         mpg       21
2   6    4         mpg       21
3   4    4         mpg     22.8
4   6    3         mpg     21.4
5   8    3         mpg     18.7
6   6    3         mpg     18.1

tail(mtcarsMelt)

    cyl gear carVariable       carValue
315   4    5         car  Porsche 914-2
316   4    5         car   Lotus Europa
317   8    5         car Ford Pantera L
318   6    5         car   Ferrari Dino
319   8    5         car  Maserati Bora
320   4    4         car     Volvo 142E

Generally, it is a good idea to use a combination of variables which can uniquely identify each data point as id variables. Hence, it is not a good idea to use cyl and gear since there are multiple data points which have the same value for the two. This will lead to problems when trying to convert the data back to wide format.

cast

The cast function is used to convert data from long format to wide format. There are two main types of the cast function:

  • dcast – returns a dataframe as the output
  • acast – returns a vector/matrix/array as the output

Since dataframe objects are the most common, I will demonstrate how to use dcast. We can convert data back to wide format from long format as follows:

mtcarsMelt <- melt(mtcars)
mtcarsCast <- dcast(mtcarsMelt, car + cyl ~ variable)
head(mtcarsCast)

                 car cyl  mpg disp  hp drat    wt  qsec vs am gear carb
1        AMC Javelin   8 15.2  304 150 3.15 3.435 17.30  0  0    3    2
2 Cadillac Fleetwood   8 10.4  472 205 2.93 5.250 17.98  0  0    3    4
3         Camaro Z28   8 13.3  350 245 3.73 3.840 15.41  0  0    3    4
4  Chrysler Imperial   8 14.7  440 230 3.23 5.345 17.42  0  0    3    4
5         Datsun 710   4 22.8  108  93 3.85 2.320 18.61  1  1    4    1
6   Dodge Challenger   8 15.5  318 150 2.76 3.520 16.87  0  0    3    2

The dcast function uses a formula to convert the data to wide format. Here, the formula is given by car + cyl ~ variable where car and cyl are the id variables, and variable is the name of the variable column.

Let us see what would happen if we had used cyl and gear as id variables instead.

mtcarsCast <- dcast(mtcarsMelt, cyl + gear ~ variable)
head(mtcarsCast)

  cyl gear mpg disp hp drat wt qsec vs am carb car
1   4    3   1    1  1    1  1    1  1  1    1   1
2   4    4   8    8  8    8  8    8  8  8    8   8
3   4    5   2    2  2    2  2    2  2  2    2   2
4   6    3   2    2  2    2  2    2  2  2    2   2
5   6    4   4    4  4    4  4    4  4  4    4   4
6   6    5   1    1  1    1  1    1  1  1    1   1

We get a warning that says: Aggregation function missing: defaulting to length. The dataset just shows the total number of observations for each combination of cyl and gear. This is because the dcast function is unable to uniquely identify each data point. However, it still has some uses. For example, we can find the mean of all variables for each combination of cyl and gear by using the fun.aggregate parameter as follows:

mtcars$car <- NULL
mtcarsMelt <- melt(mtcars, id.vars = c('cyl', 'gear'))
mtcarsCast <- dcast(mtcarsMelt, cyl + gear ~ variable, fun.aggregate = mean)
head(mtcarsCast)

  cyl gear    mpg    disp    hp drat       wt    qsec  vs   am carb
1   4    3 21.500 120.100  97.0 3.70 2.465000 20.0100 1.0 0.00  1.0
2   4    4 26.925 102.625  76.0 4.11 2.378125 19.6125 1.0 0.75  1.5
3   4    5 28.200 107.700 102.0 4.10 1.826500 16.8000 0.5 1.00  2.0
4   6    3 19.750 241.500 107.5 2.92 3.337500 19.8300 1.0 0.00  1.0
5   6    4 19.750 163.800 116.5 3.91 3.093750 17.6700 0.5 0.50  4.0
6   6    5 19.700 145.000 175.0 3.62 2.770000 15.5000 0.0 1.00  6.0

Here, we removed the car column because we do not want any non-numeric values in the value column of mtcarsMelt, or else it will cause errors when trying to calculate the mean.

If you have any questions, feel free to leave a comment or reach out to me on Twitter.