Every time I had to convert dataset from long to the wide format, I needed to Google for the solution. Although I found an answer every time, yet it was impossible to remember when I needed since I did not fully understand how transforming the dataset works. I prefer to use tidyverse package for this task, but I know that reshape
package works as well.
First, I create a dataset with 9 rows and 3 IDs in long format to take as an example.
library(tidyverse)
set.seed(519)
longdata1 <- data.frame(ID = 1:3,
expand.grid(Name = c("Dora", "John", "Rob"), Year = 2012:2014),
BMI = round(runif(9, 18, 35), 0)
)
longdata1
## ID Name Year BMI
## 1 1 Dora 2012 24
## 2 2 John 2012 33
## 3 3 Rob 2012 33
## 4 1 Dora 2013 34
## 5 2 John 2013 27
## 6 3 Rob 2013 27
## 7 1 Dora 2014 19
## 8 2 John 2014 25
## 9 3 Rob 2014 34
The dataset includes 3 people who have 3 different measurements of BMI for each year from 2012-2014. To transform the dataset longdata1
from long to wide is a simple task. In spread
function I put the variables I want to separate (Year, BMI).
Transform dataset from long to wide
wide = longdata1 %>%
spread(Year, BMI)
wide
## ID Name 2012 2013 2014
## 1 1 Dora 24 34 19
## 2 2 John 33 27 25
## 3 3 Rob 33 27 34
Below, I will convert dataset from wide to long with gather
function, in which I include variables I like to put in one column as Year, and besides that the column with the values of BMI.
Transform dataset from wide to long.
longdata2 = wide %>%
gather("2012", "2013", "2014", key = Year, value = BMI)
longdata2
## ID Name Year BMI
## 1 1 Dora 2012 24
## 2 2 John 2012 33
## 3 3 Rob 2012 33
## 4 1 Dora 2013 34
## 5 2 John 2013 27
## 6 3 Rob 2013 27
## 7 1 Dora 2014 19
## 8 2 John 2014 25
## 9 3 Rob 2014 34
Compare the orginal old long format dataset with the newly created
identical(longdata2$BMI, longdata1$BMI)
## [1] TRUE
Convert data from long to wide format with multiple columns
What I presented above is easy, because the dataset has only 2 variables for transforming, but often the dataset has more than 2 variable for conversion.
I will add new variable to the dataset
set.seed(520)
long3 <- data.frame(ID = 1:3,
expand.grid(Name = c("Dora", "John", "Rob"), Year = 2012:2014),
BMI = round(runif(9, 18, 35), 0),
Cholesterol = round(runif(9, 200, 300), 0)
)
long3
## ID Name Year BMI Cholesterol
## 1 1 Dora 2012 20 244
## 2 2 John 2012 26 299
## 3 3 Rob 2012 20 218
## 4 1 Dora 2013 19 227
## 5 2 John 2013 20 211
## 6 3 Rob 2013 28 204
## 7 1 Dora 2014 27 272
## 8 2 John 2014 21 277
## 9 3 Rob 2014 23 245
Confirming that the spread()
won't work
wide2 = long3 %>%
spread(Year, BMI)
wide2
## ID Name Cholesterol 2012 2013 2014
## 1 1 Dora 227 NA 19 NA
## 2 1 Dora 244 20 NA NA
## 3 1 Dora 272 NA NA 27
## 4 2 John 211 NA 20 NA
## 5 2 John 277 NA NA 21
## 6 2 John 299 26 NA NA
## 7 3 Rob 204 NA 28 NA
## 8 3 Rob 218 20 NA NA
## 9 3 Rob 245 NA NA 23
Or, the code spread(Year, BMI, Cholesterol)
will not work.
Now lets see the solution, step by step:
In order to use spread()
we need to have 2 variables in long format. So first, I will gather the data.
long3 %>%
gather("Year", "BMI", "Cholesterol", key = variable, value = number)
## ID Name variable number
## 1 1 Dora Year 2012
## 2 2 John Year 2012
## 3 3 Rob Year 2012
## 4 1 Dora Year 2013
## 5 2 John Year 2013
## 6 3 Rob Year 2013
## 7 1 Dora Year 2014
## 8 2 John Year 2014
## 9 3 Rob Year 2014
## 10 1 Dora BMI 20
## 11 2 John BMI 26
## 12 3 Rob BMI 20
## 13 1 Dora BMI 19
## 14 2 John BMI 20
## 15 3 Rob BMI 28
## 16 1 Dora BMI 27
## 17 2 John BMI 21
## 18 3 Rob BMI 23
## 19 1 Dora Cholesterol 244
## 20 2 John Cholesterol 299
## 21 3 Rob Cholesterol 218
## 22 1 Dora Cholesterol 227
## 23 2 John Cholesterol 211
## 24 3 Rob Cholesterol 204
## 25 1 Dora Cholesterol 272
## 26 2 John Cholesterol 277
## 27 3 Rob Cholesterol 245
The variable
has similar name which unable the spread function. So I need to create another variable which will identify each of them.
long3 %>%
group_by(ID) %>%
mutate(Visit = 1:n())
## # A tibble: 9 x 6
## # Groups: ID [3]
## ID Name Year BMI Cholesterol Visit
## <int> <fct> <int> <dbl> <dbl> <int>
## 1 1 Dora 2012 20 244 1
## 2 2 John 2012 26 299 1
## 3 3 Rob 2012 20 218 1
## 4 1 Dora 2013 19 227 2
## 5 2 John 2013 20 211 2
## 6 3 Rob 2013 28 204 2
## 7 1 Dora 2014 27 272 3
## 8 2 John 2014 21 277 3
## 9 3 Rob 2014 23 245 3
long3 %>%
group_by(ID) %>%
mutate(Visit = 1:n()) %>%
gather("Year", "BMI", "Cholesterol", key = variable, value = number)
## # A tibble: 27 x 5
## # Groups: ID [3]
## ID Name Visit variable number
## <int> <fct> <int> <chr> <dbl>
## 1 1 Dora 1 Year 2012
## 2 2 John 1 Year 2012
## 3 3 Rob 1 Year 2012
## 4 1 Dora 2 Year 2013
## 5 2 John 2 Year 2013
## 6 3 Rob 2 Year 2013
## 7 1 Dora 3 Year 2014
## 8 2 John 3 Year 2014
## 9 3 Rob 3 Year 2014
## 10 1 Dora 1 BMI 20
## # … with 17 more rows
Next step is to combine variables together and prepare for spread()
function
long3 %>%
group_by(ID) %>%
mutate(Visit = 1:n()) %>%
gather("Year", "BMI", "Cholesterol", key = variable, value = number) %>%
unite(combi, variable, Visit)
## # A tibble: 27 x 4
## # Groups: ID [3]
## ID Name combi number
## <int> <fct> <chr> <dbl>
## 1 1 Dora Year_1 2012
## 2 2 John Year_1 2012
## 3 3 Rob Year_1 2012
## 4 1 Dora Year_2 2013
## 5 2 John Year_2 2013
## 6 3 Rob Year_2 2013
## 7 1 Dora Year_3 2014
## 8 2 John Year_3 2014
## 9 3 Rob Year_3 2014
## 10 1 Dora BMI_1 20
## # … with 17 more rows
Each variable in combi
has unique name for same ID and is ready for spread.
long3 %>%
group_by(ID) %>%
mutate(Visit = 1:n()) %>%
gather("Year", "BMI", "Cholesterol", key = variable, value = number) %>%
unite(combi, variable, Visit) %>%
spread(combi, number)
## # A tibble: 3 x 11
## # Groups: ID [3]
## ID Name BMI_1 BMI_2 BMI_3 Cholesterol_1 Cholesterol_2 Cholesterol_3
## <int> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Dora 20 19 27 244 227 272
## 2 2 John 26 20 21 299 211 277
## 3 3 Rob 20 28 23 218 204 245
## # … with 3 more variables: Year_1 <dbl>, Year_2 <dbl>, Year_3 <dbl>
Now the long dataset with more than 2 columns is in wide format.