DataScience+ We publish R tutorials from scientists at academic and scientific institutions with a goal to give everyone in the world access to a free knowledge. Our tutorials cover different topics including statistics, data manipulation and visualization!
Data Management

Data Manipulation with dplyr

dplyr is a package for data manipulation, written and maintained by Hadley Wickham. It provides some great, easy-to-use functions that are very handy when performing exploratory data analysis and manipulation. Here, I will provide a basic overview of some of the most useful functions contained in the package.

For this article, I will be using the airquality dataset from the datasets package. The airquality dataset contains information about air quality measurements in New York from May 1973 – September 1973.

The head of the dataset looks like this:

head(airquality)

  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

Before we dive into the functions, let’s load up the two packages:

library(datasets)
library(dplyr)

Okay, now let’s get to the functions.

Filter

The filter function will return all the rows that satisfy a following condition. For example, we can return all rows with Temp greater than 70 as follows:

filter(airquality, Temp > 70)

  Ozone Solar.R Wind Temp Month Day
1    36     118  8.0   72     5   2
2    12     149 12.6   74     5   3
3     7      NA  6.9   74     5  11
4    11     320 16.6   73     5  22
5    45     252 14.9   81     5  29
6   115     223  5.7   79     5  30
...

We can specify multiple conditions. The example below will return all rows with Temp larger than 80 and Month higher than 5.

filter(airquality, Temp > 80 & Month > 5)

   Ozone Solar.R Wind Temp Month Day
1     NA     186  9.2   84     6   4
2     NA     220  8.6   85     6   5
3     29     127  9.7   82     6   7
4     NA     273  6.9   87     6   8
5     71     291 13.8   90     6   9
6     39     323 11.5   87     6  10
...

Mutate

Mutate is used to add new variables to the data. For example, let’s adds a new column that displays the temperature in Celsius.

mutate(airquality, TempInC = (Temp - 32) * 5 / 9)

  Ozone Solar.R Wind Temp Month Day  TempInC
1    41     190  7.4   67     5   1 19.44444
2    36     118  8.0   72     5   2 22.22222
3    12     149 12.6   74     5   3 23.33333
4    18     313 11.5   62     5   4 16.66667
5    NA      NA 14.3   56     5   5 13.33333
...

Summarise

The summarise function is used to summarise multiple values into a single value. It is very powerful when used in conjunction with the other functions in the dplyr package, as demonstrated below. na.rm = TRUE will remove all NA values while calculating the mean, so that it doesn’t produce spurious results.

summarise(airquality, mean(Temp, na.rm = TRUE))

  mean(Temp)
1   77.88235

Group By

The group_by function is used to group data by one or more variables. For example, we can group the data together based on the Month, and then use the summarise function to calculate and display the mean temperature for each month.

summarise(group_by(airquality, Month), mean(Temp, na.rm = TRUE))

  Month mean(Temp)
1     5   65.54839
2     6   79.10000
3     7   83.90323
4     8   83.96774
5     9   76.90000

Sample

The sample function is used to select random rows from a table. The first line of code randomly selects ten rows from the dataset, and the second line of code randomly selects 15 rows (10% of the original 153 rows) from the dataset.

sample_n(airquality, size = 10)
sample_frac(airquality, size = 0.1)

Count

The count function tallies observations based on a group. It is slightly similar to the table function in the base package. For example:

count(airquality, Month)

  Month  n
1     5 31
2     6 30
3     7 31
4     8 31
5     9 30

This means that there are 31 rows with Month = 5, 30 rows with Month = 6, and so on.

Arrange

The arrange function is used to arrange rows by variables. Currently, the airquality dataset is arranged based on Month, and then Day. We can use the arrange function to arrange the rows in the descending order of Month, and then in the ascending order of Day.

arrange(airquality, desc(Month), Day)

  Ozone Solar.R Wind Temp Month Day
1    96     167  6.9   91     9   1
2    78     197  5.1   92     9   2
3    73     183  2.8   93     9   3
4    91     189  4.6   93     9   4
5    47      95  7.4   87     9   5
6    32      92 15.5   84     9   6

Pipe

The pipe operator in R, represented by %>% can be used to chain code together. It is very useful when you are performing several operations on data, and don’t want to save the output at each intermediate step.

For example, let’s say we want to remove all the data corresponding to Month = 5, group the data by month, and then find the mean of the temperature each month. The conventional way to write the code for this would be:

filteredData <- filter(airquality, Month != 5)
groupedData <- group_by(filteredData, Month)
summarise(groupedData, mean(Temp, na.rm = TRUE))

With piping, the above code can be rewritten as:

airquality %>% 
    filter(Month != 5) %>% 
    group_by(Month) %>% 
    summarise(mean(Temp, na.rm = TRUE))

This is a very basic example, and the usefulness may not be very apparent, but as the number of operations/functions perfomed on the data increase, the pipe operator becomes more and more useful!

That brings us to the end of this article. I hope you enjoyed it and found it useful. If you have questions, feel free to leave a comment or reach out to me on Twitter.