dplyr
is a package for data manipulation, written and maintained by Hadley Wickham. It provides some great, easy-to-use functions that are very handy when performing exploratory data analysis and manipulation. Here, I will provide a basic overview of some of the most useful functions contained in the package.
For this article, I will be using the airquality
dataset from the datasets
package. The airquality
dataset contains information about air quality measurements in New York from May 1973 – September 1973.
The head of the dataset looks like this:
head(airquality) Ozone Solar.R Wind Temp Month Day 1 41 190 7.4 67 5 1 2 36 118 8.0 72 5 2 3 12 149 12.6 74 5 3 4 18 313 11.5 62 5 4 5 NA NA 14.3 56 5 5 6 28 NA 14.9 66 5 6
Before we dive into the functions, let’s load up the two packages:
library(datasets) library(dplyr)
Okay, now let’s get to the functions.
Filter
The filter function will return all the rows that satisfy a following condition. For example, we can return all rows with Temp greater than 70 as follows:
filter(airquality, Temp > 70) Ozone Solar.R Wind Temp Month Day 1 36 118 8.0 72 5 2 2 12 149 12.6 74 5 3 3 7 NA 6.9 74 5 11 4 11 320 16.6 73 5 22 5 45 252 14.9 81 5 29 6 115 223 5.7 79 5 30 ...
We can specify multiple conditions. The example below will return all rows with Temp larger than 80 and Month higher than 5.
filter(airquality, Temp > 80 & Month > 5) Ozone Solar.R Wind Temp Month Day 1 NA 186 9.2 84 6 4 2 NA 220 8.6 85 6 5 3 29 127 9.7 82 6 7 4 NA 273 6.9 87 6 8 5 71 291 13.8 90 6 9 6 39 323 11.5 87 6 10 ...
Mutate
Mutate is used to add new variables to the data. For example, let’s adds a new column that displays the temperature in Celsius.
mutate(airquality, TempInC = (Temp - 32) * 5 / 9) Ozone Solar.R Wind Temp Month Day TempInC 1 41 190 7.4 67 5 1 19.44444 2 36 118 8.0 72 5 2 22.22222 3 12 149 12.6 74 5 3 23.33333 4 18 313 11.5 62 5 4 16.66667 5 NA NA 14.3 56 5 5 13.33333 ...
Summarise
The summarise function is used to summarise multiple values into a single value. It is very powerful when used in conjunction with the other functions in the dplyr
package, as demonstrated below. na.rm = TRUE
will remove all NA
values while calculating the mean, so that it doesn’t produce spurious results.
summarise(airquality, mean(Temp, na.rm = TRUE)) mean(Temp) 1 77.88235
Group By
The group_by
function is used to group data by one or more variables. For example, we can group the data together based on the Month, and then use the summarise function to calculate and display the mean temperature for each month.
summarise(group_by(airquality, Month), mean(Temp, na.rm = TRUE)) Month mean(Temp) 1 5 65.54839 2 6 79.10000 3 7 83.90323 4 8 83.96774 5 9 76.90000
Sample
The sample function is used to select random rows from a table. The first line of code randomly selects ten rows from the dataset, and the second line of code randomly selects 15 rows (10% of the original 153 rows) from the dataset.
sample_n(airquality, size = 10) sample_frac(airquality, size = 0.1)
Count
The count function tallies observations based on a group. It is slightly similar to the table function in the base package. For example:
count(airquality, Month) Month n 1 5 31 2 6 30 3 7 31 4 8 31 5 9 30
This means that there are 31 rows with Month = 5, 30 rows with Month = 6, and so on.
Arrange
The arrange function is used to arrange rows by variables. Currently, the airquality dataset is arranged based on Month, and then Day. We can use the arrange function to arrange the rows in the descending order of Month, and then in the ascending order of Day.
arrange(airquality, desc(Month), Day) Ozone Solar.R Wind Temp Month Day 1 96 167 6.9 91 9 1 2 78 197 5.1 92 9 2 3 73 183 2.8 93 9 3 4 91 189 4.6 93 9 4 5 47 95 7.4 87 9 5 6 32 92 15.5 84 9 6
Pipe
The pipe operator in R, represented by %>%
can be used to chain code together. It is very useful when you are performing several operations on data, and don’t want to save the output at each intermediate step.
For example, let’s say we want to remove all the data corresponding to Month = 5, group the data by month, and then find the mean of the temperature each month. The conventional way to write the code for this would be:
filteredData <- filter(airquality, Month != 5) groupedData <- group_by(filteredData, Month) summarise(groupedData, mean(Temp, na.rm = TRUE))
With piping, the above code can be rewritten as:
airquality %>% filter(Month != 5) %>% group_by(Month) %>% summarise(mean(Temp, na.rm = TRUE))
This is a very basic example, and the usefulness may not be very apparent, but as the number of operations/functions perfomed on the data increase, the pipe operator becomes more and more useful!
That brings us to the end of this article. I hope you enjoyed it and found it useful. If you have questions, feel free to leave a comment or reach out to me on Twitter.