Efficient aggregation (and more) using data.table

In my recent post I have written about the aggregate function in base R and gave some examples on its use. This post repeats the same examples using data.table instead, the most efficient implementation of the aggregation logic in R, plus some additional use cases showing the power of the data.table package.

This post focuses on the aggregation aspect of the data.table and only touches upon all other uses of this versatile tool. For a great resource on everything data.table, head to the authors’ own free training material.

All code snippets below require the data.table package to be installed and loaded:

install.packages("data.table")
library(data.table)

Basic examples

Here is the example for the number of appearances of the unique values in the data:

values <- data.table(value = c("a", "a", "a", "a", "a", 
                               "b", "b", "b", 
                               "c", "c", "c", "c"))
values
nr.of.appearances <- values[, list(nr.appearances=length(value)), 
                                by = list(unique.values = value)]
nr.of.appearances
    value
 1:     a
 2:     a
 3:     a
.....

   unique.values nr.appearances
1:             a              5
2:             b              3
3:             c              4

You can notice a lot of differences here. First of all, no additional function was invoke. Instead, the [] operator has been overloaded for the data.table class allowing for a different signature: it has three inputs instead of the usual two for a data.frame. We will return to this in a moment. Secondly, the columns of the data.table were not referenced by their name as a string, but as a variable instead. This is a very important aspect of the data.table syntax. Last but not least as implied by the fact that both the aggregating function and the grouping variable are passed on as a list one can not only group by multiple variables as in aggregate but you can also use multiple aggregation functions at the same time. I will show an example of that later.

Coming back to the overloading of the [] operator: a data.table is at the same time also a data.frame. See e.g.

class(values)
"data.table" "data.frame"

This means that you can use all (or at least most of) the data.frame functionality as well. Among others you can use aggregate like you would use for a data.frame:

values <- data.frame(value = c("a", "a", "a", "a", "a", 
                               "b", "b", "b", 
                               "c", "c", "c", "c"))

nr.of.appearances <- aggregate(x = values, 
                               by = list(unique.values = values$value), 
                               FUN = length)

EDIT (02/12/2015): Matt Dowle from the data.table team suggested a more efficient implementation for this in the comments (thanks, Matt!):

nr.of.appearances <- values[, list(nr.appearances=.N), 
                            by = list(unique.values = value)]

You can also use the [] operator in the classic data.frame way by passing on only two input variables:

values[values$value == "a",]
   value
1:     a
2:     a
3:     a
...

UPDATE 02/12/2015
Matt Dowle from the data.table team warned in the comments against this way of filtering a data.table and suggested an alternative (thanks, Matt!):

values[value=="a",]
   value
1:     a
2:     a
3:     a
...

Another exciting possibility with data.table is creating a new column in a data.table derived from existing columns – with or without aggregation. Examples of both are shown below:

values[, new.col := paste0(value, value)]
values
values[, new.col := paste0(value, length(value)), by = list(unique.values = value)]
values
    value new.col
 1:     a      aa
 2:     a      aa
 3:     a      aa
 4:     a      aa
.....

    value new.col
 1:     a      a5
 2:     a      a5
 3:     a      a5
 4:     a      a5
.....

Notice that in both cases the data.table was directly modified, rather than left unchanged with the results returned. That’s right: data.table creates side effect by using copy-by-reference rather than copy-by-value as (almost) everything else in R. It is arguable whether this is alien to the nature of a (more or less) functional language like R but one thing is sure: it is extremely efficient, especially when the variable hardly fits the memory to start with.
Back to the basic examples, here is the last (and first) day of the months in your data

dates <- data.frame(date = as.Date("2001-01-01", format = "%Y-%m-%d") + 0:729)
dates
dates <- as.data.table(dates)
dates
special.days <- dates[, list(first.day = min(date), last.day = max(date)), 
                           by = list(month = substr(date, 1, 7))]
special.days
      date
1   2001-01-01
2   2001-01-02
3   2001-01-03
4   2001-01-04
.....
         date
1: 2001-01-01
2: 2001-01-02
3: 2001-01-03
 ---           
726: 2002-12-27
727: 2002-12-28
728: 2002-12-29

      month  first.day   last.day
 1: 2001-01 2001-01-01 2001-01-31
 2: 2001-02 2001-02-01 2001-02-28
 3: 2001-03 2001-03-01 2001-03-31
.....

As you can see the syntax is the same as above – but now we can get the first and last days in a single command! Also note that you don’t have to know up front that you want to use data.table: the as.data.table command allows you to cast a data.frame into a data.table. Finally, notice how data.table creates a summary of the head and the tail of the variable if it’s too long to show.

Advanced Uses

Just like in case of aggregate, you can use anonymous functions to aggregate in data.table as well. Let’s have a look at the example for fitting a Gaussian distribution to observations by categories:

library(MASS)

categories <- data.table(category = c("a", "a", "a", "a", "a", 
                                      "b", "b", "b", "b", "b",
                                      "c", "c", "c", "c"))

observations <- data.table(observation = c(rnorm(5, mean = 3, sd = 0.2),
                                           rnorm(5, mean = -2, sd = 0.4),
                                           rnorm(4, mean = 0, sd = 1)))

data <- cbind(categories, observations)
data
distr.estimate <- data[,
    list(mean = fitdistr(observation, densfun = "normal")$estimate[[1]],
         sd = fitdistr(observation, densfun = "normal")$estimate[[2]]),
    by = list(category)]

distr.estimate
 category observation
 1:        a   2.7446816
 2:        a   2.8853469
 3:        a   2.7550775
.....

   category       mean         sd
1:        a  2.8332705 0.06882552
2:        b -1.9678460 0.37420857
3:        c  0.9233108 0.47680978

or equivalently

distr.estimate <- data[, 
  list(mean = fitdistr(observation, densfun = "normal")$estimate,
       sd = fitdistr(observation, densfun = "normal")$estimate[[2]]),
  by = list(category)]

This example shows some weaknesses of using data.table compared to aggregate, but it also shows that those weaknesses are nicely balanced by the strength of data.table. One such weakness is that by design data.table aggregation requires the variables to be coming from the same data.table, so we had to cbind the two variables. Also, the aggregation in data.table returns only the first variable if the function invoked returns more than variable, hence the equivalence of the two syntaxes showed above. However, as multiple calls can be submitted in the list, this can easily be overcome. Finally note how much simpler the anonymous function construction works: rather than defining the function itself, we can simply pass the relevant variable.

UPDATE 02/12/2015
As kindly noted by Jan Gorecki in the comments (thanks, Jan!), the weakness I mention above can be overcome by using the {} operator for the inut variable j:

distr.estimate <- data[, 
   {est <- fitdistr(observation, 
                    densfun ="normal")$estimate;
    list(mean = est[[1]], 
         sd = est[[2]])}, 
   by = list(category)]

distr.estimate
   category       mean         sd
1:        a  2.8332705 0.06882552
2:        b -1.9678460 0.37420857
3:        c  0.9233108 0.47680978

Notice that as opposed to the anonymous function definition in aggregate, you don’t have to use the return() command, data.table simply returns with the result of the last command.

If you have any question about this post please leave a comment below.

Data FramesData Manipulation