Efficient aggregation (and more) using data.table

In my recent post I wrote about the aggregate function in base R and gave some examples of its use. This post repeats the same examples using data.table, one of the most efficient implementations of aggregation logic in R, and adds some further use cases that show the power of the data.table package.

This post focuses on the aggregation aspect of data.table and only touches upon the other uses of this versatile tool. For a great resource on everything data.table, head to the authors’ own free training material.

All code snippets below require the data.table package to be installed and loaded:

install.packages("data.table")
library(data.table)

Basic examples

Here is the example for the number of appearances of the unique values in the data:

values <- data.table(value = c("a", "a", "a", "a", "a", 
                               "b", "b", "b", 
                               "c", "c", "c", "c"))
values
nr.of.appearances <- values[, list(nr.appearances=length(value)), 
                                by = list(unique.values = value)]
nr.of.appearances
    value
 1:     a
 2:     a
 3:     a
.....

   unique.values nr.appearances
1:             a              5
2:             b              3
3:             c              4

You can notice a lot of differences here. First of all, no additional function was invoked. Instead, the [] operator has been overloaded for the data.table class, allowing for a different signature: it takes three inputs instead of the usual two for a data.frame. We will return to this in a moment. Secondly, the columns of the data.table were not referenced by their names as strings, but as variables instead. This is a very important aspect of the data.table syntax. Last but not least, as implied by the fact that both the aggregating function and the grouping variable are passed as lists, you can not only group by multiple variables, as in aggregate, but also use multiple aggregation functions at the same time. I will show an example of that later.

Coming back to the overloading of the [] operator: a data.table is at the same time also a data.frame. See e.g.

class(values)
"data.table" "data.frame"

This means that you can use all (or at least most of) the data.frame functionality as well. Among other things, you can use aggregate on a data.table just as you would on a data.frame:

values <- data.table(value = c("a", "a", "a", "a", "a", 
                               "b", "b", "b", 
                               "c", "c", "c", "c"))

nr.of.appearances <- aggregate(x = values, 
                               by = list(unique.values = values$value), 
                               FUN = length)

EDIT (02/12/2015): Matt Dowle from the data.table team suggested in the comments a more efficient implementation of the counting example at the top of this section, using .N instead of length(value) (thanks, Matt!):

nr.of.appearances <- values[, list(nr.appearances=.N), 
                            by = list(unique.values = value)]
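As a quick check of this suggestion, the sketch below computes the counts both ways and compares the results. The names dt, by.length and by.N are chosen here for illustration only; .N is data.table's built-in symbol for the number of rows in the current group.

```r
library(data.table)

# Illustrative data: the same values as in the example above
dt <- data.table(value = c(rep("a", 5), rep("b", 3), rep("c", 4)))

# Count rows per group by measuring the length of a column ...
by.length <- dt[, list(nr.appearances = length(value)),
                by = list(unique.values = value)]

# ... and by using the built-in .N symbol, which needs no column name
by.N <- dt[, list(nr.appearances = .N),
           by = list(unique.values = value)]

# Compare the two result tables
all.equal(by.length, by.N)
```

Both versions should return the same table; the .N version simply avoids having to pick a column to measure.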

You can also use the [] operator in the classic data.frame way by passing only two arguments:

values[values$value == "a",]
   value
1:     a
2:     a
3:     a
...

UPDATE 02/12/2015
Matt Dowle from the data.table team warned in the comments against this way of filtering a data.table and suggested an alternative (thanks, Matt!):

values[value=="a",]
   value
1:     a
2:     a
3:     a
...
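To see that the two filtering styles select the same rows, here is a minimal sketch. The names dt, classic and idiomatic are illustrative and not part of the original examples.

```r
library(data.table)

dt <- data.table(value = c(rep("a", 5), rep("b", 3), rep("c", 4)))

# data.frame style: the table name is repeated inside the brackets
classic <- dt[dt$value == "a", ]

# data.table style: column names are looked up inside the table itself
idiomatic <- dt[value == "a"]

# Compare the two results
all.equal(classic, idiomatic)
```

The second form is shorter and cannot accidentally refer to a different table than the one being filtered.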

Another exciting possibility with data.table is creating a new column in a data.table derived from existing columns – with or without aggregation. Examples of both are shown below:

values[, new.col := paste0(value, value)]
values
values[, new.col := paste0(value, length(value)), by = list(unique.values = value)]
values
    value new.col
 1:     a      aa
 2:     a      aa
 3:     a      aa
 4:     a      aa
.....

    value new.col
 1:     a      a5
 2:     a      a5
 3:     a      a5
 4:     a      a5
.....

Notice that in both cases the data.table was modified directly, rather than left unchanged with the results returned. That’s right: data.table creates side effects by modifying objects by reference rather than copying them, as (almost) everything else in R does. It is arguable whether this is alien to the nature of a (more or less) functional language like R, but one thing is sure: it is extremely efficient, especially when the variable barely fits into memory to start with.
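The practical consequence of modifying by reference can be sketched as follows. The table and variable names are made up for illustration; copy() is the data.table function for making an independent copy.

```r
library(data.table)

original <- data.table(value = c("a", "b", "c"))

alias <- original              # no copy: both names refer to the same table
independent <- copy(original)  # copy() creates a separate table

# := adds the column in place, so the change is visible through the alias too
original[, new.col := paste0(value, value)]

names(alias)        # column names as seen through the alias
names(independent)  # column names of the explicit copy
```

If you need to keep the original table untouched, take a copy() before using :=.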
Back to the basic examples, here are the first and last days of each month in your data:

dates <- data.frame(date = as.Date("2001-01-01", format = "%Y-%m-%d") + 0:729)
dates
dates <- as.data.table(dates)
dates
special.days <- dates[, list(first.day = min(date), last.day = max(date)), 
                           by = list(month = substr(date, 1, 7))]
special.days
      date
1   2001-01-01
2   2001-01-02
3   2001-01-03
4   2001-01-04
.....
         date
1: 2001-01-01
2: 2001-01-02
3: 2001-01-03
 ---           
728: 2002-12-29
729: 2002-12-30
730: 2002-12-31

      month  first.day   last.day
 1: 2001-01 2001-01-01 2001-01-31
 2: 2001-02 2001-02-01 2001-02-28
 3: 2001-03 2001-03-01 2001-03-31
.....

As you can see the syntax is the same as above – but now we can get the first and last days in a single command! Also note that you don’t have to know up front that you want to use data.table: the as.data.table command allows you to cast a data.frame into a data.table. Finally, notice how data.table creates a summary of the head and the tail of the variable if it’s too long to show.
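Grouping by more than one variable works the same way: every entry in the by list becomes a grouping column. The sketch below is one possible variant of the example above, assuming the year() and month() helpers exported by data.table; the nr.days column is added here only for illustration.

```r
library(data.table)

dates <- data.table(date = as.Date("2001-01-01") + 0:729)

# Two grouping variables and three aggregations in a single call
special.days <- dates[, list(first.day = min(date),
                             last.day  = max(date),
                             nr.days   = .N),
                      by = list(year = year(date), month = month(date))]

special.days
```

Each combination of year and month that occurs in the data yields one row of the result.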

Advanced Uses

Just as with aggregate, you can use anonymous functions to aggregate in data.table as well. Let’s have a look at the example for fitting a Gaussian distribution to observations by categories:

library(MASS)

categories <- data.table(category = c("a", "a", "a", "a", "a", 
                                      "b", "b", "b", "b", "b",
                                      "c", "c", "c", "c"))

observations <- data.table(observation = c(rnorm(5, mean = 3, sd = 0.2),
                                           rnorm(5, mean = -2, sd = 0.4),
                                           rnorm(4, mean = 0, sd = 1)))

data <- cbind(categories, observations)
data
distr.estimate <- data[,
    list(mean = fitdistr(observation, densfun = "normal")$estimate[[1]],
         sd = fitdistr(observation, densfun = "normal")$estimate[[2]]),
    by = list(category)]

distr.estimate
 category observation
 1:        a   2.7446816
 2:        a   2.8853469
 3:        a   2.7550775
.....

   category       mean         sd
1:        a  2.8332705 0.06882552
2:        b -1.9678460 0.37420857
3:        c  0.9233108 0.47680978

Note that the [[1]] index on the first element matters. Without it, as in

distr.estimate <- data[, 
  list(mean = fitdistr(observation, densfun = "normal")$estimate,
       sd = fitdistr(observation, densfun = "normal")$estimate[[2]]),
  by = list(category)]

the mean column receives both estimates (the mean and the standard deviation), the single sd value is recycled to match, and every category ends up with two rows instead of one.

This example shows some weaknesses of using data.table compared to aggregate, but it also shows that those weaknesses are nicely balanced by the strengths of data.table. One such weakness is that, by design, data.table aggregation requires the variables to come from the same data.table, so we had to cbind the two variables. Also, each element of the list in j becomes one column, so a function that returns more than one value has to be indexed to pick out the value that belongs in that column. However, as multiple calls can be submitted in the list, this can easily be overcome. Finally, note how much simpler the anonymous function construction is: rather than defining the function itself, we can simply pass an expression on the relevant variable.

UPDATE 02/12/2015
As kindly noted by Jan Gorecki in the comments (thanks, Jan!), the weakness I mention above can be overcome by using the {} operator for the input variable j:

distr.estimate <- data[, 
   {est <- fitdistr(observation, 
                    densfun ="normal")$estimate;
    list(mean = est[[1]], 
         sd = est[[2]])}, 
   by = list(category)]

distr.estimate
   category       mean         sd
1:        a  2.8332705 0.06882552
2:        b -1.9678460 0.37420857
3:        c  0.9233108 0.47680978

Notice that, as opposed to the anonymous function definition in aggregate, you don’t have to use the return() command: data.table simply returns the result of the last expression inside the braces.
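The same pattern works for any function that returns several values at once. Here is a minimal sketch that does not depend on MASS: the helper estimate() is hypothetical and stands in for fitdistr(), using the sample mean() and sd() instead of a maximum likelihood fit, so the numbers are not comparable with the output above.

```r
library(data.table)

set.seed(1)  # fixed seed so the illustrative data is reproducible
data <- data.table(category = rep(c("a", "b"), each = 5),
                   observation = rnorm(10))

# Hypothetical helper standing in for fitdistr(): returns two named values
estimate <- function(x) c(mean = mean(x), sd = sd(x))

estimates.by.category <- data[, {
    est <- estimate(observation)   # called only once per group
    list(mean = est[["mean"]],     # the last expression is the group's result
         sd   = est[["sd"]])
}, by = list(category)]

estimates.by.category
```

Because the intermediate result est is reused, the function is evaluated once per group instead of once per output column.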

If you have any questions about this post, please leave a comment below.

4 Comments

Matt Dowle November 30, 2015

+1 Nice article.

Glad Jan already highlighted passing an anonymous body to j. More info on that is in FAQ 2.8 here : https://github.com/Rdatatable/data.table/wiki/vignettes/datatable-faq.pdf

A few other minor points …

We tend to use .N rather than length(value). This i) saves having to pick a column name to pass to length() and ii) is more efficient since that column doesn’t then need to be materialized by data.table internally.

We don’t write values[values$value == "a",] but rather just values[value=="a",]. The reasons to avoid the risk of symbol repetition are discussed here: http://stackoverflow.com/a/10758086/403310. [Btw, that question is the 2nd highest voted R question on Stack Overflow, out of 115k.]

.() is borrowed from Hadley as an alias for list() so you’ll often see .() in examples elsewhere.
  
  David Kun December 2, 2015
  
  Thanks, Matt, great comment. I will discuss with the DSP guys if I can update the post to incorporate this.

jangorecki November 28, 2015

Nice article. Re `$estimate[[]]`- you can use `{` function call to `j`
`[.data.table` argument, and reuse elements from the single `fitdistr` call results to build a list that will be used in `j` as last expression in `{` function call.
  
  David Kun November 29, 2015
  
  Thanks, @jangorecki, great tip. Just for the audience, in code it looks as follows:
  
  distr.estimate <-
    data[,
         {est <- fitdistr(observation,
                          densfun ="normal")$estimate;
          list(mean = est[[1]],
               sd = est[[2]])
         },
         by = list(category)]
  
  Indeed this is much closer to anonymous function application and in case of complex transformations makes life a lot easier.
  
  Thanks again
