DataScience+ We share R tutorials from scientists at academic and scientific institutions with a goal to give everyone in the world access to a free knowledge. Our tutorials cover different topics including statistics, data manipulation and visualization!
Data Management

Aggregate – A Powerful Tool for Data Frame in R

This post gives a short review of the aggregate function as used for data.frames and presents some interesting uses: from the trivial but handy to the most complicated problems I have solved with aggregate.

Aggregate (data.frame): Technical Overview

Aggregate is a function in base R which can, as the name suggests, aggregate the inputted data.frame d.f by applying a function specified by the FUN parameter to each column of sub-data.frames defined by the by input parameter.

The by parameter has to be a list. However, since data.frame‘s are handled as (named) lists of columns, one or more columns of a data.frame can also be passed as the by parameter. Interestingly, if these columns are of the same data.frame as the one inputted as x, those columns are not passed on to the FUN function.

The function to apply has to be able to accept a vector (since it will be called with parts of a column of a data.frame as input).

The sub-data.frames defined by the by input parameter can be thought of as logical indexing:

d.f <- data.frame(rating = c("AAA", "A", "A", "AAA", 
                             "BB", "BB", "AAA", "A"))
i <- 1
by <- d.f$rating <- d.f[by == unique(by)[i], ]

and do this for every i between 1 and length(unique(by)). Note that the by variable doesn’t have to agree with one (or more) column of the data.frame but could be anything. Hence, one can reproduce the aggregate functionality by a for cycle running the cycle variable over the unique values of the variable passed as by and an sapply applying the function passed as FUN to each column of the data.frame Such a workaround however would be very difficult to document, as it would be unclear what (and why) this code is actually doing.

Aggregate always returns a data.frame as a result. This data.frame will contain the (now unique) values from the input parameter by as the first column and then columns containing the results of the call to the function in the FUN parameter applied to the parts of the columns of the inputted data.frame. It is interesting to note that if the function FUN returns multiple values, the class of the columns of the result data.frame will be list or something a list can be cast to (see the last example below).

It is important to note that the function call is applied to nameless vectors rather than named columns of a data.frame and hence referring to the names of the data.frame will not work, nor will column references such as s.d.f[,1].

Basic Examples

The most basic uses of aggregate involve base functions such as mean and sd. It is indeed one of the most common uses of aggregate to compare the mean or other properties of sample groups.

Recently I reproduced calculations from an Excel sheet. Most formulae were subtotals and grand totals. The Excel sheet was not very comfortably organized for this purpose: sums over rows, columns and totals of those sums were used. In R, I have changed the data to a star schema representation (when all metadata are represented row-wise and every value gets its own row) using reshape2 package and melt then used aggregate along different variables to get the different totals. The less variables you use in by the more aggregated the end-result: the grand total along a dimension is simply using that dimension as “by”, while subtotals can be achieved using multiple variables as by. The FUN in this case was of course sum.

One handy use of aggregate and a base function is getting the number of appearances of the various values:

values <- data.frame(value = c("a", "a", "a", "a", "a", 
                               "b", "b", "b", 
                               "c", "c", "c", "c"))
nr.of.appearances <- aggregate(x = values, 
                               by = list(unique.values = values$value), 
                               FUN = length)

My favourite use of aggregate with a base function is getting the last day of each month in a series of dates. To do so, one can use the following code (assuming your dates are stored in a “YYYY-MM-DD” format as strings or as Date):

dates <- data.frame(date = as.Date("2001-01-01", format = "%Y-%m-%d") + 0:729)
1   2001-01-01
2   2001-01-02
3   2001-01-03
4   2001-01-04
..... <- aggregate(x = dates["date"], 
                      by = list(month = substr(dates$date, 1, 7)), 
                      FUN = max) 
     month       date
1  2001-01 2001-01-31
2  2001-02 2001-02-28
3  2001-03 2001-03-31
4  2001-04 2001-04-30

This came in very handy when working with banking information where the last day of the month depended on banking holidays as well as weekends.

Advanced Uses

More advanced uses of aggregate depend on writing your own function, e.g. anonymous functions passed on as the FUN parameter. To do so, one can use the syntax

# do not run the syntax
aggregate(x = d.f, 
          by = by.list, 
          FUN = function(s.d.f){y <- s.d.f; return(y)}

The possible uses range from calling complex portfolio risk metrics for the homogeneous risk groups of a portfolio via fitting a distribution to categories of samples to anything you can image, really.

Here is an example with a “complex” portfolio risk metric (exposure to different counterparties in different asset classes):

assets <- data.frame(asset.class = c("equity", "equity","equity",
                                     "bond", "bond"),
                     rating = c("AAA", "A", "A", "AAA", 
                                "BB", "BB", "AAA", "A"),
                     counterparty.a = c(runif(3), rnorm(5)),
                     counterparty.b = c(runif(3), rnorm(5)),
                     counterparty.c = c(runif(3), rnorm(5)))
  asset.class rating counterparty.a counterparty.b counterparty.c
1      equity    AAA      0.9026004      0.6029417      0.8629453
2      equity      A      0.8834034      0.5809589      0.4654721
3      equity      A      0.1007586      0.9368537      0.3090811
4      option    AAA     -1.0508915      0.7171532      0.2224984

Here is the use of aggregate() function.

exposures <- aggregate(x = assets[c("counterparty.a", 
                       by = assets[c("asset.class", "rating")],
                       FUN = function(market.values){
                               sum(pmax(market.values, 0))
  asset.class rating counterparty.a counterparty.b counterparty.c
1        bond      A      1.0038714      0.6382029      2.2822936
2      equity      A      0.9841620      1.5178126      0.7745532
3        bond    AAA      0.0000000      0.0000000      0.0000000
4      equity    AAA      0.9026004      0.6029417      0.8629453

Next up: fitting a Gaussian distribution to observations by categories:

categories <- data.frame(category = c("a", "a", "a", "a", "a", 
                                      "b", "b", "b", "b", "b",
                                      "c", "c", "c", "c"))
observations <- data.frame(observation = c(rnorm(5, mean = 3, sd = 0.2),
                                           rnorm(5, mean = -2, sd = 0.4),
                                           rnorm(4, mean = 0, sd = 1)))

Below we use the aggregate() function to find the mean and standard deviation by categories.

distr.estimate <- aggregate(x = observations, 
                            by = categories,
                            FUN = function(observations){
                                              densfun = "normal")$estimate
  category observation.mean
1        a        3.0606926      0.1779962
2        b       -2.1446040      0.1658481
3        c       -0.1881841      0.5613013

This last example showcases several interesting properties. First, the data.frame to aggregate and the list of by variables don’t have to be the same. While this is implied in other places of the post, this is an explicit example of such a setup. Secondly, the function passed as FUN is not only an anonymous function, it is curried from a function with more than one input parameter. A function of a single input variable observations has been created from the two-input variable function fitdistr: fixing one of the input variables by setting densfun = "normal". Thirdly, rather than returning the full return value of the fitdistr function, the return value is restricted to the element estimate from the return value. And last but not least, the return value of the anonymous function passed to FUN consists of two variables and not only one. Interestingly, aggregate casts the return value from list to a matrix and names the elements for us. However, these names can’t be used to reference the columns of the matrix. You can however reference them as follows:

[1] 3.016988

Closing Words

I hope that you have found the above useful. Now that you are more familiar with aggregate, it is time for the truth: everything above and much more can be done with data.table, and with a much faster performance. However, data.table has a complex syntax and one really has to understand how things work under the hood, while aggregate is simple and insightful. Until you are comfortable with both the logic of aggregation and the syntax of data.table, it is a worthy investment to first write the code using aggregate and then optimize it by rewriting it using data.table.

For those of you who are interested, a dedicated post is coming where the above is redone with data.table, along with some additional use cases specific to data.table.

If you have any question leave a comment below.