An online community for showcasing R & Python tutorials. It operates as a networking platform for data scientists to promote their talent and get hired. Our mission is to empower data scientists by bridging the gap between talent and opportunity.

Programming

In Part 1 of this series, we got started by looking at the `ts`

object in R and how it represents time series data. In Part 2, I’ll discuss some of the many time series transformation functions that are available in R. This is by no means an exhaustive catalog. If you feel I left out anything important, please let me know. I compile these posts as a guide in RMarkdown which I plan to make available on the web soon.

Often in time series analysis and modeling, we will want to transform data. There are a number of different functions that can be used to transform time series data such as the difference, log, moving average, percent change, lag, or cumulative sum. These type of function are useful for both visualizing time series data and for modeling time series. For example, the moving average function can be used to more easily visualize a high-variance time series and is also a critical part the ARIMA family of models. Functions such as the difference, percent change, and log difference are helpful for making non-stationary data stationary.

Let’s start with a very common operation. To take a lag we use the `lag()`

function from the `stats`

package. It takes a `ts`

object and a value for the `lag`

argument.

tseries_lag1 <- lag(tseries, 1) head(cbind(tseries, tseries_lag1))tseries tseries_lag1 1999 Q4 NA 7.07667 2000 Q1 7.07667 10.38876 2000 Q2 10.38876 23.91096 2000 Q3 23.91096 14.49356 2000 Q4 14.49356 15.90501 2001 Q1 15.90501 28.00545

Notice that when we do this, `tseries_lag1`

is equivalent to `tseries`

offset by 1, such that the value for 2000q1 in `tseries`

become the value for 1999q4 in `tseries_lag1`

and so on. We can do this for any lag of our choosing. Let’s try 3 instead:

tseries_lag3 <- lag(tseries, 3) head(cbind(tseries, tseries_lag3))tseries tseries_lag3 1999 Q2 NA 7.07667 1999 Q3 NA 10.38876 1999 Q4 NA 23.91096 2000 Q1 7.07667 14.49356 2000 Q2 10.38876 15.90501 2000 Q3 23.91096 28.00545

We can also do this in the opposite direction using a negative value for the `lag`

argument, effectively creating a lead series instead:

tseries_lead1 <- lag(tseries, -1) head(cbind(tseries, tseries_lead1))tseries tseries_lead1 2000 Q1 7.07667 NA 2000 Q2 10.38876 7.07667 2000 Q3 23.91096 10.38876 2000 Q4 14.49356 23.91096 2001 Q1 15.90501 14.49356 2001 Q2 28.00545 15.90501

Notice that `tseries_lead1`

now leads the original `tseries`

by one time period.

We can take differences of a time series as well. This is equivalent to taking the difference between each value and its previous value:

tseries_diff1 <- diff(tseries, lag = 1) tm <- cbind(tseries, tseries_diff1) head(tm)tseries tseries_diff1 [1,] 7.07667 NA [2,] 10.38876 3.312087 [3,] 23.91096 13.522201 [4,] 14.49356 -9.417399 [5,] 15.90501 1.411455 [6,] 28.00545 12.100441

plot.ts(tm)

As you can see, taking a difference is an effective way to remove a trend and make a time series stationary.

The difference can be calculated over more than one lag. To do so, simply alter the value of the `lag`

argument.

tseries_diff2 <- diff(tseries, lag = 2) tm <- cbind(tseries, tseries_diff2) head(tm)tseries tseries_diff2 2000 Q1 7.07667 NA 2000 Q2 10.38876 NA 2000 Q3 23.91096 16.834288 2000 Q4 14.49356 4.104801 2001 Q1 15.90501 -8.005944 2001 Q2 28.00545 13.511896

plot(tm)

Some time series transformation functions are useful for series in which the variance gets larger over time. These range from the basic logarithm function to the Box-Cox group of transformations (of which the natural logarithm is a special case). We can take the log of a time series using the log function in the same way that we would take the log of a vector. Let’s generate a time series that has increasing variance:

trend <- ts(seq(from = 10, to = 110)) cycle <- ts(sin(trend)) * 0.2 * trend tseries_h <- trend + cycle plot.ts(tseries_h)

As we’ll discuss in more detail, the prevalence of increasing, or more generally non-constant, variance is called heteroskedasticity and can cause problems in linear regression. Often, it will need to be corrected before modeling. One way to do this is taking a log:

tseries_log <- log(tseries_h) tm <-cbind(tseries_h, tseries_log) plot.ts(tm)

There are other more advanced ways of eliminating non-constant variance, one of which is the Box-Cox transformation, which allows us a bit more control over the transformation. The Box-Cox takes the form (Hyndman and Athanasopoulos, 2013):

$$w_t = \begin{cases} \log{ y_t }, \text{ if } \lambda = 0; \\ \frac{ ({y_t}^{\lambda} – 1) }{ \lambda },\text{ otherwise } \\ \end{cases} $$

In order to demonstrate the Box-Cox transformation, I’ll introduce Rob Hyndman’s `forecast` package:

library(forecast)

It has two functions that are of use here. The primary function is `BoxCox()`

, which will return a transformed time series given a time series and a value for the parameter `lambda`

:

plot.ts(BoxCox(tseries_h, lambda = 0.5))

Notice that this value of `lambda`

here does not entirely take care of the heteroskedasticity problem. We can experiment with different values of `lambda`

, or we can use the `BoxCox.lambda()`

function, which will provide us an optimal value for parameter `lambda`

:

lambda <- BoxCox.lambda(tseries_h) print(lambda)[1] 0.05527724

plot.ts(BoxCox(tseries_h, lambda = lambda))

The `BoxCox.lambda()`

function has chosen the value 0.055. If we then use this value in our `BoxCox()`

function, it returns a time series that appears to have constant variance.

Another common calculation that we may want to perform on time series is the percent change from one period to another. Because it seems that R does not have a function for performing this calculation, we can do it ourselves.

tseries_pch <- tseries / lag(tseries, -1) - 1 head(cbind(tseries, tseries_pch))tseries tseries_pch 2000 Q1 7.07667 NA 2000 Q2 10.38876 0.46802901 2000 Q3 23.91096 1.30161865 2000 Q4 14.49356 -0.39385287 2001 Q1 15.90501 0.09738501 2001 Q2 28.00545 0.76079409

We can turn this into a function so that we can easily calculate percent change in the future. We’ll add an argument to specify the number of periods over which we want to calculate the change and set it to 1 by default. Additionally, we’ll add some argument verification.

pch <- function(data, lag = 1) { # argument verification if (!is.ts(data)) stop("data must be of type ts") if (!is.numeric(lag)) stop("lag must be of type numeric") # return percent change data / lag(data, -lag) - 1 } head(cbind(tseries, pch(tseries)))tseries pch(tseries) 2000 Q1 7.07667 NA 2000 Q2 10.38876 0.46802901 2000 Q3 23.91096 1.30161865 2000 Q4 14.49356 -0.39385287 2001 Q1 15.90501 0.09738501 2001 Q2 28.00545 0.76079409

Now if we wished to calculate the percentage change over more than one period we can do so. Let’s say we wanted to calculate the year over year percent change. We can call our `pch()`

function with a lag of 4.

tseries_pchya <- pch(tseries, lag=4)

Two of the functions that we have discussed so far, the difference and the log, are often combined in time series analysis. The log difference function is useful for making non-stationary data stationary and has some other useful properties. We can calculate the log difference in R by simply combining the `log()`

and `diff()`

functions.

tseries_dlog <- ts(diff(log(tseries)), start = c(2000, 1), frequency = 4) plot.ts(cbind(tseries, tseries_dlog))

Notice that after taking the log return, `tseries`

appears to be stationary. We see some higher than normal volatility in the beginning of the series. This is due largely to the fact that the series levels start off so small. This leads into a nice property of the log return function, which is that it is a close approximation to the percent change:

plot.ts(cbind(pch(tseries), tseries_dlog))

This similarity is only approximate. The relationship does break down somewhat when the percent change from one period to the next is particularly large. You can read a good discussion of this topic here.

A moving average is another essential function for working with time series. For series with particularly high volatility, a moving average can help us to more clearly visualize its trend. Unfortunately, base R does not (to my knowledge) have a convenient function for calculating the moving average of a time series directly. However, we can use base R’s `filter()`

function, which allows us to perform general linear filtering. We can set up the parameters of this function to be a moving average (Shumway and Stoffer, 2011). Here we apply the `filter()`

function to `tseries`

to create a 5 period moving average. The `filter`

argument lets up specify the filter, which in this case is just a weighted average of 5 observations. The `sides`

argument allows us to specify whether we want to apply the filter over past values (`sides`

= 1), or to both past and future values (`sides`

= 2). Here I set `sides`

to 1 indicating that I want a trailing moving average:

tseries_lf5 <- filter(tseries, filter = rep(1/5, 5), sides = 1) plot.ts(cbind(tseries, tseries_lf5), plot.type = 'single', col = c('black', 'red'))

The fact that we have to define the linear filter each time we use this function makes it a little cumbersome to use. If we don’t mind introducing a dependency to our code, we could use the `SMA()`

function from the `TTR`

package or the `ma()`

function from the `forecast`

package. The `SMA()`

function takes a `ts`

object and a value for `n`

– the window over which we want to calculate the moving average.

library(TTR) tseries_ma5 <- SMA(tseries, n = 5)

The `ma()`

function from the `forecast`

package also performs moving average calculations. We supply the time series and a value for the `order`

argument.

tseries_ma5fore <- ma(tseries, order = 5)

Let’s compare the results. The `SMA()`

function returns a trailing moving average where each value is the mean of the n most recent trailing values. This is equivalent to the results we get from using the `filter()`

function. The `ma()`

function from the `forecast`

package returns a centered moving average. In this case `tseries_ma5for`

is equal to the average of the current observation, the previous two observation, and the next two observations. Which one you use would depend on your application.

tseries tseries_lf5 tseries_ma5 tseries_ma5fore 2000 Q1 7.076670 NA NA NA 2000 Q2 10.388758 NA NA NA 2000 Q3 23.910958 NA NA 14.35499 2000 Q4 14.493559 NA NA 18.54075 2001 Q1 15.905014 14.35499 14.35499 20.50828 2001 Q2 28.005455 18.54075 18.54075 17.55500 2001 Q3 20.226413 20.50828 20.50828 17.49470 2001 Q4 9.144571 17.55500 17.55500 17.68977 2002 Q1 14.192030 17.49470 17.49470 18.00239 2002 Q2 16.880366 17.68977 17.68977 18.86085

That’s it for Part 2. I don’t know about you, but I am sick of working with generated data. I want to work with some real world data. In Part 3 we’ll do just that using Quandl. See you then.

- Hyndman, Rob J. and George Athanasopoulos (2013) Forecasting: Principles and Practice. https://www.otexts.org/fpp

Shumway, Robert H. and David S. Stoffer (2011) Time Series Analysis and Its Applications With R Examples. New York, NY: Springer.