Imputation is a complex process that requires a good knowledge of your data. For example, it is crucial to know whether the missing is at random or not before you impute the data. I have read a nice tutorial which visualize the missing data and help to understand the type of missing, and another post showing how to impute the data with MICE package.

In this short post, I will focus on management of the missing data using the tidyverse package. Specifically, I will show how to manage missings in the long data format (i.e., more than one observation for id).

Load package and create a dataset

For this example, I will create a small data frame with 3 ids and 4 observations per id. The dataset has three variables including id, quarter, and spending. The spending variable has missings values. The first id has no missing, the second id has 1 missing, and the third id has 2 missings.

library(tidyverse)
dat = data.frame(
  id = c(1,1,1,1, 2,2,2,2, 3,3,3,3),
  quarter = rep(c("Jan", "Apr", "Jul", "Oct"), times=3),
  spending = c(22,35,10,64, 55,23,NA,10, 42,NA,NA,18)
  )
dat
##    id quarter spending
## 1   1     Jan       22
## 2   1     Apr       35
## 3   1     Jul       10
## 4   1     Oct       64
## 5   2     Jan       55
## 6   2     Apr       23
## 7   2     Jul       NA
## 8   2     Oct       10
## 9   3     Jan       42
## 10  3     Apr       NA
## 11  3     Jul       NA
## 12  3     Oct       18

Find the missings by group:

dat %>%
  group_by(id) %>% 
  summarise(missing=sum(is.na(spending)))
## # A tibble: 3 x 2
##      id missing
##   <dbl>   <int>
## 1     1       0
## 2     2       1
## 3     3       2

Management of missing values

The missing values are allocated on different IDs and observations. Therefore, first I will exclude all missings and keep only id/observation with valid data. Second, I will replace the missings by substituting with the average of the group ID, and third, replace missing by filling with valid data within the group id.

Removing all the missing values

To remove all the missing values will use the na.omit function.

na.omit(dat)
##    id quarter spending
## 1   1     Jan       22
## 2   1     Apr       35
## 3   1     Jul       10
## 4   1     Oct       64
## 5   2     Jan       55
## 6   2     Apr       23
## 8   2     Oct       10
## 9   3     Jan       42
## 12  3     Oct       18

Replace missing values with the average of the group

I think this is a most common method to replace missings values within the group. I would advise checking the distribution of data before deciding to replace missings with mean or median. In this example, I will use mean.

dat %>% 
  group_by(id) %>% 
  mutate(spending_mean = ifelse(is.na(spending), mean(spending, na.rm=T), spending))
## # A tibble: 12 x 4
## # Groups:   id [3]
##       id quarter spending spending_mean
##    <dbl> <fct>      <dbl>         <dbl>
##  1     1 Jan           22          22  
##  2     1 Apr           35          35  
##  3     1 Jul           10          10  
##  4     1 Oct           64          64  
##  5     2 Jan           55          55  
##  6     2 Apr           23          23  
##  7     2 Jul           NA          29.3
##  8     2 Oct           10          10  
##  9     3 Jan           42          42  
## 10     3 Apr           NA          30  
## 11     3 Jul           NA          30  
## 12     3 Oct           18          18

Filling the missings with other values within the group

This approach is useful when you need to replace the missing with the former or next value within the group. Filling of missing with other values can be in two directions, up and down. See the code below to distinguish between up and down.

dat_fill_down = dat %>% 
  group_by(id) %>% 
  fill(spending, .direction = c("down"))
dat_fill_up = dat %>% 
  group_by(id) %>% 
  fill(spending, .direction = c("up"))

Now, lets see the differences. When the direction is down the missing is replaced with the previous value within id. Direction up the replacement of missing will be with the next available value.

dat
##    id quarter spending
## 1   1     Jan       22
## 2   1     Apr       35
## 3   1     Jul       10
## 4   1     Oct       64
## 5   2     Jan       55
## 6   2     Apr       23
## 7   2     Jul       NA
## 8   2     Oct       10
## 9   3     Jan       42
## 10  3     Apr       NA
## 11  3     Jul       NA
## 12  3     Oct       18
dat_fill_down
## # A tibble: 12 x 3
## # Groups:   id [3]
##       id quarter spending
##    <dbl> <fct>      <dbl>
##  1     1 Jan           22
##  2     1 Apr           35
##  3     1 Jul           10
##  4     1 Oct           64
##  5     2 Jan           55
##  6     2 Apr           23
##  7     2 Jul           23
##  8     2 Oct           10
##  9     3 Jan           42
## 10     3 Apr           42
## 11     3 Jul           42
## 12     3 Oct           18
dat_fill_up
## # A tibble: 12 x 3
## # Groups:   id [3]
##       id quarter spending
##    <dbl> <fct>      <dbl>
##  1     1 Jan           22
##  2     1 Apr           35
##  3     1 Jul           10
##  4     1 Oct           64
##  5     2 Jan           55
##  6     2 Apr           23
##  7     2 Jul           10
##  8     2 Oct           10
##  9     3 Jan           42
## 10     3 Apr           18
## 11     3 Jul           18
## 12     3 Oct           18

In this short post I showed a few tips and tricks how to manage missing in the longitudinal data with tidyverse package.