Handling missing data with MICE package; a simple approach

By DataScience+ · June 6, 2016 · 1 min read · 92.0K views · 40 comments

This is a quick, short, and concise tutorial on how to impute missing data. Previously, we published an extensive tutorial on imputing missing values with the MICE package. The current tutorial aims to be simple and user-friendly for those who are just starting to use R.

Preparing the dataset

I have created a simulated dataset, which you can load into your R environment by using the following code.

dat <- read.csv(url("https://goo.gl/4DYzru"), header=TRUE, sep=",")

Let’s look at the first rows of the dataset.

head(dat)
##    Age Gender Cholesterol SystolicBP  BMI Smoking Education
## 1 67.9 Female       236.4      129.8 26.4     Yes      High
## 2 54.8 Female       256.3      133.4 28.4      No    Medium
## 3 68.4   Male       198.7      158.5 24.1     Yes      High
## 4 67.9   Male       205.0      136.0 19.9      No       Low
## 5 60.9   Male       207.7      145.4 26.7      No    Medium
## 6 44.9 Female       222.5      130.6 30.6      No       Low

Check the data for missing values.

sapply(dat, function(x) sum(is.na(x)))
##         Age      Gender Cholesterol  SystolicBP         BMI     Smoking 
##           0           0           0           0           0           0 
##   Education 
##           0

Since there are no missing values, I will add some NAs to the dataset. But first I will duplicate the original dataset, so that later we can evaluate the accuracy of the imputation.

original <- dat

Now I will add some missing values to a few variables.

set.seed(10)
dat[sample(1:nrow(dat), 20), "Cholesterol"] <- NA
dat[sample(1:nrow(dat), 20), "Smoking"] <- NA
dat[sample(1:nrow(dat), 20), "Education"] <- NA
dat[sample(1:nrow(dat), 5), "Age"] <- NA
dat[sample(1:nrow(dat), 5), "BMI"] <- NA

Confirm the presence of missing values in the dataset.

sapply(dat, function(x) sum(is.na(x)))
##         Age      Gender Cholesterol  SystolicBP         BMI     Smoking 
##           5           0          20           0           5          20 
##   Education 
##          20
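Counts per column do not show how large the gaps are relative to the data, or how many rows remain fully observed. A minimal base-R sketch, using a small made-up data frame (toy is hypothetical, not the tutorial dataset), of both checks:

```r
# Hypothetical toy data, not the tutorial dataset
toy <- data.frame(
  Age         = c(67.9, NA, 68.4, 60.9),
  Cholesterol = c(236.4, 256.3, NA, NA),
  Smoking     = c("Yes", "No", NA, "No")
)

# Proportion of missing values in each column
miss_share <- colMeans(is.na(toy))
print(round(miss_share, 2))

# Number of rows with no missing values at all
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

With mice loaded, md.pattern(dat) additionally tabulates which variables are missing together.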

The next step is to convert the variables to factors or numeric, as appropriate. For example, smoking and education are categorical variables, whereas cholesterol level is continuous.

library(dplyr) 
dat <- dat %>%
    mutate(
        Smoking = as.factor(Smoking),
        Education = as.factor(Education),
        Cholesterol = as.numeric(Cholesterol)
    )

Let's look at the dataset structure.

str(dat)
## 'data.frame':    250 obs. of  7 variables:
##  $ Age        : num  67.9 54.8 68.4 67.9 60.9 44.9 49.9 NA 57.5 77.2 ...
##  $ Gender     : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 1 2 1 2 2 ...
##  $ Cholesterol: num  236 256 199 205 208 ...
##  $ SystolicBP : num  130 133 158 136 145 ...
##  $ BMI        : num  26.4 28.4 24.1 19.9 26.7 30.6 27.3 27.5 28.3 29.1 ...
##  $ Smoking    : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
##  $ Education  : Factor w/ 3 levels "High","Low","Medium": 1 3 1 NA NA 2 3 2 1 1 ...
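The str() output above shows Gender as a factor. Since R 4.0.0, read.csv() keeps text columns as character by default (stringsAsFactors = FALSE), so on a recent R version every text column may need converting, not only Smoking and Education. A minimal base-R sketch on a made-up data frame (toy is hypothetical):

```r
# Hypothetical toy data with character columns, as read.csv() returns in R >= 4.0
toy <- data.frame(
  Gender  = c("Female", "Male", NA),
  Smoking = c("Yes", NA, "No"),
  BMI     = c(26.4, 28.4, NA),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor; leave other columns untouched
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

str(toy)
```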

Everything looks OK, so let's proceed with the imputation.

Imputation

Now that the dataset is ready for imputation, we will call the mice package. The code below is standard, and you don't need to change anything besides the dataset name.

library(mice)
init = mice(dat, maxit=0) 
meth = init$method
predM = init$predictorMatrix
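The call with maxit=0 is a dry run: it performs no imputation and only returns the default settings so that they can be edited. As an illustration, this is one way to look at those defaults on nhanes2, a small example dataset that ships with mice (used here instead of the tutorial data so that the snippet runs on its own):

```r
library(mice)

# Dry run: set up the defaults without running any iterations
init <- mice(nhanes2, maxit = 0)

# One method per column; "" marks columns that will not be imputed
print(init$method)

# Square 0/1 matrix: row = variable being imputed, column = predictor
print(init$predictorMatrix)
```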

To impute the missing values, the mice package uses an algorithm that draws on information from the other variables in the dataset to predict and impute the missing values. Therefore, you may not want to use certain variables as predictors. For example, an ID variable does not have any predictive value.

The code below removes a variable as a predictor, but the variable itself will still be imputed. Purely for illustration, I exclude BMI as a predictor during imputation.

predM[, c("BMI")]=0
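In the predictor matrix, each row corresponds to a variable being imputed and each column to a potential predictor, which is why the comma in predM[, c("BMI")] matters: it selects a column. A small sketch on the nhanes2 example data that ships with mice (the choice of bmi is arbitrary):

```r
library(mice)

predM <- mice(nhanes2, maxit = 0)$predictorMatrix

# Zero out the bmi COLUMN: bmi is no longer used to predict any other variable
predM[, "bmi"] <- 0

# The bmi ROW is untouched, so bmi itself is still imputed from the others
print(predM)
```

Without the comma, the assignment treats the matrix as a plain vector, and the result is no longer a matrix that mice() will accept.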

If you want to exclude a variable from imputation, use the code below. This variable will still be used for prediction.

meth[c("Age")]=""

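Now let's specify the methods for imputing the missing values. There are specific methods for continuous, binary, and multi-category variables: here "norm" (Bayesian linear regression) for cholesterol, "logreg" (logistic regression) for smoking, and "polyreg" (polytomous regression for unordered factors) for education. An ordered factor would use "polr" instead. I set a different method for each variable; you can also assign one method to several variables at once.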

meth[c("Cholesterol")]="norm" 
meth[c("Smoking")]="logreg" 
meth[c("Education")]="polyreg"
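One pitfall with this style of assignment: if a name is not already present in meth, R silently appends a new element instead of raising an error, and mice() then reports that the length of method does not match the data. A minimal base-R sketch with a made-up method vector (the names and the misspelling are hypothetical):

```r
# Hypothetical method vector, one entry per column of the data
meth <- c(Age = "", Cholesterol = "pmm", Smoking = "pmm")

# Misspelled name ("Smokng"): R appends a 4th element rather than failing
bad <- meth
bad["Smokng"] <- "logreg"
print(length(bad))   # now longer than the number of columns

# Safer: check that the names exist first, then assign
vars <- c("Smoking")
stopifnot(all(vars %in% names(meth)))
meth[vars] <- "logreg"
print(meth)
```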

Now it is time to run the multiple (m=5) imputation.

set.seed(103)
imputed = mice(dat, method=meth, predictorMatrix=predM, m=5)
##  iter imp variable
##   1   1  Cholesterol  BMI  Smoking  Education
##   1   2  Cholesterol  BMI  Smoking  Education
##   1   3  Cholesterol  BMI  Smoking  Education
##   1   4  Cholesterol  BMI  Smoking  Education
##   1   5  Cholesterol  BMI  Smoking  Education
##   2   1  Cholesterol  BMI  Smoking  Education
##   2   2  Cholesterol  BMI  Smoking  Education
...

Create a completed dataset from the imputation. With its default settings, complete() returns the first of the five imputed datasets.

imputed <- complete(imputed)
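To make use of all m imputed datasets rather than a single one, mice also offers stacked output and pooled analyses. A hedged sketch on the nhanes2 example data that ships with mice (the regression model is an arbitrary illustration, and imp is a name chosen here so that the mice object is not overwritten):

```r
library(mice)

imp <- mice(nhanes2, m = 5, seed = 103, printFlag = FALSE)

# A single completed dataset (here the second of the five)
second <- complete(imp, action = 2)

# All five completed datasets stacked, with .imp and .id bookkeeping columns
long <- complete(imp, action = "long")

# Fit the same model on each completed dataset and pool the estimates
fit    <- with(imp, lm(chl ~ bmi))
pooled <- pool(fit)
print(summary(pooled))
```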

Check for missing values in the imputed dataset. As expected, only Age still has missing values, because we excluded it from imputation above.

sapply(imputed, function(x) sum(is.na(x)))
##         Age      Gender Cholesterol  SystolicBP         BMI     Smoking 
##           5           0           0           0           0           0 
##   Education 
##           0

Accuracy

In this example, we know the actual values of the missing data, since I added the missing values myself. This means we can check the accuracy of the imputation. However, we should acknowledge that this is a simulated dataset, and therefore the variables have no scientific meaning and are not correlated with each other. For that reason, I expect a lower accuracy for this imputation.

# Cholesterol
actual <- original$Cholesterol[is.na(dat$Cholesterol)]
predicted <- imputed$Cholesterol[is.na(dat$Cholesterol)]
mean(actual)
mean(predicted)
# Smoking
actual <- original$Smoking[is.na(dat$Smoking)]
predicted <- imputed$Smoking[is.na(dat$Smoking)]
table(actual)
table(predicted)
## [1] 231.07
## [1] 231.3564
## actual
##  No Yes
##  11   9
## predicted
##  No Yes
##  14   6

The actual and predicted cholesterol means are almost identical (231.07 vs. 231.36), so the imputation reproduces the average well. For smoking, the predicted counts (14 No, 6 Yes) differ noticeably from the actual ones (11 No, 9 Yes), so the agreement is weaker.
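Means and marginal counts describe the overall distribution only. To assess individual imputed values, one option is the root mean squared error for continuous variables and a cross-tabulation for categorical ones. A minimal sketch with made-up vectors (all values are hypothetical, not results from the tutorial data):

```r
# Hypothetical true and imputed values for a continuous variable
actual_num    <- c(236.4, 256.3, 198.7, 205.0)
predicted_num <- c(230.0, 250.0, 210.0, 200.0)

# Root mean squared error: typical size of the individual errors
rmse <- sqrt(mean((actual_num - predicted_num)^2))
print(rmse)

# Hypothetical true and imputed values for a categorical variable
actual_cat    <- factor(c("No", "Yes", "No", "Yes", "No"), levels = c("No", "Yes"))
predicted_cat <- factor(c("No", "No",  "No", "Yes", "Yes"), levels = c("No", "Yes"))

# Confusion table and share of exactly matching values
print(table(actual = actual_cat, predicted = predicted_cat))
accuracy <- mean(actual_cat == predicted_cat)
print(accuracy)
```

In the tutorial's own objects, the same calculations would be applied to the actual and predicted vectors defined above.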

That's it — I hope you find this tutorial useful. If you have any questions, feel free to comment below.


40 Comments

Dann G January 21, 2020

Can I use the models to impute missing data in another dataset? For example, if I have a dataset that includes the same variables but with missing data, how can I do that?

Kshitij Manvelikar December 18, 2019

Hi, so if I have categorical variables and use mice on the dataset, will that be fine? Or should I remove the categorical columns first and then use MICE on the dataset?

Stewart Chang September 18, 2019

Thanks for this post. In the last code snippet, I think “table(actuals)” should be “table(actual)”.

qingye yuan July 2, 2019

This one could be convenient at times

Sultan Bhai June 13, 2019

Hi

I was following exactly the same code as you mentioned and received the following error: "Error: Length of method differs from number of blocks"
when executing the lines of code below:

set.seed(103)
imputed = mice(dat, method=meth, predictorMatrix=predM, m=5)

how do i resolve this
1. K
  
  Klodian June 13, 2019
  
  It looks like the number of variables you want to impute does not correspond to those you defined. You may want to keep only the variables you plan to impute in the dataset and see whether that gets rid of the error.

Tahira Devji April 7, 2018

Hi there, I am having issues when trying to run this. See below.

> #Run the multiple (m=5) imputation
> library(mice)
> init = mice(hearttrans.dat, maxit=0)
> meth = init$method
> predM = init$predictorMatrix
>
> meth[c("donor cod")]="logreg"
> imputed = mice(hearttrans.dat,method=meth,predictorMatrix=predM, m=5)
Error in check.method(setup, data) :
The length of method (13) does not match the number of columns in the data (12).

I have not shown here but in addition to specifying the method, I would also like to remove / skip some variables. When I tried to do this I got a similar error. Could you please help?
1. K
  
  Klodian April 7, 2018
  
  Double check the number of variables you want to impute and exclude from imputation.
  1. TD
    
    Tahira Devji December 7, 2018
    
    #These are all my variables
    dheight<-hearttrans.dat$donor.height
    dweight<-hearttrans.dat$donor.weight
    dage<-hearttrans.dat$donor.age
    dsex<-as.factor(hearttrans.dat$donor.sex)
    donorcod<-as.character(hearttrans.dat$donor.cod)
    ischtime<-hearttrans.dat$ischemic.time
    rage<-hearttrans.dat$recip.age
    rsex<-as.factor(hearttrans.dat$recip.sex)
    rheight<-hearttrans.dat$recip.height
    rweight<-hearttrans.dat$recip.weight
    survt<-hearttrans.dat$survt.months
    status<-hearttrans.dat$recip.death
    sexmismatch<-as.factor(hearttrans.dat$sex.mismatch)
    diffpHMLVM<-hearttrans.dat$diff.pHM.LVM
    diffpHMRVM predM[,c(“sexmismatch”,”diffpHMLVM”,”diffpHMRVM”)]=0
    Error in `[ meth[c(“dheight”,”dweight”,”dage”,”ischtime”,”rage”,”rweight”,”rheight”)]=”norm”
    > meth[c(“dsex”,”rsex”,”donorcod”)]=”logreg”
    > imputed = mice(hearttrans.dat, method=meth, predictorMatrix=predM, m=5)
    Error in check.method(setup, data) :
    The length of method (25) does not match the number of columns in the data (15).
    
    I have been at this for a full day… Could you please help?

Sanjay Tamrakar December 25, 2017

I copied and pasted the same code, but when I looked at the output it still shows that one value is missing for Education.

sapply(imputed, function(x) sum(is.na(x)))
Age Gender Cholesterol SystolicBP BMI Smoking Education
5 0 0 0 0 0 1
1. S
  
  Sunny December 14, 2018
  
  same issue here mate – anybody else?
  1. J
    
    Jackie December 16, 2018
    
    Same here. odd

Vishnu Vardhan September 12, 2017

I have got the following error.

Error in inherits(x, “mids”) : object ‘imputed’ not found.
Please help me in sorting out this.

Alice Thomson July 28, 2017

Hi @datascienceplus:disqus thank you very much for the useful tutorial. I’m hoping to use this method for an online survey I am doing as part of my PhD research, the survey is comprised of several different questionnaires which each have subscales and items relating to those. I have large amounts of missing data through survey dropout and I don’t want to impute this data e.g. if a person misses several questionnaires. I would like to impute missing items at a subscale level though, and only use the other items on that same subscale for the item imputation. I would need to repeat this several times for all the different subscales. Do you know if this is possible using the MICE package? It’s not possible on SPSS which is what I usually use so I am considering learning R so that I can do the multiple imputations. Any advice you could give me would be hugely appreciated!
1. K
  
  Klodian July 28, 2017
  
  Can you create different datasets based on given conditions? Then you can do imputation on each dataset independently.
  1. AT
    
    Alice Thomson July 30, 2017
    
    Thanks for the reply @datascienceplus:disqus I could try that – it would be a lot of datasets because I have several subscales. Do you know if it would be straightforward to merge the datasets with the multiple imputations? Sorry if it’s a silly question – I am very new to R. Thanks, Alice.
    1. K
      
      Klodian July 30, 2017
      
      It is not entirely clear to me what exactly you want to accomplish; you may want to write a loop to create and merge the datasets automatically.
      
      The MICE package allows you to choose which variables to impute, which to use as predictors without imputing them, and other combinations as well.

Petre Grigoraș June 6, 2017

I find MICE to be fairly okay, but with some severe drawbacks. Like the fact that you are not able to obtain correlational estimates for the variables within the pooled data. or you can’t do factor analysis, etc. LM, Logreg and two other functions and thats about it.

lucazav April 8, 2017

Sorry @datascienceplus:disqus ,
it seems I had the wrong data in memory… Now it seems working.
Thank you

lucazav April 8, 2017

Hi @datascienceplus:disqus ,

thank you for your tutorial.
When I exclude some columns from the meth vector in this way:

meth[c("col1", "col2")] = ""

and then execute the mice command, I get this error:

Error in check.method(setup, data) :
The length of method (xx) does not match the number of columns in the data (yy).

How can I solve that?

Thanks.
1. L
  
  lucazav April 8, 2017
  
  Sorry @datascienceplus:disqus ,
  it seems I had some wrong data in memory in my session. Now it’s working.
  Thank you again.

Vasileios February 27, 2017

Great tutorial, however when I try to create a dataset after the imputation, I get this error Error in file(file, “rt”) : invalid ‘description’ argument every time I call the complete() function. Any idea why this is happening?
1. K
  
  Klodian February 27, 2017
  
  Check whether your variables are correctly defined before imputation.
  1. V
    
    Vasileios February 27, 2017
    
    I defined them exactly as it is described here but I still get this same error…
    1. K
      
      Klodian February 27, 2017
      
      Is this your dataset or the example data above? If it is your dataset, try imputing only a few variables and check again whether it works. I have never had such an issue with imputation.
      1. V
        
        Vasileios February 27, 2017
        
        I am using the exact same dataset. I found out why it was not working. For some reason I had another function called “complete” in my environment, and I deleted it and then it worked! Thanks anyway!

datacrazy December 29, 2016

Thanks a lot for sharing your insight on the topic. I am working on a dataset with 14 explanatory variables and 126 observations. This has been of great help! Although I had a slight problem in using the meth matrix version and since all of my variables are continuous, I found that simply writing method=”pmm” in the code solved my problem.

Ashok Harnal November 12, 2016

Excellent tutorial

Raghav November 10, 2016

How do I use imputation for the count data? Using the above mentioned techniques, i get negative values. But I should not have them. Only positive values are needed. thanks!
1. K
  
  Klodian November 10, 2016
  
  Probably your data are not normally distributed? In that case, you should log() transform the data.
  1. R
    
    Raghav November 11, 2016
    
    Oh great! Thank you. I will try that.
  2. AS
    
    Anna Søndergaard March 12, 2018
    
    How exactly would you log() transform an entire data set?
    Would you only log transform the variables that become negative when doing imputation?
    1. K
      
      Klodian March 13, 2018
      
      Only variables that are not normally distributed. I think it is possible with MICE to set the minimum value that the imputation should return. Check the MICE package documentation for more information in this regard.

expecto June 11, 2016

Thanks for showing this package, I never thought there might be a package to deal with missing data.

DK Addyson June 7, 2016

what are strategies you can use for imputation methods to check quality/”accuracy”? when you don’t have the true values there as a reference?

fredycar June 7, 2016

When run imputed, I obtain this error:

Error in check.predictorMatrix(setup) :
Argument ‘predictorMatrix’ not a matrix.

What happend….Help me please…It’s a nice example…
1. G
  
  gimmesilver June 7, 2016
  
  please try this code 'predM[, c("BMI")]=0' instead of 'predM[c("BMI")]=0'
  1. K
    
    Klodian June 7, 2016
    
    Correct. Thanks.
    1. TD
      
      Tahira Devji April 7, 2018
      
      predM[,c("sexmismatch","diffpHMLVM","diffpHMRVM")]=0
      
      When I enter that I get this error…
      
      Error in `[<-`(`*tmp*`, , c("sexmismatch", "diffpHMLVM", "diffpHMRVM"), :
      subscript out of bounds
      
      Should it not be: predM[c("sexmismatch","diffpHMLVM","diffpHMRVM")]=0
      
      ??
  2. F
    
    fredycar June 9, 2016
    
    Nice…the code works perfectly….Thanks a lot again….
