DataScience+ An online community for showcasing R & Python tutorials. It operates as a networking platform for data scientists to promote their talent and get hired. Our mission is to empower data scientists by bridging the gap between talent and opportunity.

How to Deal with Missing Values in R

Datasets are often incomplete. When a piece of information is not available, we call it a missing value. In R, missing values are coded by the symbol NA. To identify missing values in your dataset, use the function is.na().

First, let's create a small dataset:

Name <- c("John", "Tim", NA)
Sex <- c("men", "men", "women")
Age <- c(45, 53, NA)
dt <- data.frame(Name, Sex, Age)

Here is our dataset called dt:

dt
  Name   Sex Age
1 John   men  45
2  Tim   men  53
3 <NA> women  NA

Now let's check for missing values in the dataset:

is.na(dt)
      Name   Sex   Age
[1,] FALSE FALSE FALSE
[2,] FALSE FALSE FALSE
[3,]  TRUE FALSE  TRUE
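
On a larger dataset the full TRUE/FALSE matrix is hard to read, so it can help to ask only whether anything is missing and where. The following is a minimal sketch using base R's any() and which() on the same example data. The variable names has_missing, missing_age_rows and missing_positions are illustrative and not part of the original tutorial.

```r
# Rebuild the example data frame used above
Name <- c("John", "Tim", NA)
Sex <- c("men", "men", "women")
Age <- c(45, 53, NA)
dt <- data.frame(Name, Sex, Age)

# TRUE if the data frame contains at least one NA
has_missing <- any(is.na(dt))

# Row positions of the missing values in a single column
missing_age_rows <- which(is.na(dt$Age))

# Row and column positions of every NA in the data frame
missing_positions <- which(is.na(dt), arr.ind = TRUE)
```

Here missing_positions is a matrix with one line per missing cell, with the row index in the "row" column and the column index in the "col" column.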

You can also find the number and the proportion of missing values in your dataset with the code below:

sum(is.na(dt))
[1] 2
mean(is.na(dt))
[1] 0.2222222
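
The totals above cover the whole data frame. To see which variables are affected, the same idea can be applied column by column. This is a small sketch using base R's colSums() and colMeans(). The names missing_per_column and share_per_column are illustrative.

```r
# Rebuild the example data frame used above
Name <- c("John", "Tim", NA)
Sex <- c("men", "men", "women")
Age <- c(45, 53, NA)
dt <- data.frame(Name, Sex, Age)

# Number of missing values in each column
missing_per_column <- colSums(is.na(dt))

# Proportion of missing values in each column (multiply by 100 for a percentage)
share_per_column <- colMeans(is.na(dt))
```

Both results are named numeric vectors, one element per column of dt.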

When you import a dataset from another statistical application, missing values might be coded with a number, for example 99. To let R know that this number represents a missing value, you need to recode it:

dt$Age[dt$Age == 99] <- NA
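
The example dataset does not actually contain the code 99, so here is a self-contained sketch with a hypothetical data frame called survey in which 99 stands for "age not recorded". It also shows one possible alternative: declaring the code at import time with the na.strings argument of read.csv(). The CSV text is made up for illustration. Note that na.strings applies to every column, so it is only appropriate when 99 is never a real value anywhere in the file.

```r
# Hypothetical data in which 99 means "age not recorded"
survey <- data.frame(
  Name = c("John", "Tim", "Ann"),
  Age = c(45, 99, NA)
)

# which() drops the NA that results from comparing NA == 99,
# so only genuine matches are recoded
survey$Age[which(survey$Age == 99)] <- NA

# Alternative: treat "99" as missing while reading the data
csv_text <- "Name,Age\nJohn,45\nTim,99\nAnn,NA"
imported <- read.csv(text = csv_text, na.strings = c("NA", "99"))
```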

Another useful function in R for dealing with missing values is na.omit(), which deletes incomplete observations, that is, rows containing at least one NA.

Let's look at another example, starting with another small dataset:

Name <- c("John", "Tim", NA)
Sex <- c("men", NA, "women")
Age <- c(45, 53, NA)
dt <- data.frame(Name, Sex, Age)

Here is the dataset, again called dt:

dt
  Name   Sex Age
1 John   men  45
2  Tim  <NA>  53
3 <NA> women  NA

Now we will use na.omit() to remove the rows with missing values:

na.omit(dt)
  Name Sex Age
1 John men  45
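
Because na.omit() drops every row with any NA, it can discard more data than intended, as it does here with two of the three rows. As a hedged sketch of a more selective approach, base R's complete.cases() returns a logical vector that you can inspect before subsetting, and is.na() on a single column lets you drop rows only when that variable is missing. The names complete_rows, dt_complete and dt_age_known are illustrative.

```r
# Rebuild the second example data frame
Name <- c("John", "Tim", NA)
Sex <- c("men", NA, "women")
Age <- c(45, 53, NA)
dt <- data.frame(Name, Sex, Age)

# TRUE for rows without any NA
complete_rows <- complete.cases(dt)

# Keeps the same rows as na.omit(dt)
dt_complete <- dt[complete_rows, ]

# Drop rows only when Age is missing; rows with NA in other columns are kept
dt_age_known <- dt[!is.na(dt$Age), ]
```

With this data, dt_age_known keeps John and Tim, even though Tim's Sex is missing.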

This was an introduction to dealing with missing values. To learn how to impute missing data, please read this post.