Recently I came across a machine learning algorithm called 'k-nearest neighbors' or 'kNN,' which is used as a predictive modeling tool. This algorithm uses data to build a model and then uses that model to predict the outcome.

kNN is new for me, and I gained most of my knowledge by reading these two tutorials; tutorial 1 and tutorial 2.

I will apply the kNN algorithm in the NHANES data set to predict diabetes.

Load 'tidyverse,' 'class,' and 'NHANES' packages.

library(tidyverse)
library(RNHANES)
library(class)

Import the dataset

DEMO_F = nhanes_load_data("DEMO_F", "2009-2010") %>%
  select(SEQN, RIDAGEYR)
BMX_F = nhanes_load_data("BMX_F", "2009-2010") %>% 
  select(SEQN, BMXBMI, BMXWT)
HDL_F =  nhanes_load_data("HDL_F", "2009-2010") %>% 
  select(SEQN, LBDHDD)
GLU_F = nhanes_load_data("GLU_F", "2009-2010") %>% 
  select(SEQN, LBXGLU, LBXIN)
DIQ_F = nhanes_load_data("DIQ_F", "2009-2010") %>% 
  select(SEQN, DIQ010)

Merge the data in one dataset and remove missing values.

dtx = left_join(DEMO_F, HDL_F) %>% 
  left_join(GLU_F) %>% 
  left_join(BMX_F) %>% 
  left_join(DIQ_F)

dat = dtx %>% 
  filter(!is.na(BMXBMI), !is.na(LBDHDD), !is.na(LBXGLU), !is.na(LBXIN),RIDAGEYR >= 40, DIQ010 %in% c(1, 2)) %>% 
  transmute(SEQN, Age = RIDAGEYR, BMI = BMXBMI, Cholest = LBDHDD, Glucose = LBXGLU, Insuline = LBXIN, Weight = BMXWT, Diabetes = DIQ010) %>% 
  mutate(Diabetes = recode_factor(Diabetes, 
                           `1` = "Yes", 
                           `2` = "No"))

Explore data

Now that I have the dataset, I will evaluate the correlation between two variables (glucose and weight) using the 'ggplot2' package.

ggplot(dat, aes(Glucose, Weight, color = Diabetes)) +
  geom_point(alpha = 0.7, size = 2) 

From the graph, it is evident that the higher the level of blood glucose more diabetes events we have.

Another relationship I want to assess is Blood glucose and BMI:

ggplot(dat, aes(Glucose, BMI, color = Diabetes)) +
  geom_point(alpha = 0.7, size = 2)

Prepare the data set

To proceed with preparing the data, I would like to get more information regarding all variables, and I use 'summary' function:

summary(dat)
##       SEQN            Age             BMI           Cholest          Glucose   
##  Min.   :51645   Min.   :40.00   Min.   :14.59   Min.   : 19.00   Min.   : 63  
##  1st Qu.:54366   1st Qu.:48.00   1st Qu.:24.96   1st Qu.: 43.00   1st Qu.: 95  
##  Median :57096   Median :59.00   Median :28.31   Median : 52.00   Median :102  
##  Mean   :56983   Mean   :59.23   Mean   :29.36   Mean   : 54.66   Mean   :111  
##  3rd Qu.:59593   3rd Qu.:70.00   3rd Qu.:32.62   3rd Qu.: 64.00   3rd Qu.:114  
##  Max.   :62158   Max.   :80.00   Max.   :84.87   Max.   :144.00   Max.   :375  
##     Insuline           Weight       Diabetes  
##  Min.   :  0.880   Min.   : 40.10   Yes: 286  
##  1st Qu.:  7.008   1st Qu.: 66.90   No :1486  
##  Median : 11.405   Median : 78.15             
##  Mean   : 15.101   Mean   : 81.73             
##  3rd Qu.: 17.995   3rd Qu.: 92.92             
##  Max.   :320.220   Max.   :230.70

Here I see a different range of values in my variables; therefore, I will normalize my numeric variables (the predictors) to prepare them for using in the kNN algorithm. I create the normalize function and then apply it to the predictor variables as below:

normalize <- function (i) {
  (i - min(i))/(max(i) - min(i))
}

norm_dat <- dat %>% 
  select(Age, BMI, Cholest, Glucose, Insuline, Weight) %>% 
  lapply(., normalize) %>% 
  as.data.frame()

All the new values are within the range of 0 and 1.

summary(norm_dat)
##       Age              BMI            Cholest          Glucose          Insuline      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.2000   1st Qu.:0.1476   1st Qu.:0.1920   1st Qu.:0.1026   1st Qu.:0.01919  
##  Median :0.4750   Median :0.1952   Median :0.2640   Median :0.1250   Median :0.03296  
##  Mean   :0.4807   Mean   :0.2102   Mean   :0.2853   Mean   :0.1539   Mean   :0.04453  
##  3rd Qu.:0.7500   3rd Qu.:0.2565   3rd Qu.:0.3600   3rd Qu.:0.1635   3rd Qu.:0.05359  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##      Weight      
##  Min.   :0.0000  
##  1st Qu.:0.1406  
##  Median :0.1996  
##  Mean   :0.2184  
##  3rd Qu.:0.2772  
##  Max.   :1.0000

To evaluate my kNN model's performance, I will divide the data set into a training and a test set. In the training dataset, I will perform the algorithm, and the test set will serve to asses the algorithm. The split must be 2/3 for a train set and 1/3 for the test set, and I need to make sure that both data sets have the same ratio of 'having or not' diabetes participants.

I use the 'sample()' function to take a sample with the same size that is my dataset 'dat' and assign 1 or 2 to each row with probability 0.67 and 0.33. Then, use it to define my training and test data sets:

dat_samp <- sample(2, nrow(dat), replace=TRUE, prob=c(0.67, 0.33))
dat_training <- norm_dat[dat_samp==1, 1:6]
dat_test <- norm_dat[dat_samp==2, 1:6]

I will store the 8-th column of my train data, which is the target variable (diabetes) in 'dat_target_group' because it will be used as 'cl' argument in knn function. Also, store the 8-th column of my test set in 'dat_target_group,' which I will use later to test the accuracy of the algorithm used.

dat_target_group <- dat[dat_samp==1, 8]
dat_test_group <- dat[dat_samp==2, 8]

Run the kNN algorithm

dat_pred <- knn(train = dat_training, test = dat_test, cl = dat_target_group, k=3)
dat_pred
##   [1] No  No  No  No  No  Yes No  No  No  No  Yes No  No  Yes No  No  No  No  No  No  No 
##  [22] Yes No  No  No  No  Yes No  No  No  No  Yes No  No  No  No  No  No  No  No  No  No 
##  [43] No  No  Yes No  Yes No  No  No  No  No  No  No  Yes No  No  Yes No  No  No  No  No 
##  [64] No  No  No  No  No  No  No  Yes No  No  No  Yes No  No  No  No  No  No  Yes No  No 
##  [85] No  No  No  No  Yes No  No  No  No  No  No  No  No  No  No  No  No  No  No  Yes No 
## [106] Yes No  No  No  No  No  No  No  No  Yes No  No  No  No  No  No  No  No  No  No  Yes
## [127] No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No 
## [148] No  No  No  No  Yes No  No  Yes No  No  No  No  Yes No  No  No  Yes No  No  No  No 
## [169] No  No  No  No  No  No  No  No  Yes No  No  No  Yes No  No  Yes No  No  No  No  No 
## [190] No  Yes No  No  No  No  No  Yes No  No  No  No  No  No  No  No  No  No  Yes No  No 
## [211] No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No 
## [232] No  No  No  No  No  No  No  No  No  No  Yes No  No  No  No  No  No  No  Yes No  No 
## [253] Yes No  No  No  No  Yes No  No  No  No  No  No  No  No  No  Yes No  No  No  No  No 
## [274] No  Yes Yes No  Yes No  No  No  No  No  No  No  No  No  Yes No  No  No  No  No  No 
## [295] No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  Yes No  No  No  No 
## [316] No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No 
## [337] No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  Yes No  No  No  No  No 
## [358] Yes No  No  No  No  Yes No  No  No  No  No  No  No  Yes Yes No  No  No  No  No  No 
## [379] No  No  No  No  No  Yes No  No  No  No  No  No  No  No  No  No  No  Yes No  No  No 
## [400] No  No  No  No  No  No  No  No  No  No  No  No  Yes No  No  No  No  No  No  No  No 
## [421] No  No  No  No  No  No  No  No  No  No  No  No  No  Yes No  No  No  No  No  No  No 
## [442] No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No 
## [463] No  No  No  No  No  No  No  No  No  No  No  Yes Yes Yes No  No  No  No  No  Yes No 
## [484] No  No  No  No  Yes No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No 
## [505] No  No  No  No  Yes No  No  No  No  No  No  No  No  No  Yes Yes Yes No  No  Yes Yes
## [526] No  No  No  No  No  No  No  No  No  No 
## Levels: Yes No

These are my predictive values. Note that the k parameter is often an odd number and means the amount of nearest neighbors you decide to check for every participant to determine in what category (Yes/No) for diabetes will he be assigned.
Below I can check the participants with and without diabetes in both datasets:

summary(dat_pred)
## Yes  No 
##  58 477
summary(dat_test_group)
## Yes  No 
##  86 449

Evaluate the model

To evaluate my kNN model I build a specific table (confusion matrix) with 'table' function putting my predictive values 'dat_pred' and values from my test set 'dat_test_group' to asses the kNN model.

tab <- table(dat_pred, dat_test_group)
tab
##         dat_test_group
## dat_pred Yes  No
##      Yes  37  21
##      No   49 428

Now I create the 'accuracy' function and use it to evaluate 'tab' :

accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
accuracy(tab)
## [1] 86.91589

In the NHANES dataset, I have run the k-nearest neighbor algorithm that gave me an 87% accurate.

As I mentioned in the top of the post, I am new to kNN so please suggest or correct my code.