How to calculate the correlation coefficients for more than two variables

When I wanted to calculate the correlation coefficients for 25 variables, it became tricky. Because I wanted to export the results as a table, the output of cor() on its own was not enough.

Moreover, I wanted to filter the results by p-value and export only the significant correlations. After some searching and some help, I was able to accomplish this task, and I am happy to share the code with you.

Load libraries

library(tidyverse)
library(broom)

Create the dataset

dt <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))
dt
##             a           b           c          d
## 1   0.1296433 -0.04097794  1.05208987  1.1829971
## 2   1.1535071 -0.20391734 -0.41343509  0.3806597
## 3   1.0543683  0.76701232 -1.85961221  0.4893538
## 4  -0.2447034 -1.12584839  0.02915253  0.2820233
## 5   0.1323569 -0.74283490  0.12871463 -2.0830425
## 6  -1.7342506 -1.21694577  0.36379475  1.3919810
## 7  -0.3090799  0.71859777 -0.83134000 -0.5394136
## 8   0.9378126 -0.32229890  0.25171600  1.7425813
## 9  -0.6672502 -0.29552463 -0.12799691  1.2762921
## 10  0.3649565 -0.87161556 -0.82802152  0.1716353
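
Because rnorm() draws random numbers, your values will differ from the ones printed above every time you run the code. If you want the same dataset on every run, one option is to fix the random seed first. This is only a sketch, and the seed value 42 is an arbitrary choice of mine, not the one used to produce the output in this post.

```r
# Fix the random seed so the simulated data are the same on every run.
# The seed value (42) is arbitrary and is NOT the one behind the output above.
set.seed(42)
dt_seeded <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))

# Re-running with the same seed reproduces exactly the same data frame
set.seed(42)
dt_again <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))
identical(dt_seeded, dt_again)
```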

Assess the correlation of all variables in the dataset

cor(dt)
##            a           b          c           d
## a  1.0000000  0.45227252 -0.3843381 -0.11109833
## b  0.4522725  1.00000000 -0.5235755 -0.03721051
## c -0.3843381 -0.52357546  1.0000000  0.24206394
## d -0.1110983 -0.03721051  0.2420639  1.00000000

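Up to this point everything is simple and nothing new. But, as I mentioned above, I wanted to export the results and filter them by p-value, and for that the correlation matrix returned by cor() is not enough: it contains no p-values and does not have one row per pair of variables. Some more coding is needed, and the map() functions from the purrr package, together with tidy() from the broom package, will be useful.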
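
Before looping over all pairs, it may help to look at a single pair. The sketch below tests one pair with cor.test() and flattens the result with tidy(). It rebuilds a small data frame with an arbitrary seed so that it runs on its own, so the numbers will not match the ones printed in this post.

```r
library(broom)

# Small example data frame; the seed is an arbitrary choice for this sketch
set.seed(42)
dt <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))

# cor.test() handles one pair of variables and returns an "htest" object
single_test <- cor.test(dt$a, dt$b, method = "pearson")

# tidy() turns that object into a one-row data frame with columns such as
# estimate, statistic, p.value, conf.low and conf.high
single_row <- tidy(single_test)
single_row
```

The estimate column holds the same coefficient that cor(dt$a, dt$b) returns, now with its p-value and confidence interval next to it.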

First, I will create a dataset with two columns that lists every possible pair of variable names. In our example, there will be 6 rows.

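# as_tibble() replaces the deprecated as_data_frame(); "minimal" name repair
# lets setNames() supply the column names for the unnamed matrix
dt1 <- t(combn(names(dt), 2)) %>%
  as_tibble(.name_repair = "minimal") %>%
  setNames(c("x", "y"))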
dt1
## # A tibble: 6 x 2
##   x     y    
##   <chr> <chr>
## 1 a     b    
## 2 a     c    
## 3 a     d    
## 4 b     c    
## 5 b     d    
## 6 c     d
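
The number of rows follows from the number of unordered pairs, n * (n - 1) / 2. A quick check in base R, using the four column names from the example (written out by hand here so the block runs on its own):

```r
# Number of unordered pairs for the 4-variable example and for 25 variables
n_pairs_small <- choose(4, 2)    # 6
n_pairs_large <- choose(25, 2)   # 300

# combn() returns one column per pair, so t() gives one row per pair
pairs <- t(combn(c("a", "b", "c", "d"), 2))
nrow(pairs)                      # 6
```

So with 25 variables the table of pairs will have 300 rows, which is why building it by hand is not practical.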

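Now that we have the pairs of variable names and the dataset with the values, we will use map2() to look up each pair of variables and apply the cor.test() function to them. Then tidy() converts each test result into a one-row data frame, and unnest() expands these into regular columns.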

cor_result = dt1 %>%
  mutate(results = map2(x, y, ~ cor.test(dt[[.x]], dt[[.y]], method = "pearson")),
         results = map(results, tidy)) %>%
  unnest(results)
cor_result
## # A tibble: 6 x 10
##   x     y     estimate statistic p.value parameter conf.low conf.high
##   <chr> <chr>    <dbl>     <dbl>   <dbl>     <int>    <dbl>     <dbl>
## 1 a     b       0.452      1.43    0.189         8   -0.248     0.842
## 2 a     c      -0.384     -1.18    0.273         8   -0.816     0.324
## 3 a     d      -0.111     -0.316   0.760         8   -0.692     0.558
## 4 b     c      -0.524     -1.74    0.120         8   -0.867     0.158
## 5 b     d      -0.0372    -0.105   0.919         8   -0.652     0.607
## 6 c     d       0.242      0.706   0.500         8   -0.457     0.756
## # ... with 2 more variables: method <chr>, alternative <chr>

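Finally, select the columns of interest and filter the results by p-value. None of the pairs in this random data has a p-value below the conventional 0.05, so a cutoff of 0.5 is used here purely for illustration.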

cor_result %>% 
  select(x, y, estimate, p.value) %>% 
  filter(p.value < 0.5)
## # A tibble: 3 x 4
##   x     y     estimate p.value
##   <chr> <chr>    <dbl>   <dbl>
## 1 a     b        0.452   0.189
## 2 a     c       -0.384   0.273
## 3 b     c       -0.524   0.120
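
Since the goal was to export the results, here is one possible way to write the filtered table to a CSV file. This is a sketch only: the stand-in data frame copies the rounded values printed above so that the block runs on its own, and the file name and the use of a temporary directory are my own assumptions.

```r
library(dplyr)

# Stand-in for cor_result, using the rounded values from the printed output above
cor_result <- data.frame(
  x        = c("a", "a", "a", "b", "b", "c"),
  y        = c("b", "c", "d", "c", "d", "d"),
  estimate = c(0.452, -0.384, -0.111, -0.524, -0.0372, 0.242),
  p.value  = c(0.189, 0.273, 0.760, 0.120, 0.919, 0.500)
)

p_cutoff <- 0.5  # same illustrative threshold as in the filter above

# Hypothetical output location; replace with a path of your choice
out_file <- file.path(tempdir(), "significant_correlations.csv")

significant <- cor_result %>%
  select(x, y, estimate, p.value) %>%
  filter(p.value < p_cutoff)

# row.names = FALSE keeps the row index out of the exported file
write.csv(significant, out_file, row.names = FALSE)
```

With real data you would apply the same steps to the full cor_result table and set p_cutoff to your own significance level.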

With the filtering done, I end this post and hope you find it useful. If you have a better idea for accomplishing the same result, please share it with me.
