We publish R & Python tutorials from scientists at academic and scientific institutions with a goal to give everyone in the world access to a free knowledge. Our tutorials cover different topics including statistics, data manipulation and visualization!

Regression Models

A friend asked me whether I can create a loop which will run multiple regression models. She wanted to evaluate the association between 100 dependent variables (outcome) and 100 independent variable (exposure), which means 10,000 regression models. Regression models with multiple dependent (outcome) and independent (exposure) variables are common in genetics.

So models will be something like this: (dx is dependent and ix is independent variable, v are other variables)

dx1 = ix1 + v1 + v2 + v3 dx1 = ix2 + v1 + v2 + v3 dx1 = ix3 + v1 + v2 + v3 ... dx1 = ix100 + v1 + v2 + v3 dx2 = ix1 + v1 + v2 + v3 ... dx100 = ix100 + v1 + v2 + v3

The output should be a data frame with 5 columns, including dependent variable, independent variable, beta estimate, standard error and the p-value.

Something like this (those numbers are just for illustration purposes):

d i beta se pvalue 1 dx1 ix1 0.1 0.002 0.950 2 dx2 ix2 0.2 0.002 0.826 3 dx3 ix3 0.3 0.005 0.123

OK, now lets begin: the dataset that I received had all the variables in columns and observations in rows (the data is not real, just random numbers for illustration purposes):

id dx1 dx2 ... dx100 ix1 ... 1x100 v1 v2 v3 10 324 124 ... 214 32 ... 32 ax b4 c3 11 431 982 ... 114 12 ... 77 ce b2 c5 12 545 123 ... 104 34 ... 11 ar c2 a5 ....

Create vectors for the position of the dependent and independent variables in your dataset.

# outcome out_start=2 out_end= 101 out_nvar=out_end-out_start+1 out_variable=rep(NA, out_nvar) out_beta=rep(NA, out_nvar) out_se = rep(NA, out_nvar) out_pvalue=rep(NA, out_nvar) # exposure exp_start=102 exp_end=203 exp_nvar=exp_end-exp_start+1 exp_variable=rep(NA, exp_nvar) exp_beta=rep(NA, exp_nvar) exp_se = rep(NA, out_nvar) exp_pvalue=rep(NA, exp_nvar) number=1

I used linear mixed effect model and therefore I loaded the `lme4`

library. The loop should work with other regression analysis (i.e. linear regression), if you modify it according to your regression model. If you don’t know which part to modify, leave a comment below and I will try to help.

As other loops, this call variables of interest one by one and for each of them extract and store the betas, standard error and p value. Remember, this code is specific for linear mixed effect models.

library(lme4) for (i in out_start:out_end){ outcome = colnames(dat)[i] for (j in exp_start:exp_end){ exposure = colnames(dat)[j] model <- lmer(get(outcome) ~ get(exposure) + v1 + (1|v2) + (1|v3), na.action = na.exclude, data=dat) Vcov <- vcov(model, useScale = FALSE) beta <- fixef(model) se <- sqrt(diag(Vcov)) zval <- beta / se pval <- 2 * pnorm(abs(zval), lower.tail = FALSE) out_beta[number] = as.numeric(beta[2]) out_se[number] = as.numeric(se[2]) out_pvalue[number] = as.numeric(pval[2]) out_variable[number] = outcome number = number + 1 exp_beta[number] = as.numeric(beta[2]) exp_se[number] = as.numeric(se[2]) exp_pvalue[number] = as.numeric(pval[2]) exp_variable[number] = exposure number = number + 1 } }

Create a dataframe with results:

outcome = data.frame(out_variable, out_beta, out_se, out_pvalue) exposure = data.frame(exp_variable, exp_beta, exp_se, exp_pvalue)

We have 2 different dataframes with our results and we need to combine in one. With the help of `tidyverse`

package this is a simple task. Basically we rename variables by giving the same name and after we merge both dataframes together.

library(tidyverse) outcome = outcome %>% rename( variable = out_variable, beta = out_beta, se = out_se, pvalue = out_pvalue, obs = out_nobs ) exposure = exposure %>% rename( variable = exp_variable, beta = exp_beta, se = exp_se, pvalue = exp_pvalue, obs = exp_nobs ) all = rbind(outcome, exposure) all = na.omit(all) head(all)variable beta se pvalue 1 dx1 0.1 0.002 0.950 3 dx2 0.2 0.002 0.826 ........ 2 ix1 0.1 0.002 0.950 4 ix2 0.2 0.002 0.826 ........

Yet, this is not a dataframe that we are looking for. We need a dataframe to have both dependent and independent variables in one row. Therefore, we do the final transformation as follows:

data = all %>% mutate( type = substr(variable, 1, 2) ) %>% spread(type, variable) %>% rename( d = dx, i = ix ) %>% mutate ( beta = round(beta, 5), se = round(se, 5), pvalue = round(pvalue, 5) ) %>% select(d, i, beta, se, pvalue) head(data)d i beta se pvalue 1 dx1 ix1 0.1 0.002 0.950 2 dx2 ix2 0.2 0.002 0.826 3 dx3 ix3 0.3 0.005 0.123

I hope you find this post useful for your research and data analysis!