A friend asked me whether I could write a loop that runs multiple regression models. She wanted to evaluate the association between 100 dependent variables (outcomes) and 100 independent variables (exposures), which means 10,000 regression models. Regression models with many dependent (outcome) and independent (exposure) variables are common in genetics.

So the models will look something like this (dx is a dependent variable, ix is an independent variable, and v1 to v3 are other variables):

dx1 = ix1 + v1 + v2 + v3
dx1 = ix2 + v1 + v2 + v3
dx1 = ix3 + v1 + v2 + v3
...
dx1 = ix100 + v1 + v2 + v3
dx2 = ix1 + v1 + v2 + v3
...
dx100 = ix100 + v1 + v2 + v3

The output should be a data frame with 5 columns: dependent variable, independent variable, beta estimate, standard error and p-value.
Something like this (the numbers are just for illustration purposes):

  d   i   beta se    pvalue
1 dx1 ix1 0.1  0.002 0.950
2 dx2 ix2 0.2  0.002 0.826
3 dx3 ix3 0.3  0.005 0.123

OK, now let's begin. The dataset that I received had all the variables in columns and the observations in rows (the data are not real, just random numbers for illustration purposes):

id dx1 dx2 ... dx100 ix1 ... ix100 v1 v2 v3
10 324 124 ... 214   32 ...  32    ax b4 c3
11 431 982 ... 114   12 ...  77    ce b2 c5
12 545 123 ... 104   34 ...  11    ar c2 a5
....

Position of variables

Store the column positions of the first and last dependent and independent variables in your dataset, and create empty vectors that will hold the results.

# outcome
out_start=2
out_end= 101
out_nvar=out_end-out_start+1

out_variable=rep(NA, out_nvar)
out_beta=rep(NA, out_nvar)
out_se = rep(NA, out_nvar)
out_pvalue=rep(NA, out_nvar)

# exposure
exp_start=102
exp_end=201
exp_nvar=exp_end-exp_start+1

exp_variable=rep(NA, exp_nvar)
exp_beta=rep(NA, exp_nvar)
exp_se = rep(NA, exp_nvar)
exp_pvalue=rep(NA, exp_nvar)

number=1
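
If you would rather not count columns by hand, the positions can also be looked up by name. This is only a minimal sketch: it assumes the outcome columns start with "dx" and the exposure columns start with "ix", as in the layout above, and the small dat created here is made-up toy data (3 outcomes and 2 exposures instead of 100 each).

```r
# Toy data with the same layout as the post: id, outcomes, exposures, covariates.
# All values are random and only serve as an illustration.
set.seed(1)
dat <- data.frame(
  id  = 1:20,
  dx1 = rnorm(20), dx2 = rnorm(20), dx3 = rnorm(20),
  ix1 = rnorm(20), ix2 = rnorm(20),
  v1  = rnorm(20),
  v2  = sample(letters[1:4], 20, replace = TRUE),
  v3  = sample(LETTERS[1:3], 20, replace = TRUE)
)

# Column positions found by name prefix (assumption: "dx" and "ix" prefixes)
out_cols <- grep("^dx", colnames(dat))
exp_cols <- grep("^ix", colnames(dat))

out_start <- min(out_cols)
out_end   <- max(out_cols)
exp_start <- min(exp_cols)
exp_end   <- max(exp_cols)

# Each block must be contiguous for out_start:out_end and exp_start:exp_end to be valid
stopifnot(identical(out_cols, out_start:out_end),
          identical(exp_cols, exp_start:exp_end))
```

With the real dataset, only the two grep() patterns would need to match your own variable names.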

For Loop

I used a linear mixed-effects model, so I loaded the lme4 library. The loop should work with other regression analyses (e.g. linear regression) if you modify it according to your regression model. If you don’t know which part to modify, leave a comment below and I will try to help.

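Like other loops, this one takes the variables of interest one at a time and, for each model, extracts and stores the beta, the standard error, and the p-value of the exposure. The p-value is computed from a Wald z-test, i.e. the estimate divided by its standard error compared against the standard normal distribution. Remember, this code is specific to linear mixed-effects models.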

library(lme4)
for (i in out_start:out_end){
  outcome = colnames(dat)[i]
  for (j in exp_start:exp_end){
    exposure = colnames(dat)[j]
    model <- lmer(get(outcome) ~ get(exposure) + v1 + (1|v2) + (1|v3),
      na.action = na.exclude,
      data=dat)

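    # Wald z-test for every fixed effect: z = estimate / standard error,
    # two-sided p-value from the standard normal distribution
    Vcov <- vcov(model, useScale = FALSE)
    beta <- fixef(model)
    se <- sqrt(diag(Vcov))
    zval <- beta / se
    pval <- 2 * pnorm(abs(zval), lower.tail = FALSE)

    # Element 2 is the exposure coefficient (element 1 is the intercept).
    # The same estimates are stored twice: once labelled with the outcome ...
    out_beta[number] = as.numeric(beta[2])
    out_se[number] = as.numeric(se[2])
    out_pvalue[number] = as.numeric(pval[2])
    out_variable[number] = outcome
    number = number + 1

    # ... and once labelled with the exposure. Because number advances twice
    # per model, each set of vectors has NA gaps that na.omit() removes later.
    exp_beta[number] = as.numeric(beta[2])
    exp_se[number] = as.numeric(se[2])
    exp_pvalue[number] = as.numeric(pval[2])
    exp_variable[number] = exposure
    number = number + 1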
  }
}
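
For ordinary linear regression, the part to change is the model call and the extraction of the estimates. The following is a rough sketch of what that could look like for a single outcome/exposure pair, not a drop-in replacement for the loop: the toy dat is invented, and reformulate() is used here instead of get() so that the coefficient row carries the exposure's own name.

```r
# Toy data for one outcome, one exposure and one covariate (random values)
set.seed(42)
dat <- data.frame(dx1 = rnorm(50), ix1 = rnorm(50), v1 = rnorm(50))

outcome  <- "dx1"
exposure <- "ix1"

# Build the formula dx1 ~ ix1 + v1 from the variable names
form <- reformulate(c(exposure, "v1"), response = outcome)
fit  <- lm(form, data = dat)

# Coefficient table with columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))

# Pick the exposure row by name instead of by position
beta <- coefs[exposure, "Estimate"]
se   <- coefs[exposure, "Std. Error"]
pval <- coefs[exposure, "Pr(>|t|)"]
```

Note that lm() reports a t-test p-value directly, so the manual z-test step from the mixed-model loop is not needed in this case.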

Create data frames with the results:

outcome = data.frame(out_variable, out_beta, out_se, out_pvalue)
exposure = data.frame(exp_variable, exp_beta, exp_se, exp_pvalue)

Management of the dataframe

We have 2 different data frames with our results and we need to combine them into one. With the help of the tidyverse package this is a simple task. Basically, we give the columns of both data frames the same names and then stack the two data frames on top of each other.

library(tidyverse)
outcome = outcome %>% 
  rename(
    variable = out_variable,
    beta = out_beta,
    se = out_se,
    pvalue = out_pvalue
    )
exposure = exposure %>%
  rename(
    variable = exp_variable,
    beta = exp_beta,
    se = exp_se,
    pvalue = exp_pvalue
    )
all = rbind(outcome, exposure)
all = na.omit(all)

head(all)
     variable beta se    pvalue
1    dx1      0.1  0.002 0.950
3    dx2      0.2  0.002 0.826
........
2    ix1      0.1  0.002 0.950
4    ix2      0.2  0.002 0.826
........

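However, this is not yet the data frame we are looking for. We need each row to contain both the dependent and the independent variable. The two rows that belong to the same model share identical beta, se and pvalue values, so spreading the variable name by its type (the first two letters, dx or ix) collapses them into a single row. Therefore, we do the final transformation as follows: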

data = all %>% 
  mutate(
    type = substr(variable, 1, 2)
  ) %>% 
  spread(type, variable) %>% 
  rename(
    d = dx,
    i = ix
  ) %>% 
  mutate (
    beta = round(beta, 5),
    se = round(se, 5),
    pvalue = round(pvalue, 5)
  ) %>% 
  select(d, i, beta, se, pvalue)

head(data)
  d   i   beta se    pvalue
1 dx1 ix1 0.1  0.002 0.950
2 dx2 ix2 0.2  0.002 0.826
3 dx3 ix3 0.3  0.005 0.123
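
As an aside, the same five-column table can also be built in one pass, without the two intermediate data frames. The sketch below is an alternative to the approach above, not the code I used: fit_one() is a hypothetical helper, the toy dat is random, and lm() stands in for lmer() so that the example runs without lme4.

```r
# Toy data: 2 outcomes, 3 exposures, 1 covariate (random values)
set.seed(7)
n <- 40
dat <- data.frame(
  id  = seq_len(n),
  dx1 = rnorm(n), dx2 = rnorm(n),
  ix1 = rnorm(n), ix2 = rnorm(n), ix3 = rnorm(n),
  v1  = rnorm(n)
)

out_names <- grep("^dx", colnames(dat), value = TRUE)
exp_names <- grep("^ix", colnames(dat), value = TRUE)

# One row per outcome/exposure combination
pairs <- expand.grid(d = out_names, i = exp_names, stringsAsFactors = FALSE)

# Hypothetical helper: fit one model and return a one-row data frame
fit_one <- function(d, i) {
  fit <- lm(reformulate(c(i, "v1"), response = d), data = dat)
  est <- coef(summary(fit))[i, ]
  data.frame(d = d, i = i,
             beta   = est[["Estimate"]],
             se     = est[["Std. Error"]],
             pvalue = est[["Pr(>|t|)"]])
}

# Fit every pair and stack the one-row results
results <- do.call(rbind, lapply(seq_len(nrow(pairs)), function(k) {
  fit_one(pairs$d[k], pairs$i[k])
}))

head(results)
```

Because each row is labelled with its outcome and exposure at the moment the model is fitted, this version does not depend on the estimates being unique across models.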

I hope you find this post useful for your research and data analysis!