1 Introduction

In many analyses, a large number of variables have to be tested independently against the trait/endpoint of interest, while also adjusting for covariates and confounding factors. The major bottleneck in these analyses is the time that they take to complete.

With RegParallel, a large number of tests can be performed simultaneously. On a 12-core system, 144 variables can be tested simultaneously, with thousands of variables processed in a matter of seconds via ‘nested’ parallel processing.

RegParallel works for logistic regression, linear regression, conditional logistic regression, Cox proportional hazards and survival models, and Bayesian logistic regression. It also caters for generalised linear models that utilise survey weights created by the ‘survey’ CRAN package and fitted via ‘survey::svyglm’.

2 Installation

2.1 1. Download the package from Bioconductor

  if (!requireNamespace('BiocManager', quietly = TRUE))
    install.packages('BiocManager')

  BiocManager::install('RegParallel')

Note: to install the development version:

  remotes::install_github('kevinblighe/RegParallel')

2.2 2. Load the package into R session

  library(RegParallel)

3 Quick start

For this quick start, we will follow the tutorial (from Section 3.1) of RNA-seq workflow: gene-level exploratory analysis and differential expression. Specifically, we will load the ‘airway’ data, where different airway smooth muscle cells were treated with dexamethasone.

  library(airway)
  library(magrittr)

  data('airway')
  airway$dex %<>% relevel('untrt')

Normalise the raw counts in DESeq2 and produce regularised log expression levels:

  library(DESeq2)

  dds <- DESeqDataSet(airway, design = ~ dex + cell)
  dds <- DESeq(dds, betaPrior = FALSE)
  rldexpr <- assay(rlog(dds, blind = FALSE))
  rlddata <- data.frame(colData(airway), t(rldexpr))

3.1 Perform the most basic logistic regression analysis

Here, we fit a binomial logistic regression model to the data via glmParallel, with dexamethasone as the dependent variable.

  ## NOT RUN

  res1 <- RegParallel(
    data = rlddata[ ,1:3000],
    formula = 'dex ~ [*]',
    FUN = function(formula, data)
      glm(formula = formula,
        data = data,
        family = binomial(link = 'logit')),
    FUNtype = 'glm',
    variables = colnames(rlddata)[10:3000])

  res1[order(res1$P, decreasing=FALSE),]
##              Variable            Term       Beta StandardError             Z
##    1: ENSG00000095464 ENSG00000095464   43.27934  2.593463e+01  1.6687854476
##    2: ENSG00000071859 ENSG00000071859   12.96251  7.890287e+00  1.6428433092
##    3: ENSG00000069812 ENSG00000069812  -44.37139  2.704021e+01 -1.6409412536
##    4: ENSG00000072415 ENSG00000072415  -19.90841  1.227527e+01 -1.6218306224
##    5: ENSG00000073921 ENSG00000073921   14.59470  8.999831e+00  1.6216635641
##   ---                                                                       
## 2817: ENSG00000068831 ENSG00000068831  110.84893  2.729072e+05  0.0004061781
## 2818: ENSG00000069020 ENSG00000069020 -186.45744  4.603615e+05 -0.0004050239
## 2819: ENSG00000083642 ENSG00000083642 -789.55666  1.951104e+06 -0.0004046717
## 2820: ENSG00000104331 ENSG00000104331  394.14700  9.749138e+05  0.0004042891
## 2821: ENSG00000083097 ENSG00000083097 -217.48873  5.398191e+05 -0.0004028919
##                P            OR      ORlower      ORupper
##    1: 0.09515991  6.251402e+18 5.252646e-04 7.440065e+40
##    2: 0.10041536  4.261323e+05 8.190681e-02 2.217017e+12
##    3: 0.10080961  5.367228e-20 5.165170e-43 5.577191e+03
##    4: 0.10483962  2.258841e-09 8.038113e-20 6.347711e+01
##    5: 0.10487540  2.179701e+06 4.761313e-02 9.978541e+13
##   ---                                                   
## 2817: 0.99967592  1.383811e+48 0.000000e+00           NA
## 2818: 0.99967684  1.053326e-81 0.000000e+00           NA
## 2819: 0.99967712  0.000000e+00 0.000000e+00           NA
## 2820: 0.99967742 1.499223e+171 0.000000e+00           NA
## 2821: 0.99967854  3.514359e-95 0.000000e+00           NA
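
To make the `[*]` placeholder concrete, here is a minimal base-R sketch of the idea on simulated data: each variable name is substituted into the formula template and one model is fitted per variable. The toy data, the `fitOne` helper, and the serial `lapply` loop are assumptions made for illustration; this is not RegParallel's internal code, which distributes the fits across cores.

```r
# Simulated toy data (an assumption for illustration, not the airway data)
set.seed(1)
n <- 40
toy <- data.frame(
  dex = factor(rep(c('untrt', 'trt'), each = n / 2),
    levels = c('untrt', 'trt')),
  gene1 = rnorm(n),
  gene2 = rnorm(n),
  gene3 = rnorm(n))

template <- 'dex ~ [*]'
vars <- c('gene1', 'gene2', 'gene3')

# Hypothetical helper: substitute one variable into the template,
# fit the model, and keep the statistics for that variable's term
fitOne <- function(v) {
  f <- as.formula(sub('[*]', v, template, fixed = TRUE))
  fit <- glm(f, data = toy, family = binomial(link = 'logit'))
  co <- summary(fit)$coefficients[v, ]
  data.frame(
    Variable = v,
    Beta = co[['Estimate']],
    StandardError = co[['Std. Error']],
    Z = co[['z value']],
    P = co[['Pr(>|z|)']],
    stringsAsFactors = FALSE)
}

# One independent fit per variable, stacked into a single table
res <- do.call(rbind, lapply(vars, fitOne))
res
```

Because every fit is independent of the others, the loop over `vars` is the part that can be split across cores.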

3.2 Perform a basic linear regression

Here, we will perform the linear regression using both glmParallel and lmParallel. With the default settings, either function gives the same linear regression results.

Regularised log expression levels from our DESeq2 data will be used.

  rlddata <- rlddata[ ,1:2000]

  res2 <- RegParallel(
    data = rlddata,
    formula = '[*] ~ cell',
    FUN = function(formula, data)
      glm(formula = formula,
        data = data,
        method = 'glm.fit'),
    FUNtype = 'glm',
    variables = colnames(rlddata)[10:ncol(rlddata)],
    p.adjust = "none")

  res3 <- RegParallel(
    data = rlddata,
    formula = '[*] ~ cell',
    FUN = function(formula, data)
      lm(formula = formula,
        data = data),
    FUNtype = 'lm',
    variables = colnames(rlddata)[10:ncol(rlddata)],
    p.adjust = "none")

  subset(res2, P<0.05)
##             Variable        Term        Beta StandardError          t
##   1: ENSG00000001461 cellN061011 -0.46859875    0.10526111  -4.451775
##   2: ENSG00000001461 cellN080611 -0.84020922    0.10526111  -7.982143
##   3: ENSG00000001461  cellN61311 -0.87778101    0.10526111  -8.339082
##   4: ENSG00000001561 cellN080611 -1.71802758    0.13649920 -12.586357
##   5: ENSG00000001561  cellN61311 -1.05328889    0.13649920  -7.716448
##  ---                                                                 
## 519: ENSG00000092108 cellN061011 -0.12721659    0.01564082  -8.133625
## 520: ENSG00000092108  cellN61311 -0.12451203    0.01564082  -7.960708
## 521: ENSG00000092148 cellN080611 -0.34988071    0.10313461  -3.392467
## 522: ENSG00000092200 cellN080611  0.05906656    0.01521063   3.883241
## 523: ENSG00000092208 cellN080611 -0.28587683    0.08506716  -3.360602
##                 P        OR   ORlower   ORupper
##   1: 0.0112313246 0.6258787 0.5092039 0.7692873
##   2: 0.0013351958 0.4316202 0.3511586 0.5305181
##   3: 0.0011301853 0.4157043 0.3382098 0.5109554
##   4: 0.0002293465 0.1794197 0.1373036 0.2344544
##   5: 0.0015182960 0.3487887 0.2669157 0.4557753
##  ---                                           
## 519: 0.0012429963 0.8805429 0.8539591 0.9079544
## 520: 0.0013489163 0.8829276 0.8562718 0.9104133
## 521: 0.0274674209 0.7047722 0.5757851 0.8626549
## 522: 0.0177922771 1.0608458 1.0296864 1.0929482
## 523: 0.0282890537 0.7513552 0.6359690 0.8876762
  subset(res3, P<0.05)
##             Variable        Term        Beta StandardError          t
##   1: ENSG00000001461 cellN061011 -0.46859875    0.10526111  -4.451775
##   2: ENSG00000001461 cellN080611 -0.84020922    0.10526111  -7.982143
##   3: ENSG00000001461  cellN61311 -0.87778101    0.10526111  -8.339082
##   4: ENSG00000001561 cellN080611 -1.71802758    0.13649920 -12.586357
##   5: ENSG00000001561  cellN61311 -1.05328889    0.13649920  -7.716448
##  ---                                                                 
## 519: ENSG00000092108 cellN061011 -0.12721659    0.01564082  -8.133625
## 520: ENSG00000092108  cellN61311 -0.12451203    0.01564082  -7.960708
## 521: ENSG00000092148 cellN080611 -0.34988071    0.10313461  -3.392467
## 522: ENSG00000092200 cellN080611  0.05906656    0.01521063   3.883241
## 523: ENSG00000092208 cellN080611 -0.28587683    0.08506716  -3.360602
##                 P        OR   ORlower   ORupper
##   1: 0.0112313246 0.6258787 0.5092039 0.7692873
##   2: 0.0013351958 0.4316202 0.3511586 0.5305181
##   3: 0.0011301853 0.4157043 0.3382098 0.5109554
##   4: 0.0002293465 0.1794197 0.1373036 0.2344544
##   5: 0.0015182960 0.3487887 0.2669157 0.4557753
##  ---                                           
## 519: 0.0012429963 0.8805429 0.8539591 0.9079544
## 520: 0.0013489163 0.8829276 0.8562718 0.9104133
## 521: 0.0274674209 0.7047722 0.5757851 0.8626549
## 522: 0.0177922771 1.0608458 1.0296864 1.0929482
## 523: 0.0282890537 0.7513552 0.6359690 0.8876762
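
The agreement between the two result tables follows from glm() defaulting to a Gaussian family with an identity link, which is an ordinary linear model. A small sketch on simulated data (the `toy` data frame is an assumption for illustration) shows the same equivalence directly in base R:

```r
# Simulated expression values across four cell lines (illustrative only)
set.seed(42)
toy <- data.frame(
  expr = rnorm(24),
  cell = factor(rep(c('A', 'B', 'C', 'D'), each = 6)))

# glm() with its default family (gaussian) versus lm()
fitGlm <- glm(expr ~ cell, data = toy)
fitLm <- lm(expr ~ cell, data = toy)

# Coefficients and standard errors agree to numerical tolerance
sameCoef <- all.equal(coef(fitGlm), coef(fitLm))
sameSE <- all.equal(
  summary(fitGlm)$coefficients[, 'Std. Error'],
  summary(fitLm)$coefficients[, 'Std. Error'])
sameCoef
sameSE
```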

3.3 Survival analysis via Cox Proportional Hazards regression

For this example, we will load breast cancer gene expression data with recurrence free survival (RFS) from Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis. Specifically, we will encode each gene’s expression into Low|Mid|High based on Z-scores and compare these against RFS while adjusting for tumour grade in a Cox Proportional Hazards model.

First, let’s read in and prepare the data:

  library(Biobase)
  library(GEOquery)

  # load series and platform data from GEO
    gset <- getGEO('GSE2990', GSEMatrix =TRUE, getGPL=FALSE)
    x <- exprs(gset[[1]])

  # remove Affymetrix control probes
    x <- x[-grep('^AFFX', rownames(x)),]

  # transform the expression data to Z scores
    x <- t(scale(t(x)))

  # extract information of interest from the phenotype data (pdata)
    idx <- which(colnames(pData(gset[[1]])) %in%
      c('age:ch1', 'distant rfs:ch1', 'er:ch1',
        'ggi:ch1', 'grade:ch1', 'node:ch1',
        'size:ch1', 'time rfs:ch1'))
    metadata <- data.frame(pData(gset[[1]])[,idx],
      row.names = rownames(pData(gset[[1]])))

  # remove samples from the pdata that have any NA value
    discard <- apply(metadata, 1, function(x) any(is.na(x)))
    metadata <- metadata[!discard,]

  # filter the Z-scores expression data to match the samples in our pdata
    x <- x[,which(colnames(x) %in% rownames(metadata))]

  # check that sample names match exactly between pdata and Z-scores 
    all((colnames(x) == rownames(metadata)) == TRUE)
## [1] TRUE
  # create a merged pdata and Z-scores object
    coxdata <- data.frame(metadata, t(x))

  # tidy column names
    colnames(coxdata)[1:8] <- c('Age', 'Distant.RFS', 'ER',
      'GGI', 'Grade', 'Node', 'Size', 'Time.RFS')

  # prepare certain phenotypes
    coxdata$Age <- as.numeric(gsub('^KJ', '', coxdata$Age))
    coxdata$Distant.RFS <- as.numeric(coxdata$Distant.RFS)
    coxdata$ER <- factor(coxdata$ER, levels = c(0, 1))
    coxdata$Grade <- factor(coxdata$Grade, levels = c(1, 2, 3))
    coxdata$Time.RFS <- as.numeric(gsub('^KJX|^KJ', '', coxdata$Time.RFS))

With the data prepared, we can now apply a Cox Proportional Hazards model independently for each probe in the dataset against RFS.

Here, we also increase blocksize from its default to 2000 in order to speed up the analysis.

  library(survival)
  res5 <- RegParallel(
    data = coxdata,
    formula = 'Surv(Time.RFS, Distant.RFS) ~ [*]',
    FUN = function(formula, data)
      coxph(formula = formula,
        data = data,
        ties = 'breslow',
        singular.ok = TRUE),
    FUNtype = 'coxph',
    variables = colnames(coxdata)[9:ncol(coxdata)],
    blocksize = 2000,
    p.adjust = "BH")
  res5 <- res5[!is.na(res5$P),]
  res5
##           Variable        Term          Beta StandardError             Z
##     1:  X1007_s_at  X1007_s_at  0.3780639987     0.3535022  1.0694811914
##     2:    X1053_at    X1053_at  0.1177398813     0.2275041  0.5175285346
##     3:     X117_at     X117_at  0.6265036787     0.6763106  0.9263549892
##     4:     X121_at     X121_at -0.6138126274     0.6166626 -0.9953783151
##     5:  X1255_g_at  X1255_g_at -0.2043297829     0.3983930 -0.5128849375
##    ---                                                                  
## 22211:   X91703_at   X91703_at -0.4124539527     0.4883759 -0.8445419981
## 22212: X91816_f_at X91816_f_at  0.0482030943     0.3899180  0.1236236554
## 22213:   X91826_at   X91826_at  0.0546751431     0.3319572  0.1647053850
## 22214:   X91920_at   X91920_at -0.6452125945     0.8534623 -0.7559942684
## 22215:   X91952_at   X91952_at -0.0001396044     0.7377681 -0.0001892254
##                P       LRT      Wald   LogRank        HR    HRlower  HRupper
##     1: 0.2848529 0.2826716 0.2848529 0.2848400 1.4594563 0.72994385 2.918050
##     2: 0.6047873 0.6085603 0.6047873 0.6046839 1.1249515 0.72024775 1.757056
##     3: 0.3542615 0.3652989 0.3542615 0.3541855 1.8710573 0.49706191 7.043097
##     4: 0.3195523 0.3188303 0.3195523 0.3186921 0.5412832 0.16162940 1.812712
##     5: 0.6080318 0.6084157 0.6080318 0.6077573 0.8151935 0.37337733 1.779809
##    ---                                                                      
## 22211: 0.3983666 0.3949865 0.3983666 0.3981244 0.6620237 0.25419512 1.724169
## 22212: 0.9016133 0.9015048 0.9016133 0.9016144 1.0493838 0.48869230 2.253373
## 22213: 0.8691759 0.8691994 0.8691759 0.8691733 1.0561974 0.55103934 2.024453
## 22214: 0.4496526 0.4478541 0.4496526 0.4498007 0.5245510 0.09847349 2.794191
## 22215: 0.9998490 0.9998490 0.9998490 0.9998490 0.9998604 0.23547784 4.245498
##         P.adjust LRT.adjust Wald.adjust LogRank.adjust
##     1: 0.9999969  0.9999969   0.9999969      0.9999969
##     2: 0.9999969  0.9999969   0.9999969      0.9999969
##     3: 0.9999969  0.9999969   0.9999969      0.9999969
##     4: 0.9999969  0.9999969   0.9999969      0.9999969
##     5: 0.9999969  0.9999969   0.9999969      0.9999969
##    ---                                                
## 22211: 0.9999969  0.9999969   0.9999969      0.9999969
## 22212: 0.9999969  0.9999969   0.9999969      0.9999969
## 22213: 0.9999969  0.9999969   0.9999969      0.9999969
## 22214: 0.9999969  0.9999969   0.9999969      0.9999969
## 22215: 0.9999969  0.9999969   0.9999969      0.9999969
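
The adjusted columns above come from requesting Benjamini-Hochberg ("BH") correction. As a sketch of what that correction does, the base-R function stats::p.adjust can be applied to a small vector of p-values. The p-values here are invented for illustration, and the manual calculation is shown only to expose the arithmetic:

```r
# Invented p-values, for illustration only
p <- c(0.001, 0.008, 0.039, 0.041, 0.60)

# Benjamini-Hochberg adjustment via base R
adjusted <- p.adjust(p, method = 'BH')

# The same arithmetic by hand: scale each p-value by n / rank,
# then enforce monotonicity from the largest p-value downwards
n <- length(p)
o <- order(p, decreasing = TRUE)
ro <- order(o)
manual <- pmin(1, cummin(n / (n:1) * p[o]))[ro]

adjusted
all.equal(adjusted, manual)
```

When no raw p-value is small relative to the number of tests, as in the table above, the adjusted values are all pushed close to 1.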

We now take the top probes from the model by Log Rank p-value and use biomaRt to look up the corresponding gene symbols.

not run

  res5 <- res5[order(res5$LogRank, decreasing = FALSE),]
  final <- subset(res5, LogRank < 0.01)
  probes <- gsub('^X', '', final$Variable)
  library(biomaRt)
  mart <- useMart('ENSEMBL_MART_ENSEMBL', host = 'useast.ensembl.org')
  mart <- useDataset("hsapiens_gene_ensembl", mart)
  annotLookup <- getBM(mart = mart,
    attributes = c('affy_hg_u133a',
      'ensembl_gene_id',
      'gene_biotype',
      'external_gene_name'),
    filter = 'affy_hg_u133a',
    values = probes,
    uniqueRows = TRUE)

Two of the top hits include CXCL12 and MMP10. High expression of CXCL12 was previously associated with good progression-free and overall survival in breast cancer (doi: 10.1016/j.cca.2018.05.041; https://www.ncbi.nlm.nih.gov/pubmed/29800557), whilst high expression of MMP10 was associated with poor prognosis in colon cancer (doi: 10.1186/s12885-016-2515-7; https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4950722/).

We can further explore the role of these genes in RFS by dividing their gene expression Z-scores into low, mid, and high expression groups based on Z-score cut-offs:

  # extract RFS and probe data for downstream analysis
    survplotdata <- coxdata[,c('Time.RFS', 'Distant.RFS',
      'X203666_at', 'X205680_at')]
    colnames(survplotdata) <- c('Time.RFS', 'Distant.RFS',
      'CXCL12', 'MMP10')

  # set Z-scale cut-offs for high and low expression
  # (the low cut-off is negative so that a 'Mid' group exists)
    highExpr <- 1.0
    lowExpr <- -1.0

  # encode the expression for CXCL12 and MMP10 as low, mid, and high
    survplotdata$CXCL12 <- ifelse(survplotdata$CXCL12 >= highExpr, 'High',
      ifelse(survplotdata$CXCL12 <= lowExpr, 'Low', 'Mid'))
    survplotdata$MMP10 <- ifelse(survplotdata$MMP10 >= highExpr, 'High',
      ifelse(survplotdata$MMP10 <= lowExpr, 'Low', 'Mid'))

  # relevel the factors to have mid as the reference level
    survplotdata$CXCL12 <- factor(survplotdata$CXCL12,
      levels = c('Mid', 'Low', 'High'))
    survplotdata$MMP10 <- factor(survplotdata$MMP10,
      levels = c('Mid', 'Low', 'High'))
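
The encoding and releveling steps can also be wrapped in one small function so that the same rule is applied to every gene. This is a sketch only: `encodeExpr` is a hypothetical helper, and the cut-offs of -1 and +1 are assumptions chosen for illustration.

```r
# Hypothetical helper: encode Z-scores as Low / Mid / High and
# return a factor with 'Mid' as the reference level
encodeExpr <- function(z, low = -1, high = 1) {
  stopifnot(is.numeric(z), low < high)
  out <- ifelse(z >= high, 'High',
    ifelse(z <= low, 'Low', 'Mid'))
  factor(out, levels = c('Mid', 'Low', 'High'))
}

# Illustrative Z-scores, including values exactly on the cut-offs
z <- c(-2.3, -1, 0, 0.4, 1, 1.7)
encoded <- encodeExpr(z)
table(encoded)
```

Keeping the rule in one place avoids the two genes being encoded against different inputs by accident.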

Plot the survival curves and place the Log Rank p-value in the plots:

  library(survminer)
  ggsurvplot(survfit(Surv(Time.RFS, Distant.RFS) ~ CXCL12,
    data = survplotdata),
    data = survplotdata,
    risk.table = TRUE,
    pval = TRUE,
    break.time.by = 500,
    ggtheme = theme_minimal(),
    risk.table.y.text.col = TRUE,
    risk.table.y.text = FALSE)

Figure: Survival analysis via Cox Proportional Hazards regression (CXCL12).

  ggsurvplot(survfit(Surv(Time.RFS, Distant.RFS) ~ MMP10,
    data = survplotdata),
    data = survplotdata,
    risk.table = TRUE,
    pval = TRUE,
    break.time.by = 500,
    ggtheme = theme_minimal(),
    risk.table.y.text.col = TRUE,
    risk.table.y.text = FALSE)

Figure: Survival analysis via Cox Proportional Hazards regression (MMP10).

3.4 Perform a conditional logistic regression

In this example, we will re-use the Cox data for the purpose of performing conditional logistic regression with tumour grade as our grouping / matching factor. For this example, we will use ER status as the dependent variable and also adjust for age.

  x <- exprs(gset[[1]])
  x <- x[-grep('^AFFX', rownames(x)),]
  x <- scale(x)
  x <- x[,which(colnames(x) %in% rownames(metadata))]

  coxdata <- data.frame(metadata, t(x))

  colnames(coxdata)[1:8] <- c('Age', 'Distant.RFS', 'ER',
    'GGI', 'Grade', 'Node',
    'Size', 'Time.RFS')

  coxdata$Age <- as.numeric(gsub('^KJ', '', coxdata$Age))
  coxdata$Grade <- factor(coxdata$Grade, levels = c(1, 2, 3))
  coxdata$ER <- as.numeric(coxdata$ER)
  coxdata <- coxdata[!is.na(coxdata$ER),]

  res6 <- RegParallel(
    data = coxdata,
    formula = 'ER ~ [*] + Age + strata(Grade)',
    FUN = function(formula, data)
      clogit(formula = formula,
        data = data,
        method = 'breslow'),
    FUNtype = 'clogit',
    variables = colnames(coxdata)[9:ncol(coxdata)],
    blocksize = 2000)

  subset(res6, P < 0.01)
##        Variable         Term       Beta StandardError         Z           P
## 1:   X204667_at   X204667_at  0.9940504     0.3628087  2.739875 0.006146252
## 2:   X205225_at   X205225_at  0.4444556     0.1633857  2.720285 0.006522559
## 3: X207813_s_at X207813_s_at  0.8218501     0.3050777  2.693904 0.007062046
## 4:   X212108_at   X212108_at  1.9610211     0.7607284  2.577820 0.009942574
## 5: X219497_s_at X219497_s_at -1.0249671     0.3541401 -2.894242 0.003800756
##            LRT       Wald    LogRank        HR   HRlower   HRupper
## 1: 0.006808415 0.02212540 0.02104525 2.7021573 1.3270501  5.502169
## 2: 0.010783544 0.01941078 0.01701248 1.5596409 1.1322713  2.148319
## 3: 0.037459927 0.02449358 0.02424809 2.2747043 1.2509569  4.136257
## 4: 0.033447973 0.03356050 0.03384960 7.1065797 1.6000274 31.564132
## 5: 0.005153233 0.01387183 0.01183245 0.3588083 0.1792329  0.718302

not run

  getBM(mart = mart,
    attributes = c('affy_hg_u133a',
      'ensembl_gene_id',
      'gene_biotype',
      'external_gene_name'),
    filter = 'affy_hg_u133a',
    values = c('204667_at',
      '205225_at',
      '207813_s_at',
      '212108_at',
      '219497_s_at'),
    uniqueRows=TRUE)

Oestrogen receptor (ESR1) is among the hits, which makes sense given that ER status is the dependent variable. Also, although 204667_at is not listed in biomaRt, it overlaps an exon of FOXA1, which also makes sense in relation to oestrogen signalling.

4 Advanced features

Advanced features include the ability to modify block size, choose different numbers of cores, enable ‘nested’ parallel processing, modify limits for confidence intervals, and exclude certain model terms from output.

4.1 Speed up processing

First create some test data for the purpose of benchmarking:

  options(scipen=10)
  options(digits=6)

  # create a data-matrix of 20 x 60000 (rows x cols) random numbers
  col <- 60000
  row <- 20
  mat <- matrix(
    rexp(col*row, rate = .1),
    ncol = col)

  # add fake gene and sample names
  colnames(mat) <- paste0('gene', 1:ncol(mat))

  rownames(mat) <- paste0('sample', 1:nrow(mat))

  # add some fake metadata
  modelling <- data.frame(
    cell = rep(c('B', 'T'), nrow(mat) / 2),
    group = c(rep(c('treatment'), nrow(mat) / 2), rep(c('control'), nrow(mat) / 2)),
    dosage = t(data.frame(matrix(rexp(row, rate = 1), ncol = row))),
    mat,
    row.names = rownames(mat))

4.1.1 ~2000 tests; blocksize, 500; cores, 2; nestedParallel, TRUE

With 2 cores instead of the default of 3, coupled with nestedParallel being enabled, a total of 2 x 2 = 4 threads will be used.

  df <- modelling[ ,1:2000]
  variables <- colnames(df)[4:ncol(df)]

  ptm <- proc.time()

  res <- RegParallel(
    data = df,
    formula = 'factor(group) ~ [*] + (cell:dosage) ^ 2',
    FUN = function(formula, data)
      glm(formula = formula,
        data = data,
        family = binomial(link = 'logit'),
        method = 'glm.fit'),
    FUNtype = 'glm',
    variables = variables,
    blocksize = 500,
    cores = 2,
    nestedParallel = TRUE,
    p.adjust = "BY")

  proc.time() - ptm
##    user  system elapsed 
##   3.606   1.985   2.884

4.1.2 ~2000 tests; blocksize, 500; cores, 2; nestedParallel, FALSE

  df <- modelling[ ,1:2000]
  variables <- colnames(df)[4:ncol(df)]

  ptm <- proc.time()

  res <- RegParallel(
    data = df,
    formula = 'factor(group) ~ [*] + (cell:dosage) ^ 2',
    FUN = function(formula, data)
      glm(formula = formula,
        data = data,
        family = binomial(link = 'logit'),
        method = 'glm.fit'),
    FUNtype = 'glm',
    variables = variables,
    blocksize = 500,
    cores = 2,
    nestedParallel = FALSE,
    p.adjust = "BY")

  proc.time() - ptm
##    user  system elapsed 
##   0.368   0.093   3.667

Focusing on the elapsed time (as system time only reports time from the last core that finished), we can see that nested processing gives a negligible improvement, and may actually be slower under certain conditions, when tested over a small number of variables. This is likely due to the overhead of managing the larger number of threads. The benefits of nested processing are only realised when processing a large number of variables:

4.1.3 ~40000 tests; blocksize, 2000; cores, 2; nestedParallel, TRUE

  df <- modelling[ ,1:40000]
  variables <- colnames(df)[4:ncol(df)]

  system.time(RegParallel(
    data = df,
    formula = 'factor(group) ~ [*] + (cell:dosage) ^ 2',
    FUN = function(formula, data)
      glm(formula = formula,
        data = data,
        family = binomial(link = 'logit'),
        method = 'glm.fit'),
    FUNtype = 'glm',
    variables = variables,
    blocksize = 2000,
    cores = 2,
    nestedParallel = TRUE))
##    user  system elapsed 
##  85.949  15.885  52.061

4.1.4 ~40000 tests; blocksize, 2000; cores, 2; nestedParallel, FALSE

  df <- modelling[,1:40000]
  variables <- colnames(df)[4:ncol(df)]

  system.time(RegParallel(
    data = df,
    formula = 'factor(group) ~ [*] + (cell:dosage) ^ 2',
    FUN = function(formula, data)
      glm(formula = formula,
        data = data,
        family = binomial(link = 'logit'),
        method = 'glm.fit'),
    FUNtype = 'glm',
    variables = variables,
    blocksize = 2000,
    cores = 2,
    nestedParallel = FALSE))
##    user  system elapsed 
##  78.798   0.724  79.579

Performance is system-dependent and even increasing cores may not result in huge gains in time. Performance is a trade-off between cores, forked threads, blocksize, and the number of terms in each model.
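
To reason about this trade-off, it can help to count how many blocks and threads a given setting implies. The sketch below is an assumption-laden illustration in base R: the `split()` partitioning is one plausible way to form blocks and is not taken from RegParallel's source, and `nThreads` is a hypothetical helper that only encodes the cores x cores arithmetic stated in this vignette.

```r
# 4500 made-up variable names, processed in blocks of 2000
variables <- paste0('gene', seq_len(4500))
blocksize <- 2000

# Assign each variable to a block by its position
blocks <- split(variables, ceiling(seq_along(variables) / blocksize))
blockSizes <- lengths(blocks)
blockSizes

# Hypothetical helper: threads in use, following the vignette's
# statement that nested processing uses cores x cores threads
nThreads <- function(cores, nestedParallel) {
  if (nestedParallel) cores * cores else cores
}

nThreads(2, nestedParallel = TRUE)
nThreads(3, nestedParallel = TRUE)
nThreads(3, nestedParallel = FALSE)
```

A larger blocksize means fewer blocks to schedule, while more threads means more management overhead per block, which is why neither setting helps on its own for small inputs.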

4.1.5 ~40000 tests; blocksize, 5000; cores, 3; nestedParallel, TRUE

In this example, we choose a large blocksize and 3 cores. With nestedParallel enabled, this translates to 9 simultaneous threads.

  df <- modelling[,1:40000]
  variables <- colnames(df)[4:ncol(df)]

  system.time(RegParallel(
    data = df,
    formula = 'factor(group) ~ [*] + (cell:dosage) ^ 2',
    FUN = function(formula, data)
      glm(formula = formula,
        data = data,
        family = binomial(link = 'logit'),
        method = 'glm.fit'),
    FUNtype = 'glm',
    variables = variables,
    blocksize = 5000,
    cores = 3,
    nestedParallel = TRUE))
##    user  system elapsed 
## 158.711  16.043  38.474

4.2 Modify confidence intervals

  df <- modelling[ ,1:500]
  variables <- colnames(df)[4:ncol(df)]

  # 99% confidence intervals
  RegParallel(
    data = df,
    formula = 'factor(group) ~ [*] + (cell:dosage) ^ 2',
    FUN = function(formula, data)
      glm(formula = formula,
        data = data,
        family = binomial(link = 'logit'),
        method = 'glm.fit'),
    FUNtype = 'glm',
    variables = variables,
    blocksize = 150,
    cores = 3,
    nestedParallel = TRUE,
    conflevel = 99)
##       Variable         Term       Beta StandardError         Z        P
##    1:    gene1        gene1  0.1329092     0.0912635  1.456323 0.145303
##    2:    gene1 cellB:dosage -1.3096942     1.2492116 -1.048417 0.294447
##    3:    gene1 cellT:dosage -4.5897182     2.8700108 -1.599199 0.109776
##    4:    gene2        gene2  0.0214360     0.0573548  0.373744 0.708595
##    5:    gene2 cellB:dosage -1.5417132     1.5957627 -0.966129 0.333979
##   ---                                                                  
## 1487:  gene496 cellB:dosage -1.0818026     1.0899067 -0.992564 0.320922
## 1488:  gene496 cellT:dosage -1.3387882     2.1278417 -0.629177 0.529233
## 1489:  gene497      gene497 -0.0808694     0.0744356 -1.086435 0.277287
## 1490:  gene497 cellB:dosage -1.2960068     1.3574211 -0.954757 0.339701
## 1491:  gene497 cellT:dosage -3.0993136     2.3531202 -1.317108 0.187802
##              OR       ORlower  ORupper
##    1: 1.1421463 0.90287596606  1.44483
##    2: 0.2699026 0.01080820265  6.74001
##    3: 0.0101557 0.00000625346 16.49306
##    4: 1.0216674 0.88135025789  1.18432
##    5: 0.2140141 0.00351004843 13.04884
##   ---                                 
## 1487: 0.3389839 0.02046137720  5.61595
## 1488: 0.2621632 0.00109199908 62.93917
## 1489: 0.9223141 0.76139521710  1.11724
## 1490: 0.2736222 0.00829176964  9.02933
## 1491: 0.0450801 0.00010510512 19.33510
  # 95% confidence intervals (default)
  RegParallel(
    data = df,
    formula = 'factor(group) ~ [*] + (cell:dosage) ^ 2',
    FUN = function(formula, data)
      glm(formula = formula,
        data = data,
        family = binomial(link = 'logit'),
        method = 'glm.fit'),
    FUNtype = 'glm',
    variables = variables,
    blocksize = 150,
    cores = 3,
    nestedParallel = TRUE,
    conflevel = 95)
##       Variable         Term       Beta StandardError         Z        P
##    1:    gene1        gene1  0.1329092     0.0912635  1.456323 0.145303
##    2:    gene1 cellB:dosage -1.3096942     1.2492116 -1.048417 0.294447
##    3:    gene1 cellT:dosage -4.5897182     2.8700108 -1.599199 0.109776
##    4:    gene2        gene2  0.0214360     0.0573548  0.373744 0.708595
##    5:    gene2 cellB:dosage -1.5417132     1.5957627 -0.966129 0.333979
##   ---                                                                  
## 1487:  gene496 cellB:dosage -1.0818026     1.0899067 -0.992564 0.320922
## 1488:  gene496 cellT:dosage -1.3387882     2.1278417 -0.629177 0.529233
## 1489:  gene497      gene497 -0.0808694     0.0744356 -1.086435 0.277287
## 1490:  gene497 cellB:dosage -1.2960068     1.3574211 -0.954757 0.339701
## 1491:  gene497 cellT:dosage -3.0993136     2.3531202 -1.317108 0.187802
##              OR      ORlower  ORupper
##    1: 1.1421463 0.9550763043  1.36586
##    2: 0.2699026 0.0233279318  3.12275
##    3: 0.0101557 0.0000366229  2.81623
##    4: 1.0216674 0.9130384130  1.14322
##    5: 0.2140141 0.0093783583  4.88380
##   ---                                
## 1487: 0.3389839 0.0400358300  2.87018
## 1488: 0.2621632 0.0040490162 16.97438
## 1489: 0.9223141 0.7971117208  1.06718
## 1490: 0.2736222 0.0191298893  3.91373
## 1491: 0.0450801 0.0004477191  4.53905
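
The effect of conflevel can be checked by hand. Assuming the limits are Wald-type intervals on the log-odds scale, exp(Beta ± z x StandardError), the sketch below recomputes the odds ratio limits for the first row (gene1) from its printed Beta and StandardError. `waldOR` is a hypothetical helper written for this illustration, not a RegParallel function.

```r
# Hypothetical helper: odds ratio with Wald-type confidence limits
waldOR <- function(beta, se, conflevel = 95) {
  stopifnot(conflevel > 0, conflevel < 100)
  # two-sided critical value of the standard normal distribution
  z <- qnorm(1 - (1 - conflevel / 100) / 2)
  c(OR = exp(beta),
    ORlower = exp(beta - z * se),
    ORupper = exp(beta + z * se))
}

# Beta and StandardError for gene1, copied from the output above
ci99 <- waldOR(0.1329092, 0.0912635, conflevel = 99)
ci95 <- waldOR(0.1329092, 0.0912635, conflevel = 95)
round(ci99, 5)
round(ci95, 5)
```

Under that assumption, the results agree with the printed ORlower and ORupper for gene1 at both levels, up to rounding of the inputs. Only the interval width changes with conflevel; Beta, Z, and P are unaffected.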

4.3 Remove some terms from output / include the intercept

  # remove terms but keep Intercept
  RegParallel(
    data = df,
    formula = 'factor(group) ~ [*] + (cell:dosage) ^ 2',
    FUN = function(formula, data)
      glm(formula = formula,
        data = data,
        family = binomial(link = 'logit'),
        method = 'glm.fit'),
    FUNtype = 'glm',
    variables = variables,
    blocksize = 150,
    cores = 3,
    nestedParallel = TRUE,
    conflevel = 95,
    excludeTerms = c('cell', 'dosage'),
    excludeIntercept = FALSE)
##      Variable        Term       Beta StandardError          Z        P       OR
##   1:    gene1 (Intercept)  0.0701097     0.9059179  0.0773908 0.938313 1.072626
##   2:    gene1       gene1  0.1329092     0.0912635  1.4563232 0.145303 1.142146
##   3:    gene2 (Intercept)  0.6971682     0.8618191  0.8089496 0.418544 2.008058
##   4:    gene2       gene2  0.0214360     0.0573548  0.3737443 0.708595 1.021667
##   5:    gene3 (Intercept)  0.8881033     0.8686998  1.0223363 0.306622 2.430515
##  ---                                                                           
## 990:  gene495     gene495  0.0188519     0.0450617  0.4183579 0.675685 1.019031
## 991:  gene496 (Intercept)  1.1487421     0.8951623  1.2832780 0.199395 3.154223
## 992:  gene496     gene496 -0.0378408     0.0477754 -0.7920550 0.428329 0.962866
## 993:  gene497 (Intercept)  1.7265409     1.1832521  1.4591488 0.144524 5.621176
## 994:  gene497     gene497 -0.0808694     0.0744356 -1.0864350 0.277287 0.922314
##       ORlower  ORupper
##   1: 0.181689  6.33238
##   2: 0.955076  1.36586
##   3: 0.370847 10.87322
##   4: 0.913038  1.14322
##   5: 0.442853 13.33942
##  ---                  
## 990: 0.932891  1.11312
## 991: 0.545668 18.23290
## 992: 0.876798  1.05738
## 993: 0.552893 57.14961
## 994: 0.797112  1.06718
  # remove everything but the variable being tested
  RegParallel(
    data = df,
    formula = 'factor(group) ~ [*] + (cell:dosage) ^ 2',
    FUN = function(formula, data)
      glm(formula = formula,
        data = data,
        family = binomial(link = 'logit'),
        method = 'glm.fit'),
    FUNtype = 'glm',
    variables = variables,
    blocksize = 150,
    cores = 3,
    nestedParallel = TRUE,
    conflevel = 95,
    excludeTerms = c('cell', 'dosage'),
    excludeIntercept = TRUE)
##      Variable    Term        Beta StandardError         Z        P       OR
##   1:    gene1   gene1  0.13290921     0.0912635  1.456323 0.145303 1.142146
##   2:    gene2   gene2  0.02143603     0.0573548  0.373744 0.708595 1.021667
##   3:    gene3   gene3 -0.00748424     0.0349805 -0.213955 0.830582 0.992544
##   4:    gene4   gene4 -0.04003979     0.0557372 -0.718368 0.472531 0.960751
##   5:    gene5   gene5  0.03574641     0.0438673  0.814877 0.415143 1.036393
##  ---                                                                       
## 493:  gene493 gene493 -0.01156767     0.0404666 -0.285857 0.774988 0.988499
## 494:  gene494 gene494  0.00757210     0.0675239  0.112140 0.910713 1.007601
## 495:  gene495 gene495  0.01885191     0.0450617  0.418358 0.675685 1.019031
## 496:  gene496 gene496 -0.03784075     0.0477754 -0.792055 0.428329 0.962866
## 497:  gene497 gene497 -0.08086941     0.0744356 -1.086435 0.277287 0.922314
##       ORlower ORupper
##   1: 0.955076 1.36586
##   2: 0.913038 1.14322
##   3: 0.926775 1.06298
##   4: 0.861326 1.07165
##   5: 0.951009 1.12944
##  ---                 
## 493: 0.913127 1.07009
## 494: 0.882698 1.15018
## 495: 0.932891 1.11312
## 496: 0.876798 1.05738
## 497: 0.797112 1.06718
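
The filtering seen in these two outputs can be mimicked on a plain coefficient table. The sketch below assumes that a term is dropped when its name contains any of the excluded strings, which is consistent with 'cellB:dosage' disappearing above but is otherwise an assumption; `dropTerms` is a hypothetical helper and not part of RegParallel.

```r
# A made-up coefficient table for one tested variable
coefTable <- data.frame(
  Variable = 'gene1',
  Term = c('(Intercept)', 'gene1', 'cellB:dosage', 'cellT:dosage'),
  stringsAsFactors = FALSE)

# Hypothetical helper: drop rows whose Term contains an excluded
# string, and optionally drop the intercept as well
dropTerms <- function(tab, excludeTerms = NULL, excludeIntercept = TRUE) {
  keep <- rep(TRUE, nrow(tab))
  for (term in excludeTerms) {
    keep <- keep & !grepl(term, tab$Term, fixed = TRUE)
  }
  if (excludeIntercept) {
    keep <- keep & tab$Term != '(Intercept)'
  }
  tab[keep, , drop = FALSE]
}

withIntercept <- dropTerms(coefTable, c('cell', 'dosage'),
  excludeIntercept = FALSE)
variableOnly <- dropTerms(coefTable, c('cell', 'dosage'),
  excludeIntercept = TRUE)
withIntercept$Term
variableOnly$Term
```

Note that substring matching of this kind would also drop a tested variable whose name happens to contain an excluded string, so excluded terms should be chosen with the variable names in mind.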

5 Acknowledgments

Thanks to Horácio Montenegro and GenoMax for testing cross-platform differences, and Wolfgang Huber for providing the nudge that FDR correction needed to be implemented.

Thanks to Michael Barnes in London for introducing me to parallel processing in R.

Thanks to Juan Celedón at Children’s Hospital of Pittsburgh.

Finally, thanks to Sarega Gurudas, whose suggestion led to the implementation of survey weights via svyglm.

6 Session info

sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] survminer_0.4.9             ggpubr_0.6.0               
##  [3] ggplot2_3.4.4               GEOquery_2.70.0            
##  [5] DESeq2_1.42.0               magrittr_2.0.3             
##  [7] airway_1.22.0               SummarizedExperiment_1.32.0
##  [9] Biobase_2.62.0              GenomicRanges_1.54.0       
## [11] GenomeInfoDb_1.38.0         IRanges_2.36.0             
## [13] S4Vectors_0.40.0            BiocGenerics_0.48.0        
## [15] MatrixGenerics_1.14.0       matrixStats_1.0.0          
## [17] RegParallel_1.20.0          arm_1.13-1                 
## [19] lme4_1.1-34                 Matrix_1.6-1.1             
## [21] MASS_7.3-60                 survival_3.5-7             
## [23] stringr_1.5.0               data.table_1.14.8          
## [25] doParallel_1.0.17           iterators_1.0.14           
## [27] foreach_1.5.2               knitr_1.44                 
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7            gridExtra_2.3           rlang_1.1.1            
##  [4] compiler_4.3.1          vctrs_0.6.4             pkgconfig_2.0.3        
##  [7] crayon_1.5.2            fastmap_1.1.1           backports_1.4.1        
## [10] XVector_0.42.0          labeling_0.4.3          KMsurv_0.1-5           
## [13] utf8_1.2.4              rmarkdown_2.25          markdown_1.11          
## [16] tzdb_0.4.0              nloptr_2.0.3            purrr_1.0.2            
## [19] xfun_0.40               zlibbioc_1.48.0         cachem_1.0.8           
## [22] jsonlite_1.8.7          DelayedArray_0.28.0     BiocParallel_1.36.0    
## [25] broom_1.0.5             R6_2.5.1                bslib_0.5.1            
## [28] stringi_1.7.12          limma_3.58.0            car_3.1-2              
## [31] boot_1.3-28.1           jquerylib_0.1.4         Rcpp_1.0.11            
## [34] zoo_1.8-12              R.utils_2.12.2          readr_2.1.4            
## [37] splines_4.3.1           tidyselect_1.2.0        abind_1.4-5            
## [40] yaml_2.3.7              ggtext_0.1.2            codetools_0.2-19       
## [43] curl_5.1.0              lattice_0.22-5          tibble_3.2.1           
## [46] withr_2.5.1             coda_0.19-4             evaluate_0.22          
## [49] xml2_1.3.5              survMisc_0.5.6          pillar_1.9.0           
## [52] carData_3.0-5           generics_0.1.3          RCurl_1.98-1.12        
## [55] hms_1.1.3               commonmark_1.9.0        munsell_0.5.0          
## [58] scales_1.2.1            minqa_1.2.6             xtable_1.8-4           
## [61] glue_1.6.2              tools_4.3.1             locfit_1.5-9.8         
## [64] ggsignif_0.6.4          grid_4.3.1              tidyr_1.3.0            
## [67] colorspace_2.1-0        nlme_3.1-163            GenomeInfoDbData_1.2.11
## [70] cli_3.6.1               km.ci_0.5-6             fansi_1.0.5            
## [73] S4Arrays_1.2.0          dplyr_1.1.3             gtable_0.3.4           
## [76] R.methodsS3_1.8.2       rstatix_0.7.2           sass_0.4.7             
## [79] digest_0.6.33           SparseArray_1.2.0       farver_2.1.1           
## [82] htmltools_0.5.6.1       R.oo_1.25.0             lifecycle_1.0.3        
## [85] statmod_1.5.0           gridtext_0.1.5

7 References

Blighe and Lasky-Su (2018)

Blighe, K, and J Lasky-Su. 2018. “RegParallel: Standard regression functions in R enabled for parallel processing over large data-frames.” https://github.com/kevinblighe/RegParallel.