1 Overview

The batchelor package provides the batchCorrect() generic that dispatches on its PARAM argument. Users writing code using batchCorrect() can easily change one method for another by simply modifying the class of object supplied as PARAM. For example:

B1 <- matrix(rnorm(10000), ncol=50) # Batch 1 
B2 <- matrix(rnorm(10000), ncol=50) # Batch 2

# Switching easily between batch correction methods.
m.out <- batchCorrect(B1, B2, PARAM=ClassicMnnParam())
f.out <- batchCorrect(B1, B2, PARAM=FastMnnParam(d=20))
r.out <- batchCorrect(B1, B2, PARAM=RescaleParam(pseudo.count=0))

Developers of other packages can extend this further by adding their batch correction methods to this dispatch system. This improves interoperability across packages by allowing users to easily experiment with different methods.

2 Setting up

You will need to Imports: batchelor, methods in your DESCRIPTION file. You will also need to add import(methods), importFrom(batchelor, "batchCorrect") and importClassesFrom(batchelor, "BatchelorParam") in your NAMESPACE file.

Obviously, you will also need to have a function that implements your batch correction method. For demonstration purposes, we will use an identity function that simply returns the input values1 Not a very good correction, but that’s not the point right now.. This is implemented like so:

noCorrect <- function(...) 
# Takes a set of batches and returns them without modification. 
{
   do.call(cbind, list(...)) 
}

3 Deriving a BatchelorParam subclass

We need to define a new BatchelorParam subclass that instructs the batchCorrect() generic to dispatch to our new method. This is most easily done like so:

NothingParam <- setClass("NothingParam", contains="BatchelorParam")

Note that BatchelorParam itself is derived from a SimpleList and can be modified with standard list operators like $.

nothing <- NothingParam()
nothing
## NothingParam of length 0
nothing$some_value <- 1
nothing
## NothingParam of length 1
## names(1): some_value

If no parameters are set, the default values in the function will be used2 Here there are none in noCorrect(), but presumably your function is more complex than that.. Additional slots can be specified in the class definition if there are important parameters that need to be manually specified by the user.

4 Defining a batchCorrect method

4.1 Input

The batchCorrect() generic looks like this:

batchCorrect
## standardGeneric for "batchCorrect" defined from package "batchelor"
## 
## function (..., batch = NULL, restrict = NULL, subset.row = NULL, 
##     correct.all = FALSE, assay.type = NULL, get.spikes = FALSE, 
##     PARAM) 
## standardGeneric("batchCorrect")
## <bytecode: 0x5594aa57fc28>
## <environment: 0x5594aa561148>
## Methods may be defined for arguments: PARAM
## Use  showMethods("batchCorrect")  for currently available ones.

Any implemented method must accept one or more matrix-like objects containing single-cell gene expression matrices in .... Rows are assumed to be genes and columns are assumed to be cells. If only one object is supplied, batch must be specified to indicate the batches to which each cell belongs.

Alternatively, one or more SingleCellExperiment objects can be supplied, containing the gene expression matrix in the assay.type assay. These should not be mixed with matrix-like objects, i.e., if one object is a SingleCellExperiment, all objects should be SingleCellExperiments.

The subset.row= argument specifies a subset of genes on which to perform the correction. The get.spikes= argument specifies whether spike-in transcripts should be used for SingleCellExperiment inputs. The correct.all= argument specifies whether corrected values should be returned for all genes, by “extrapolating” from the results for the genes that were used3 If your method cannot support this option, setting it to TRUE should raise an error.. See the Output section below for the expected output from each combination of these settings.

The restrict= argument allows users to compute the correction using only a subset of cells in each batch (e.g., from cell controls). The correction is then “extrapolated” to all cells in the batch4 Again, if your method cannot support this, any non-NULL value of restrict should raise an error., such that corrected values are returned for all cells.

4.2 Output

Any implemented method must return a SingleCellExperiment where the first assay contains corrected gene expression values for all genes. Corrected values should be returned:

  • for all genes, if correct.all=TRUE for any input.
  • for all genes, if subset.row=NULL for matrix-like inputs.
  • for selected genes, if subset.row is not NULL for matrix-like inputs.
  • for all genes, if subset.row=NULL and get.spikes=TRUE for SingleCellExperiment inputs.
  • for all non-spike-in transcripts, if subset.row=NULL and get.spikes=FALSE for SingleCellExperiment inputs.
  • for selected genes, if subset.row is not NULL and get.spikes=FALSE for matrix-like inputs.
  • for selected genes that are not spike-in transcripts, if subset.row is not NULL and get.spikes=FALSE for SingleCellExperiment inputs.

This looks more complicated than it really is, as developers can just modify subset.row to exclude spike-ins when get.spikes=FALSE.

Cells should be reported in the same order that they are supplied in .... In cases with multiple batches, the cell identities are simply concatenated from successive objects in their specified order, i.e., all cells from the first object (in their provided order), then all cells from the second object, and so on. If there is only a single batch, the order of cells in that batch should be preserved.

The output object should have row names equal to the row names of the input objects. Column names should be equivalent to the concatenated column names of the input objects, unless all are NULL, in which case the column names in the output can be NULL. In situations where some input objects have column names, and others are NULL, those missing column names should be filled in with empty strings. This represents the expected behaviour when cbinding multiple matrices together.

Finally, the colData slot should contain ‘batch’, a vector specifying the batch of origin for each cell.

4.3 Demonstration

Finally, we define a method that calls our noCorrect function while satisfying all of the above input/output requirements. To be clear, it is not mandatory to lay out the code as shown below; this is simply one way that all the requirements can be satisfied. We have used some internal batchelor functions for brevity - please contact us if you find these useful and want them to be exported.

setMethod("batchCorrect", "NothingParam", function(..., batch = NULL, 
    restrict=NULL, subset.row = NULL, correct.all = FALSE, 
    assay.type = "logcounts", get.spikes = FALSE, PARAM) 
{
    batches <- list(...)
    checkBatchConsistency(batches)

    # Pulling out information from the SCE objects.        
    is.sce <- checkIfSCE(batches)
    if (any(is.sce)) {
        sce.batches <- batches[is.sce]
        checkSpikeConsistency(sce.batches)
        subset.row <- batchelor:::.SCE_subset_genes(subset.row, sce.batches[[1]], get.spikes)
        batches[is.sce] <- lapply(sce.batches, assay, i=assay.type)
    }

    # Subsetting by 'batch', if only one object is supplied. 
    do.split <- length(batches)==1L
    if (do.split) {
        divided <- divideIntoBatches(batches[[1]], batch=batch, restrict=restrict)
        batches <- divided$batches
        restrict <- divided$restricted
    } 

    # Subsetting by row.
    # This is a per-gene "method", so correct.all=TRUE will ignore subset.row.
    # More complex methods will need to handle this differently.
    if (correct.all) {
        subset.row <- NULL
    } else if (!is.null(subset.row)) {
        subset.row <- normalizeSingleBracketSubscript(originals[[1]], subset.row)
        batches <- lapply(batches, "[", i=subset.row, , drop=FALSE)
    }

    # Don't really need to consider restrict!=NULL here, as this function
    # doesn't do anything with the cells anyway.
    output <- do.call(noCorrect, batches)

    # Reordering the output for correctness if it was previously split.
    if (do.split) {
        d.reo <- divided$reorder
        output <- output[,d.reo,drop=FALSE]
    }

    ncells.per.batch <- vapply(batches, FUN=ncol, FUN.VALUE=0L)
    batch.names <- names(batches)
    if (is.null(batch.names)) {
        batch.names <- seq_along(batches)
    }
    
    SingleCellExperiment(list(corrected=output), 
        colData=DataFrame(batch=rep(batch.names, ncells.per.batch)))
})

And it works5 In a strictly programming sense, as the method itself does no correction at all.:

n.out <- batchCorrect(B1, B2, PARAM=NothingParam())
n.out
## class: SingleCellExperiment 
## dim: 200 100 
## metadata(0):
## assays(1): corrected
## rownames: NULL
## rowData names(0):
## colnames: NULL
## colData names(1): batch
## reducedDimNames(0):
## spikeNames(0):

Remember to export both the new method and the NothingParam class and constructor.

5 Session information

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] scater_1.13.5               ggplot2_3.1.1              
##  [3] scRNAseq_1.11.0             batchelor_1.1.4            
##  [5] SingleCellExperiment_1.7.0  SummarizedExperiment_1.15.1
##  [7] DelayedArray_0.11.0         BiocParallel_1.19.0        
##  [9] matrixStats_0.54.0          Biobase_2.45.0             
## [11] GenomicRanges_1.37.8        GenomeInfoDb_1.21.1        
## [13] IRanges_2.19.6              S4Vectors_0.23.4           
## [15] BiocGenerics_0.31.2         knitr_1.23                 
## [17] BiocStyle_2.13.0           
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1               rsvd_1.0.0              
##  [3] lattice_0.20-38          assertthat_0.2.1        
##  [5] digest_0.6.19            R6_2.4.0                
##  [7] plyr_1.8.4               evaluate_0.14           
##  [9] pillar_1.4.1             zlibbioc_1.31.0         
## [11] rlang_0.3.4              lazyeval_0.2.2          
## [13] irlba_2.3.3              Matrix_1.2-17           
## [15] rmarkdown_1.13           labeling_0.3            
## [17] BiocNeighbors_1.3.1      stringr_1.4.0           
## [19] RCurl_1.95-4.12          munsell_0.5.0           
## [21] compiler_3.6.0           vipor_0.4.5             
## [23] BiocSingular_1.1.3       xfun_0.7                
## [25] pkgconfig_2.0.2          ggbeeswarm_0.6.0        
## [27] htmltools_0.3.6          tidyselect_0.2.5        
## [29] tibble_2.1.2             gridExtra_2.3           
## [31] GenomeInfoDbData_1.2.1   bookdown_0.11           
## [33] viridisLite_0.3.0        crayon_1.3.4            
## [35] dplyr_0.8.1              withr_2.1.2             
## [37] bitops_1.0-6             grid_3.6.0              
## [39] gtable_0.3.0             magrittr_1.5            
## [41] scales_1.0.0             stringi_1.4.3           
## [43] XVector_0.25.0           viridis_0.5.1           
## [45] DelayedMatrixStats_1.7.0 cowplot_0.9.4           
## [47] tools_3.6.0              glue_1.3.1              
## [49] beeswarm_0.2.3           purrr_0.3.2             
## [51] yaml_2.2.0               colorspace_1.4-1        
## [53] BiocManager_1.30.4