Chapter 23 Dealing with big data

23.1 Motivation

Advances in scRNA-seq technologies have increased the number of cells that can be assayed in routine experiments. Public databases such as GEO are continually expanding with more scRNA-seq studies, while large-scale projects such as the Human Cell Atlas are expected to generate data for billions of cells. For effective data analysis, the computational methods need to scale with the increasing size of scRNA-seq data sets. This section discusses how we can use various aspects of the Bioconductor ecosystem to tune our analysis pipelines for greater speed and efficiency.

23.2 Fast approximations

23.2.1 Nearest neighbor searching

Identification of neighbouring cells in PC or expression space is a common procedure that is used in many functions, e.g., buildSNNGraph(), doubletCells(). The default is to favour accuracy over speed by using an exact nearest neighbour (NN) search, implemented with the \(k\)-means for \(k\)-nearest neighbours algorithm (Wang 2012). However, for large data sets, it may be preferable to use a faster approximate approach. The BiocNeighbors framework makes it easy to switch between search options by simply changing the BNPARAM= argument in compatible functions. To demonstrate, we will use the 10X PBMC data:

#--- loading ---#
library(DropletTestFiles)
raw.path <- getTestFile("tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz")
out.path <- file.path(tempdir(), "pbmc4k")
untar(raw.path, exdir=out.path)

library(DropletUtils)
fname <- file.path(out.path, "raw_gene_bc_matrices/GRCh38")
sce.pbmc <- read10xCounts(fname, col.names=TRUE)

#--- gene-annotation ---#
library(scater)
rownames(sce.pbmc) <- uniquifyFeatureNames(
    rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol)

library(EnsDb.Hsapiens.v86)
location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, 
    column="SEQNAME", keytype="GENEID")

#--- cell-detection ---#
set.seed(100)
e.out <- emptyDrops(counts(sce.pbmc))
sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)]

#--- quality-control ---#
stats <- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location=="MT")))
high.mito <- isOutlier(stats$subsets_Mito_percent, type="higher")
sce.pbmc <- sce.pbmc[,!high.mito]

#--- normalization ---#
library(scran)
set.seed(1000)
clusters <- quickCluster(sce.pbmc)
sce.pbmc <- computeSumFactors(sce.pbmc, cluster=clusters)
sce.pbmc <- logNormCounts(sce.pbmc)

#--- variance-modelling ---#
set.seed(1001)
dec.pbmc <- modelGeneVarByPoisson(sce.pbmc)
top.pbmc <- getTopHVGs(dec.pbmc, prop=0.1)

#--- dimensionality-reduction ---#
set.seed(10000)
sce.pbmc <- denoisePCA(sce.pbmc, subset.row=top.pbmc, technical=dec.pbmc)

set.seed(100000)
sce.pbmc <- runTSNE(sce.pbmc, dimred="PCA")

set.seed(1000000)
sce.pbmc <- runUMAP(sce.pbmc, dimred="PCA")

#--- clustering ---#
g <- buildSNNGraph(sce.pbmc, k=10, use.dimred = 'PCA')
clust <- igraph::cluster_walktrap(g)$membership
colLabels(sce.pbmc) <- factor(clust)
## class: SingleCellExperiment 
## dim: 33694 3985 
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
## rowData names(2): ID Symbol
## colnames(3985): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
##   TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
## colData names(4): Sample Barcode sizeFactor label
## reducedDimNames(3): PCA TSNE UMAP
## altExpNames(0):

We had previously clustered on a shared nearest neighbor graph generated with an exact neighbour search (Section 10.3). We repeat this below using an approximate search, implemented using the Annoy algorithm. This involves constructing a AnnoyParam object to specify the search algorithm and then passing it to the buildSNNGraph() function. The results from the exact and approximate searches are consistent with most clusters from the former re-appearing in the latter. This suggests that the inaccuracy from the approximation can be largely ignored.

##      Approx
## Exact   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
##    1  205   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##    2    0 479   0   0   2   0   0   0   0  27   0   0   0   0   0   0
##    3    0   0 540   0   1   0   0   0   0   0   0   0   0   0   0   0
##    4    0   0   0  55   0   0   0   0   0   1   0   0   0   0   0   0
##    5    0  25   0   0 349   0   0   0   0   0   0   0   0   0   0   0
##    6    0   0   0   0   0 125   0   0   0   0   0   0   0   0   0   0
##    7    0   0   0   0   0   0  46   0   0   0   0   0   0   0   0   0
##    8    0   0   0   0   0   0   0 432   0   0   0   0   0   0   0   0
##    9    0   0   0   1   0   0   0  10 291   0   0   0   0   0   0   0
##    10   0  28   0   0   0   0   0   0   0 839   0   0   0   0   0   0
##    11   0   0   0   0   0   0   0   0   0   0  47   0   0   0   0   0
##    12   0   0   0   0   0   0   0   0   0   0   0 155   0   0   0   0
##    13   0   0   0   0   0   0   0   0   0   0   0   0 166   0   0   0
##    14   0   0   0   0   0   0   0   0   0   0   0   0   0  61   0   0
##    15   0   0   0   0   0   0   0   0   0   0   0   0   0   0  84   0
##    16   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  16

Note that Annoy writes the NN index to disk prior to performing the search. Thus, it may not actually be faster than the default exact algorithm for small datasets, depending on whether the overhead of disk write is offset by the computational complexity of the search. It is also not difficult to find situations where the approximation deteriorates, especially at high dimensions, though this may not have an appreciable impact on the biological conclusions.

## [1] 0.5619

23.2.2 Singular value decomposition

The singular value decomposition (SVD) underlies the PCA used throughout our analyses, e.g., in denoisePCA(), fastMNN(), doubletCells(). (Briefly, the right singular vectors are the eigenvectors of the gene-gene covariance matrix, where each eigenvector represents the axis of maximum remaining variation in the PCA.) The default base::svd() function performs an exact SVD that is not performant for large datasets. Instead, we use fast approximate methods from the irlba and rsvd packages, conveniently wrapped into the BiocSingular package for ease of use and package development. Specifically, we can change the SVD algorithm used in any of these functions by simply specifying an alternative value for the BSPARAM= argument.

##  num [1:3985, 1:20] 15.05 13.43 -8.67 -7.74 6.45 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:3985] "AAACCTGAGAAGGCCT-1" "AAACCTGAGACAGACC-1" "AAACCTGAGGCATGGT-1" "AAACCTGCAAGGTTCT-1" ...
##   ..$ : chr [1:20] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "varExplained")= num [1:20] 85.36 40.43 23.22 8.99 6.66 ...
##  - attr(*, "percentVar")= num [1:20] 19.85 9.4 5.4 2.09 1.55 ...
##  - attr(*, "rotation")= num [1:500, 1:20] 0.203 0.1834 0.1779 0.1063 0.0647 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:500] "LYZ" "S100A9" "S100A8" "HLA-DRA" ...
##   .. ..$ : chr [1:20] "PC1" "PC2" "PC3" "PC4" ...
##  num [1:3985, 1:20] 15.05 13.43 -8.67 -7.74 6.45 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:3985] "AAACCTGAGAAGGCCT-1" "AAACCTGAGACAGACC-1" "AAACCTGAGGCATGGT-1" "AAACCTGCAAGGTTCT-1" ...
##   ..$ : chr [1:20] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "varExplained")= num [1:20] 85.36 40.43 23.22 8.99 6.66 ...
##  - attr(*, "percentVar")= num [1:20] 19.85 9.4 5.4 2.09 1.55 ...
##  - attr(*, "rotation")= num [1:500, 1:20] 0.203 0.1834 0.1779 0.1063 0.0647 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:500] "LYZ" "S100A9" "S100A8" "HLA-DRA" ...
##   .. ..$ : chr [1:20] "PC1" "PC2" "PC3" "PC4" ...

Both IRLBA and randomized SVD (RSVD) are much faster than the exact SVD with negligible loss of accuracy. This motivates their default use in many scran and scater functions, at the cost of requiring users to set the seed to guarantee reproducibility. IRLBA can occasionally fail to converge and require more iterations (passed via maxit= in IrlbaParam()), while RSVD involves an explicit trade-off between accuracy and speed based on its oversampling parameter (p=) and number of power iterations (q=). We tend to prefer IRLBA as its default behavior is more accurate, though RSVD is much faster for file-backed matrices (Section 28.8).

23.3 Parallelization

Parallelization of calculations across genes or cells is an obvious strategy for speeding up scRNA-seq analysis workflows. The BiocParallel package provides a common interface for parallel computing throughout the Bioconductor ecosystem, manifesting as a BPPARAM= argument in compatible functions. We can pick from a diverse range of parallelization backends depending on the available hardware and operating system. For example, we might use forking across 2 cores to parallelize the variance calculations on a Unix system:

## DataFrame with 33694 rows and 6 columns
##                     mean       total        tech          bio   p.value
##                <numeric>   <numeric>   <numeric>    <numeric> <numeric>
## RP11-34P13.3 0.000000000 0.000000000 0.000000000  0.00000e+00       NaN
## FAM138A      0.000000000 0.000000000 0.000000000  0.00000e+00       NaN
## OR4F5        0.000000000 0.000000000 0.000000000  0.00000e+00       NaN
## RP11-34P13.7 0.002166050 0.002227438 0.002227466 -2.81903e-08  0.500035
## RP11-34P13.8 0.000522431 0.000549601 0.000537244  1.23571e-05  0.436506
## ...                  ...         ...         ...          ...       ...
## AC233755.2     0.0000000   0.0000000   0.0000000   0.00000000       NaN
## AC233755.1     0.0000000   0.0000000   0.0000000   0.00000000       NaN
## AC240274.1     0.0102893   0.0121099   0.0105809   0.00152898  0.157651
## AC213203.1     0.0000000   0.0000000   0.0000000   0.00000000       NaN
## FAM231B        0.0000000   0.0000000   0.0000000   0.00000000       NaN
##                    FDR
##              <numeric>
## RP11-34P13.3       NaN
## FAM138A            NaN
## OR4F5              NaN
## RP11-34P13.7  0.756451
## RP11-34P13.8  0.756451
## ...                ...
## AC233755.2         NaN
## AC233755.1         NaN
## AC240274.1    0.756451
## AC213203.1         NaN
## FAM231B            NaN

Another approach would be to distribute jobs across a network of computers, which yields the same result:

## DataFrame with 33694 rows and 6 columns
##                     mean       total        tech          bio   p.value
##                <numeric>   <numeric>   <numeric>    <numeric> <numeric>
## RP11-34P13.3 0.000000000 0.000000000 0.000000000  0.00000e+00       NaN
## FAM138A      0.000000000 0.000000000 0.000000000  0.00000e+00       NaN
## OR4F5        0.000000000 0.000000000 0.000000000  0.00000e+00       NaN
## RP11-34P13.7 0.002166050 0.002227438 0.002227466 -2.81903e-08  0.500035
## RP11-34P13.8 0.000522431 0.000549601 0.000537244  1.23571e-05  0.436506
## ...                  ...         ...         ...          ...       ...
## AC233755.2     0.0000000   0.0000000   0.0000000   0.00000000       NaN
## AC233755.1     0.0000000   0.0000000   0.0000000   0.00000000       NaN
## AC240274.1     0.0102893   0.0121099   0.0105809   0.00152898  0.157651
## AC213203.1     0.0000000   0.0000000   0.0000000   0.00000000       NaN
## FAM231B        0.0000000   0.0000000   0.0000000   0.00000000       NaN
##                    FDR
##              <numeric>
## RP11-34P13.3       NaN
## FAM138A            NaN
## OR4F5              NaN
## RP11-34P13.7  0.756451
## RP11-34P13.8  0.756451
## ...                ...
## AC233755.2         NaN
## AC233755.1         NaN
## AC240274.1    0.756451
## AC213203.1         NaN
## FAM231B            NaN

For high-performance computing (HPC) systems with a cluster of compute nodes, we can distribute jobs via the job scheduler using the BatchtoolsParam class. The example below assumes a SLURM cluster, though the settings can be easily configured for a particular system (see here for details).

Parallelization is best suited for CPU-intensive calculations where the division of labor results in a concomitant reduction in compute time. It is not suited for tasks that are bounded by other compute resources, e.g., memory or file I/O (though the latter is less of an issue on HPC systems with parallel read/write). In particular, R itself is inherently single-core, so many of the parallelization backends involve (i) setting up one or more separate R sessions, (ii) loading the relevant packages and (iii) transmitting the data to that session. Depending on the nature and size of the task, this overhead may outweigh any benefit from parallel computing.

23.4 Out of memory representations

The count matrix is the central structure around which our analyses are based. In most of the previous chapters, this has been held fully in memory as a dense matrix or as a sparse dgCMatrix. Howevever, in-memory representations may not be feasible for very large data sets, especially on machines with limited memory. For example, the 1.3 million brain cell data set from 10X Genomics (Zheng et al. 2017) would require over 100 GB of RAM to hold as a matrix and around 30 GB as a dgCMatrix. This makes it challenging to explore the data on anything less than a HPC system.

The obvious solution is to use a file-backed matrix representation where the data are held on disk and subsets are retrieved into memory as requested. While a number of implementations of file-backed matrices are available (e.g., bigmemory, matter), we will be using the implementation from the HDF5Array package. This uses the popular HDF5 format as the underlying data store, which provides a measure of standardization and portability across systems. We demonstrate with a subset of 20,000 cells from the 1.3 million brain cell data set, as provided by the TENxBrainData package.

## class: SingleCellExperiment 
## dim: 27998 20000 
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(2): Ensembl Symbol
## colnames: NULL
## colData names(4): Barcode Sequence Library Mouse
## reducedDimNames(0):
## altExpNames(0):

Examination of the SingleCellExperiment object indicates that the count matrix is a HDF5Matrix. From a comparison of the memory usage, it is clear that this matrix object is simply a stub that points to the much larger HDF5 file that actually contains the data. This avoids the need for large RAM availability during analyses.

## <27998 x 20000> matrix of class HDF5Matrix and type "integer":
##              [,1]     [,2]     [,3]     [,4] ... [,19997] [,19998] [,19999]
##     [1,]        0        0        0        0   .        0        0        0
##     [2,]        0        0        0        0   .        0        0        0
##     [3,]        0        0        0        0   .        0        0        0
##     [4,]        0        0        0        0   .        0        0        0
##     [5,]        0        0        0        0   .        0        0        0
##      ...        .        .        .        .   .        .        .        .
## [27994,]        0        0        0        0   .        0        0        0
## [27995,]        0        0        0        1   .        0        2        0
## [27996,]        0        0        0        0   .        0        1        0
## [27997,]        0        0        0        0   .        0        0        0
## [27998,]        0        0        0        0   .        0        0        0
##          [,20000]
##     [1,]        0
##     [2,]        0
##     [3,]        0
##     [4,]        0
##     [5,]        0
##      ...        .
## [27994,]        0
## [27995,]        0
## [27996,]        0
## [27997,]        0
## [27998,]        0
## 2496 bytes
## [1] 76264332

Manipulation of the count matrix will generally result in the creation of a DelayedArray object from the DelayedArray package. This remembers the operations to be applied to the counts and stores them in the object, to be executed when the modified matrix values are realized for use in calculations. The use of delayed operations avoids the need to write the modified values to a new file at every operation, which would unnecessarily require time-consuming disk I/O.

## <27998 x 20000> matrix of class DelayedMatrix and type "double":
##              [,1]     [,2]     [,3] ... [,19999] [,20000]
##     [1,]        0        0        0   .        0        0
##     [2,]        0        0        0   .        0        0
##     [3,]        0        0        0   .        0        0
##     [4,]        0        0        0   .        0        0
##     [5,]        0        0        0   .        0        0
##      ...        .        .        .   .        .        .
## [27994,]        0        0        0   .        0        0
## [27995,]        0        0        0   .        0        0
## [27996,]        0        0        0   .        0        0
## [27997,]        0        0        0   .        0        0
## [27998,]        0        0        0   .        0        0

Many functions described in the previous workflows are capable of accepting HDF5Matrix objects. This is powered by the availability of common methods for all matrix representations (e.g., subsetting, combining, methods from DelayedMatrixStats) as well as representation-agnostic C++ code using beachmat (A. T. L. Lun, Pages, and Smith 2018). For example, we compute QC metrics below with the same calculateQCMetrics() function that we used in the other workflows.

## DataFrame with 20000 rows and 6 columns
##             sum  detected subsets_Mt_sum subsets_Mt_detected subsets_Mt_percent
##       <numeric> <numeric>      <numeric>           <numeric>          <numeric>
## 1          3060      1546            123                  10            4.01961
## 2          3500      1694            118                  11            3.37143
## 3          3092      1613             58                   9            1.87581
## 4          4420      2050            131                  10            2.96380
## 5          3771      1813            100                   8            2.65182
## ...         ...       ...            ...                 ...                ...
## 19996      4431      2050            127                   9           2.866170
## 19997      6988      2704             60                   9           0.858615
## 19998      8749      2988            305                  11           3.486113
## 19999      3842      1711            129                   8           3.357626
## 20000      1775       945             26                   6           1.464789
##           total
##       <numeric>
## 1          3060
## 2          3500
## 3          3092
## 4          4420
## 5          3771
## ...         ...
## 19996      4431
## 19997      6988
## 19998      8749
## 19999      3842
## 20000      1775

Needless to say, data access from file-backed representations is slower than that from in-memory representations. The time spent retrieving data from disk is an unavoidable cost of reducing memory usage. Whether this is tolerable depends on the application. One example usage pattern involves performing the heavy computing quickly with in-memory representations on HPC systems with plentiful memory, and then distributing file-backed counterparts to individual users for exploration and visualization on their personal machines.

Session Info

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] TENxBrainData_1.10.0        HDF5Array_1.18.0           
 [3] rhdf5_2.34.0                DelayedArray_0.16.0        
 [5] Matrix_1.2-18               BiocParallel_1.24.0        
 [7] BiocSingular_1.6.0          scater_1.18.0              
 [9] ggplot2_3.3.2               bluster_1.0.0              
[11] BiocNeighbors_1.8.0         scran_1.18.0               
[13] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0
[15] Biobase_2.50.0              GenomicRanges_1.42.0       
[17] GenomeInfoDb_1.26.0         IRanges_2.24.0             
[19] S4Vectors_0.28.0            BiocGenerics_0.36.0        
[21] MatrixGenerics_1.2.0        matrixStats_0.57.0         
[23] BiocStyle_2.18.0            rebook_1.0.0               

loaded via a namespace (and not attached):
 [1] ggbeeswarm_0.6.0              colorspace_1.4-1             
 [3] ellipsis_0.3.1                scuttle_1.0.0                
 [5] XVector_0.30.0                bit64_4.0.5                  
 [7] AnnotationDbi_1.52.0          interactiveDisplayBase_1.28.0
 [9] codetools_0.2-16              sparseMatrixStats_1.2.0      
[11] knitr_1.30                    dbplyr_1.4.4                 
[13] graph_1.68.0                  shiny_1.5.0                  
[15] BiocManager_1.30.10           compiler_4.0.3               
[17] httr_1.4.2                    dqrng_0.2.1                  
[19] fastmap_1.0.1                 assertthat_0.2.1             
[21] limma_3.46.0                  later_1.1.0.1                
[23] htmltools_0.5.0               tools_4.0.3                  
[25] rsvd_1.0.3                    igraph_1.2.6                 
[27] gtable_0.3.0                  glue_1.4.2                   
[29] GenomeInfoDbData_1.2.4        dplyr_1.0.2                  
[31] rappdirs_0.3.1                Rcpp_1.0.5                   
[33] vctrs_0.3.4                   rhdf5filters_1.2.0           
[35] ExperimentHub_1.16.0          DelayedMatrixStats_1.12.0    
[37] xfun_0.19                     stringr_1.4.0                
[39] ps_1.4.0                      beachmat_2.6.0               
[41] mime_0.9                      lifecycle_0.2.0              
[43] irlba_2.3.3                   statmod_1.4.35               
[45] XML_3.99-0.5                  AnnotationHub_2.22.0         
[47] edgeR_3.32.0                  zlibbioc_1.36.0              
[49] scales_1.1.1                  promises_1.1.1               
[51] yaml_2.2.1                    curl_4.3                     
[53] memoise_1.1.0                 gridExtra_2.3                
[55] stringi_1.5.3                 RSQLite_2.2.1                
[57] BiocVersion_3.12.0            rlang_0.4.8                  
[59] pkgconfig_2.0.3               bitops_1.0-6                 
[61] evaluate_0.14                 lattice_0.20-41              
[63] purrr_0.3.4                   Rhdf5lib_1.12.0              
[65] CodeDepends_0.6.5             bit_4.0.4                    
[67] processx_3.4.4                tidyselect_1.1.0             
[69] magrittr_1.5                  bookdown_0.21                
[71] R6_2.5.0                      snow_0.4-3                   
[73] generics_0.1.0                DBI_1.1.0                    
[75] pillar_1.4.6                  withr_2.3.0                  
[77] RCurl_1.98-1.2                tibble_3.0.4                 
[79] crayon_1.3.4                  BiocFileCache_1.14.0         
[81] rmarkdown_2.5                 viridis_0.5.1                
[83] locfit_1.5-9.4                grid_4.0.3                   
[85] blob_1.2.1                    callr_3.5.1                  
[87] digest_0.6.27                 xtable_1.8-4                 
[89] httpuv_1.5.4                  munsell_0.5.0                
[91] beeswarm_0.2.3                viridisLite_0.3.0            
[93] vipor_0.4.5                  

Bibliography

Lun, A. T. L., H. Pages, and M. L. Smith. 2018. “beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types.” PLoS Comput. Biol. 14 (5): e1006135.

Wang, X. 2012. “A Fast Exact K-Nearest Neighbors Algorithm for High Dimensional Search Using K-Means Clustering and Triangle Inequality.” Proc Int Jt Conf Neural Netw 43 (6): 2351–8.

Zheng, G. X., J. M. Terry, P. Belgrader, P. Ryvkin, Z. W. Bent, R. Wilson, S. B. Ziraldo, et al. 2017. “Massively parallel digital transcriptional profiling of single cells.” Nat Commun 8 (January): 14049.