1 Introduction

Single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of gene expression heterogeneity within complex biological systems. As scRNA-seq technology becomes increasingly accessible and cost-effective, experiments are generating data from larger and larger numbers of cells. However, the analysis of large scRNA-seq data remains a challenge, particularly in terms of scalability. While numerous analysis tools have been developed to tackle the complexities of scRNA-seq data, their scalability is often limited, posing a major bottleneck in the analysis of large-scale experiments. In particular, the R package Seurat is one of the most widely used tools for exploring and analyzing scRNA-seq data, but its scalability is often limited by available memory.

To address this issue, we introduce a new R package called “SCArray.sat” that extends the Seurat classes and functions to support Genomic Data Structure (GDS) files as a DelayedArray backend for data representation. GDS files store multiple dense and sparse array-based data sets in a hierarchical structure. This package defines a new class, called “SCArrayAssay” (derived from the Seurat class “Assay”), which wraps raw counts, normalized expressions, and scaled data matrices based on GDS-specific DelayedMatrix. It is designed to integrate seamlessly with the Seurat package to provide common data analysis in a workflow, with optimized algorithms for GDS data files.

~

Overview of the SCArray framework.

Figure 1: Overview of the SCArray framework

~

The SeuratObject package defines the Assay class with three members/slots counts, data and scale.data storing raw counts, normalized expressions and scaled data matrix respectively. However, counts and data should be either a dense matrix or a sparse matrix defined in the Matrix package. The scalability of the sparse matrix is limited by the number of non-zero values (should be < 2^31), since the Matrix package uses 32-bit indices internally. scale.data in the Assay class is defined as a dense matrix, so it is also limited by the available memory. The new class SCArrayAssay is derived from Assay, with three additional slots counts2, data2 and scale.data2 replacing the old ones. These new slots can be DelayedMatrix wrapping an on-disk data matrix, without loading the data in memory.

The SCArray.sat package takes advantage of the S3 object-oriented methods defined in the SeuratObject and Seurat packages to reduce code redundancy, by implementing the functions specific to the classes SCArrayAssay and SC_GDSMatrix (GDS-specific DelayedMatrix). Table 1 shows a list of key S3 methods for data analysis. For example, the function NormalizeData.SC_GDSMatrix() will be called when a SC_GDSMatrix object is passed to the S3 generic NormalizeData(), while NormalizeData.Assay() and NormalizeData.Seurat() are unchanged. In addition, the SCArray and SCArray.sat packages implement the optimized algorithms for the calculations, by reducing the on-disk data access and taking the GDS data structure into account.

~

Table 1: Key S3 methods implemented in the SCArray.sat package.

Methods Description Note
GetAssayData.SCArrayAssay() Accessor function for ‘SCArrayAssay’ objects
SetAssayData.SCArrayAssay Setter functions for ‘Assay’ objects
NormalizeData.SC_GDSMatrix() Normalize raw count data Store a DelayedMatrix
ScaleData.SC_GDSMatrix() Scale and center the normalized data
FindVariableFeatures.SC_GDSMatrix() Identifies features
RunPCA.SC_GDSMatrix() Run a PCA dimensionality reduction

SC_GDSMatrix: GDS- and single-cell- specific DelayedMatrix.

~

~

2 Installation

To install this package, start R and enter:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("SCArray.sat")

3 Examples

3.1 Small Datasets

# Load the packages
suppressPackageStartupMessages({
    library(Seurat)
    library(SCArray)
    library(SCArray.sat)
})

# Input GDS file with raw counts
fn <- system.file("extdata", "example.gds", package="SCArray")

# show the file contents
(f <- scOpen(fn))
## Object of class "SCArrayFileClass"
## File: /home/biocbuild/bbs-3.21-bioc/R/site-library/SCArray/extdata/example.gds (504.7K)
## +    [  ] *
## |--+ feature.id   { Str8 1000 LZMA_ra(59.0%), 3.7K }
## |--+ sample.id   { Str8 850 LZMA_ra(13.2%), 1.8K }
## |--+ counts   { SparseReal32 1000x850 LZMA_ra(12.4%), 495.3K }
## |--+ feature.data   [  ]
## |--+ sample.data   [  ]
## |  |--+ Cell_ID   { Str8 850 LZMA_ra(13.2%), 1.8K }
## |  |--+ Cell_type   { Str8 850 LZMA_ra(2.98%), 165B }
## |  \--+ Timepoint   { Str8 850 LZMA_ra(3.73%), 229B }
## \--+ meta.data   [  ]
scClose(f)    # close the file


# Create a Seurat object from the GDS file
d <- scNewSeuratGDS(fn)
## Input: /home/biocbuild/bbs-3.21-bioc/R/site-library/SCArray/extdata/example.gds
##     counts: 1000 x 850
class(GetAssay(d))    # SCArrayAssay, derived from Assay
## [1] "SCArrayAssay"
## attr(,"package")
## [1] "SCArray.sat"
d <- NormalizeData(d)
## Performing log-normalization
d <- FindVariableFeatures(d, nfeatures=500)
## Calculating gene variances
## Calculating feature variances of standardized and clipped values
d <- ScaleData(d)
## Centering and scaling data matrix (SC_GDSMatrix [500x850])

Let’s check the internal data matrices,

# get the file name for the on-disk object
scGetFiles(d)
## [1] "/home/biocbuild/bbs-3.21-bioc/R/site-library/SCArray/extdata/example.gds"
# raw counts
m <- GetAssayData(d, slot="counts")
scGetFiles(m)    # the file name storing raw count data
## [1] "/home/biocbuild/bbs-3.21-bioc/R/site-library/SCArray/extdata/example.gds"
m
## <1000 x 850> sparse SC_GDSMatrix object of type "double":
##              1772122_301_C02 1772122_180_E05 ... 1772122_180_B06
##       MRPL20        2.133695        1.423133   .        0.000000
##         GNB1        3.342935        2.346631   .        0.000000
##        RPL22        2.133695        2.183267   .        2.982560
##        PARK7        1.247608        2.487018   .        1.980463
##         ENO1        3.037644        3.431596   .        3.129447
##          ...               .               .   .               .
##         SSR4        0.000000        2.346631   .        2.810320
##        RPL10        3.342935        1.987902   .        1.416593
## SLC25A6-loc1        2.391330        2.183267   .        2.338835
##       RPS4Y1        0.000000        2.183267   .        1.980463
##         CD24        3.821575        1.744869   .        0.000000
##              1772122_180_D09
##       MRPL20        1.757196
##         GNB1        0.000000
##        RPL22        2.733620
##        PARK7        1.757196
##         ENO1        2.360130
##          ...               .
##         SSR4        1.223211
##        RPL10        2.103432
## SLC25A6-loc1        1.223211
##       RPS4Y1        2.360130
##         CD24        1.757196
# normalized expression
# the normalized data does not save in neither the file nor the memory
GetAssayData(d, slot="data")
## <1000 x 850> sparse SC_GDSMatrix object of type "double":
##              1772122_301_C02 1772122_180_E05 ... 1772122_180_B06
##       MRPL20        2.133695        1.423133   .        0.000000
##         GNB1        3.342935        2.346631   .        0.000000
##        RPL22        2.133695        2.183267   .        2.982560
##        PARK7        1.247608        2.487018   .        1.980463
##         ENO1        3.037644        3.431596   .        3.129447
##          ...               .               .   .               .
##         SSR4        0.000000        2.346631   .        2.810320
##        RPL10        3.342935        1.987902   .        1.416593
## SLC25A6-loc1        2.391330        2.183267   .        2.338835
##       RPS4Y1        0.000000        2.183267   .        1.980463
##         CD24        3.821575        1.744869   .        0.000000
##              1772122_180_D09
##       MRPL20        1.757196
##         GNB1        0.000000
##        RPL22        2.733620
##        PARK7        1.757196
##         ENO1        2.360130
##          ...               .
##         SSR4        1.223211
##        RPL10        2.103432
## SLC25A6-loc1        1.223211
##       RPS4Y1        2.360130
##         CD24        1.757196
# scaled and centered data matrix
# in this example, the scaled data does not save in neither the file nor the memory
GetAssayData(d, slot="scale.data")
## <1000 x 850> sparse SC_GDSMatrix object of type "double":
##              1772122_301_C02 1772122_180_E05 ... 1772122_180_B06
##       MRPL20        2.133695        1.423133   .        0.000000
##         GNB1        3.342935        2.346631   .        0.000000
##        RPL22        2.133695        2.183267   .        2.982560
##        PARK7        1.247608        2.487018   .        1.980463
##         ENO1        3.037644        3.431596   .        3.129447
##          ...               .               .   .               .
##         SSR4        0.000000        2.346631   .        2.810320
##        RPL10        3.342935        1.987902   .        1.416593
## SLC25A6-loc1        2.391330        2.183267   .        2.338835
##       RPS4Y1        0.000000        2.183267   .        1.980463
##         CD24        3.821575        1.744869   .        0.000000
##              1772122_180_D09
##       MRPL20        1.757196
##         GNB1        0.000000
##        RPL22        2.733620
##        PARK7        1.757196
##         ENO1        2.360130
##          ...               .
##         SSR4        1.223211
##        RPL10        2.103432
## SLC25A6-loc1        1.223211
##       RPS4Y1        2.360130
##         CD24        1.757196

Perform PCA and UMAP:

d <- RunPCA(d, ndims.print=1:2)
## Run PCA on the scaled data matrix ...
## Calculating the covariance matrix [500x500] ...
## PC_ 1 
## Positive:  NPM1, RPLP1, RPL35, HNRNPA1P10, RPS20, RPS6, RPS19, RPS3, HMGB2, RPL32 
##     RPL31P11-p1, HSPE1-MOB4, RPS23, HMGN2, RPS10-NUDT3, SNRPE, RPS24, CKS1B, H2AFZ, RPS14 
##     RPA3, RPL18A, RPS18-loc6, EEF1B2, SHFM1, TMA7, KIAA0101, RPS3A, RPL37A, SNRPG 
## Negative:  DCX, STMN2, MAP1B, NCAM1, GAP43, RTN1, BASP1, KIF5C, DPYSL3, DCC 
##     MIAT, TTC3, MALAT1, CRMP1, SOX11, TUBB3, GPM6A, TUBA1A, WSB1, TUBB2B 
##     RTN4, NNAT, SCG2, TUBB2A, MAP2, SEZ6L2, ONECUT2, MAP6, ENO2, CNTN2 
## PC_ 2 
## Positive:  TUBA1B, NUCKS1, HNRNPA2B1, MARCKSL1, MARCKS, HNRNPD, NES, HNRNPA1, KHDRBS1, LOC644936-p1 
##     CKB, SET, MIR1244-3-loc4, TUBA1C, SNORD38A, DEK, SOX11, SFPQ, HNRNPU, IGF2BP1 
##     CBX5, NASP, RPS17-loc1, SMC4, RPS17-loc2, RPL41, CENPF, HMGB1, HDAC2, RRM1 
## Negative:  RPL13AP5, RPL31P11-p1, RPS14, RPS3A, TTR, RPL37A, RPL18A, PMCH, RPLP1, RPS19 
##     RPL23A, RPS3, OLFM3, ANXA2, RPL32, RPS13, SULF1, CDO1, TRPM3, COL1A1 
##     RPL18, MTRNR2L8, RNA5-8S5-loc2, MIR611, MALAT1, HTR2C, RNA5-8S5-loc1, RPS25, HES1, LDHA
DimPlot(d, reduction="pca")