# 1 Introduction

scuttle provides various low-level utilities for single-cell RNA-seq data analysis, focusing on basic normalization, quality control and transformation of expression values. These functions are intended for use by analysts, usually at the start of the analysis workflow; or by developers of other packages, to assemble more complex functions in downstream analysis steps.

To demonstrate, we will obtain the classic Zeisel dataset from the scRNAseq package. In this case, the dataset is provided as a SingleCellExperiment object. However, most scuttle functions can also be used with raw expression matrices or instances of the more general SummarizedExperiment class.

library(scRNAseq)
sce <- ZeiselBrainData()
sce
## class: SingleCellExperiment
## dim: 20006 3005
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(3005): 1772071015_C02 1772071017_G12 ... 1772066098_A12
##   1772058148_F03
## colData names(10): tissue group # ... level1class level2class
## reducedDimNames(0):
## altExpNames(2): ERCC repeat

Note: A more comprehensive description of the use of scuttle functions (along with other packages) in a scRNA-seq analysis workflow is available at https://osca.bioconductor.org.

# 2 Cell-level quality control

## 2.1 Computing metrics

The perCellQCMetrics() function computes a variety of basic cell-level metrics, including sum, total number of counts for the cell (i.e., the library size); detected, the number of features for the cell that have counts above the detection limit (default of zero); and subsets_X_percent, percentage of all counts that come from the feature control set named X.

library(scuttle)
is.mito <- grep("mt-", rownames(sce))
per.cell <- perCellQCMetrics(sce, subsets=list(Mito=is.mito))
summary(per.cell$sum) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 2574 8130 12913 14954 19284 63505 summary(per.cell$detected)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     785    2484    3656    3777    4929    8167
summary(per.cell$subsets_Mito_percent) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.000 3.992 6.653 7.956 10.290 56.955 It is often convenient to store this in the colData() of our SingleCellExperiment object for future reference. One can either do so manually: colData(sce) <- cbind(colData(sce), per.cell) … or just use the addPerCellQC() function: sce2 <- addPerCellQC(sce, subsets=list(Mito=is.mito)) colnames(colData(sce2)) ## [1] "tissue" "group #" ## [3] "total mRNA mol" "well" ## [5] "sex" "age" ## [7] "diameter" "cell_id" ## [9] "level1class" "level2class" ## [11] "sum" "detected" ## [13] "subsets_Mito_sum" "subsets_Mito_detected" ## [15] "subsets_Mito_percent" "altexps_ERCC_sum" ## [17] "altexps_ERCC_detected" "altexps_ERCC_percent" ## [19] "altexps_repeat_sum" "altexps_repeat_detected" ## [21] "altexps_repeat_percent" "total" ## [23] "sum" "detected" ## [25] "subsets_Mito_sum" "subsets_Mito_detected" ## [27] "subsets_Mito_percent" "altexps_ERCC_sum" ## [29] "altexps_ERCC_detected" "altexps_ERCC_percent" ## [31] "altexps_repeat_sum" "altexps_repeat_detected" ## [33] "altexps_repeat_percent" "total" ## 2.2 Filtering low-quality cells We identify low-quality cells by setting a threshold on each of these metrics using the isOutlier() function. This defines the threshold at a certain number of median absolute deviations (MADs) away from the median; values beyond this threshold are considered outliers and can be filtered out, assuming that they correspond to low-quality cells. Here, we define small outliers (using type="lower") for the log-total counts at 3 MADs from the median. keep.total <- !isOutlier(per.cell$sum, type="lower", log=TRUE)
filtered <- sce[,keep.total]

For typical scRNA-seq applications, quickPerCellQC() will conveniently detect outliers across several common metrics. This uses the total count, number of detected features and the percentage of counts in gene sets of diagnostic value (e.g., mitochondrial genes, spike-in transcripts) to identify which cells to discard and for what reason.

qc.stats <- quickPerCellQC(per.cell, percent_subsets="subsets_Mito_percent")
colSums(as.matrix(qc.stats))
##              low_lib_size            low_n_features high_subsets_Mito_percent
##                         0                         3                       128
##                       131
filtered <- sce[,!qc.stats$discard] The isOutlier approach adjusts to experiment-specific aspects of the data, e.g., sequencing depth, amount of spike-in RNA added, differences in total RNA content between cell types. In contrast, a fixed threshold would require manual adjustment to account for changes to the experimental protocol or system. We refer readers to the book for more details. # 3 Computing feature-level statistics Some basic feature-level statistics are computed by the perFeatureQCMetrics() function. This includes mean, the mean count of the gene/feature across all cells; detected, the percentage of cells with non-zero counts for each gene; subsets_Y_ratio, ratio of mean counts between the cell control set named Y and all cells. # Pretending that the first 10 cells are empty wells, for demonstration. per.feat <- perFeatureQCMetrics(sce, subsets=list(Empty=1:10)) summary(per.feat$mean)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##   0.0007   0.0097   0.1338   0.7475   0.5763 732.1524
summary(per.feat$detected) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.03328 0.76539 9.01830 18.87800 31.24792 99.96672 summary(per.feat$subsets_Empty_ratio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.000   0.000   0.601   1.872   2.016 300.500

A more refined calculation of the average is provided by the calculateAverage() function, which adjusts the counts by the relative library size (or size factor) prior to taking the mean.

ave <- calculateAverage(sce)
summary(ave)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##   0.0002   0.0109   0.1443   0.7475   0.5674 850.6880

These metrics tend to be more useful for informing the analyst about the overall behavior of the experiment, rather than being explicitly used to filter out genes. For example, one would hope that the most abundant genes are the “usual suspects”, e.g., ribosomal proteins, actin, histones.

# 4 Utilities for normalizaton

## 4.1 Computing size factors

scuttle provides a number of utilities for global scaling normalization, where the counts for each cell are divided by a cell-specific “size factor” to adjust for uninteresting differences in sequencing depth and capture efficiency. By default, the size factor is automatically computed from the library size of each cell using the librarySizeFactors() function. This calculation simply involves scaling the library sizes so that they have a mean of 1 across all cells.

summary(librarySizeFactors(sce))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.1721  0.5437  0.8635  1.0000  1.2895  4.2466

scuttle also implements two other basic methods of computing size factors, either from the geometric mean or using a DESeq-esque approach based on the median. These have their own strengths and weaknesses that are discussed in the relevant documentation pages.

summary(geometricSizeFactors(sce))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.8304  0.9077  0.9761  1.0000  1.0653  1.5320
summary(medianSizeFactors(sce))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##      NA      NA      NA     NaN      NA      NA    3005

Note that if size factors are explicitly provided in the SingleCellExperiment, they will be used by the downstream normalization functions. Size factors can be manually added by sizeFactors()<- or with the functions like:

sce <- computeLibraryFactors(sce)
summary(sizeFactors(sce))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.1721  0.5437  0.8635  1.0000  1.2895  4.2466

## 4.2 Normalizing expression values

The most commonly used function is logNormCounts(), which calculates log2-transformed normalized expression values by dividing each count by the size factor for the cell, adding a constant pseudo-count and log-transforming. The resulting values can be roughly interpreted on the same scale as log-transformed counts and are stored in "logcounts".

sce <- logNormCounts(sce)
assayNames(sce)
## [1] "counts"    "logcounts"

Of course, users can construct any arbitrary matrix of the same dimensions as the count matrix and store it as an assay. Here, we use the normalizeCounts() function to perform some custom normalization with random size factors.

assay(sce, "normed") <- normalizeCounts(sce,
size.factors=runif(ncol(sce)), pseudo.count=1.5)

scuttle can also calculate counts-per-million using the aptly-named calculateCPM() function. The output is most appropriately stored as an assay named "cpm" in the assays of the SingleCellExperiment object. Related functions include calculateTPM() and calculateFPKM(), which do pretty much as advertised.

assay(sce, "cpm") <- calculateCPM(sce)

# 5 Aggregation across groups or clusters

The aggregateAcrossCells() function is helpful for aggregating expression values across groups of cells. For example, we might wish to sum together counts for all cells in the same cluster, possibly to use as a summary statistic for downstream analyses (e.g., for differential expression with edgeR). This will also perform the courtesy of sensibly aggregating the column metadata for downstream use.

agg.sce <- aggregateAcrossCells(sce, ids=sce$level1class) head(assay(agg.sce)) ## astrocytes_ependymal endothelial-mural interneurons microglia ## Tspan12 74 138 186 14 ## Tshz1 96 105 335 43 ## Fnbp1l 89 169 689 33 ## Adamts15 2 23 64 0 ## Cldn12 50 27 160 5 ## Rxfp1 0 3 74 0 ## oligodendrocytes pyramidal CA1 pyramidal SS ## Tspan12 99 269 58 ## Tshz1 297 65 332 ## Fnbp1l 239 859 1056 ## Adamts15 19 11 15 ## Cldn12 214 411 273 ## Rxfp1 13 48 30 colData(agg.sce)[,c("ids", "ncells")] ## DataFrame with 7 rows and 2 columns ## ids ncells ## <character> <integer> ## astrocytes_ependymal astrocytes_ependymal 224 ## endothelial-mural endothelial-mural 235 ## interneurons interneurons 290 ## microglia microglia 98 ## oligodendrocytes oligodendrocytes 820 ## pyramidal CA1 pyramidal CA1 939 ## pyramidal SS pyramidal SS 399 It is similarly possible to sum across multiple factors, as shown below for the cell type and the tissue of origin. This yields one column per combination of cell type and tissue, which allows us to conveniently perform downstream analyses with both factors. agg.sce <- aggregateAcrossCells(sce, ids=colData(sce)[,c("level1class", "tissue")]) head(assay(agg.sce)) ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] ## Tspan12 23 51 10 128 52 134 0 14 10 89 269 58 ## Tshz1 38 58 10 95 90 245 1 42 50 247 65 332 ## Fnbp1l 23 66 9 160 227 462 3 30 18 221 859 1056 ## Adamts15 0 2 2 21 15 49 0 0 0 19 11 15 ## Cldn12 5 45 0 27 61 99 0 5 22 192 411 273 ## Rxfp1 0 0 0 3 3 71 0 0 1 12 48 30 colData(agg.sce)[,c("level1class", "tissue", "ncells")] ## DataFrame with 12 rows and 3 columns ## level1class tissue ncells ## <character> <character> <integer> ## 1 astrocytes_ependymal ca1hippocampus 81 ## 2 astrocytes_ependymal sscortex 143 ## 3 endothelial-mural ca1hippocampus 33 ## 4 endothelial-mural sscortex 202 ## 5 interneurons ca1hippocampus 126 ## ... ... ... ... ## 8 microglia sscortex 84 ## 9 oligodendrocytes ca1hippocampus 121 ## 10 oligodendrocytes sscortex 699 ## 11 pyramidal CA1 ca1hippocampus 939 ## 12 pyramidal SS sscortex 399 Summation across rows may occasionally be useful for obtaining a measure of the activity of a gene set, e.g., in a pathway. Given a list of gene sets, we can use the sumCountsAcrossFeatures() function to aggregate expression values across features. This is usually best done by averaging the log-expression values as shown below. agg.feat <- sumCountsAcrossFeatures(sce, ids=list(GeneSet1=1:10, GeneSet2=11:50, GeneSet3=1:100), average=TRUE, exprs_values="logcounts") agg.feat[,1:10] ## 1772071015_C02 1772071017_G12 1772071017_A05 1772071014_B06 ## GeneSet1 0.7717764 0.2177634 0.7577241 0.5380195 ## GeneSet2 1.0602489 0.9955551 1.2558629 1.1482845 ## GeneSet3 1.1410273 1.0018648 1.2727650 1.1995992 ## 1772067065_H06 1772071017_E02 1772067065_B07 1772067060_B09 ## GeneSet1 0.4987205 0.3528216 0.5262005 0.0749138 ## GeneSet2 1.2010854 1.2855797 1.2889492 1.1806065 ## GeneSet3 1.1244981 1.1567887 1.1057770 1.1353704 ## 1772071014_E04 1772071015_D04 ## GeneSet1 0.7267285 0.7924899 ## GeneSet2 1.1728326 1.2871506 ## GeneSet3 1.0098231 1.1846439 Similar functions are available to compute the number or proportion of cells with detectable expression in each group. To wit: agg.n <- numDetectedAcrossCells(sce, ids=colData(sce)[,c("level1class", "tissue")]) head(assay(agg.n)) ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] ## Tspan12 16 21 6 56 36 59 0 8 7 50 177 36 ## Tshz1 17 22 7 45 46 78 1 12 26 136 48 135 ## Fnbp1l 13 27 7 63 80 123 2 18 9 106 432 287 ## Adamts15 0 2 2 10 6 29 0 0 0 10 9 9 ## Cldn12 4 20 0 12 37 60 0 4 17 103 271 141 ## Rxfp1 0 0 0 2 2 35 0 0 1 8 37 19 # 6 Miscellaneous functions ## 6.1 Reading in sparse matrices Normally, sparse matrices are provided in the MatrixMarket (.mtx) format, where they can be read efficiently into memory using the readMM() function from the Matrix package. However, for some reason, it has been popular to save these files in dense form as tab- or comma-separate files. This is an inefficient and inconvenient approach, requiring users to read in the entire dataset in dense form with functions like read.delim() or read.csv() (and hope that they have enough memory on their machines to do so). In such cases, scuttle provides the readSparseCounts() function to overcome excessive memory requirements. This reads in the dataset in a chunkwise manner, progressively coercing each chunk into a sparse format and combining them into a single sparse matrix to be returned to the user. In this manner, we never attempt to load the entire dataset in dense format to memory. # Mocking up a dataset to demonstrate: outfile <- tempfile() write.table(counts(sce)[1:100,], file=outfile, sep="\t", quote=FALSE) # Reading it in as a sparse matrix: output <- readSparseCounts(outfile) class(output) ## [1] "dgCMatrix" ## attr(,"package") ## [1] "Matrix" ## 6.2 Making gene symbols unique When publishing a dataset, the best practice is to provide gene annotations in the form of a stable identifier like those from Ensembl or Entrez. This provides an unambiguous reference to the identity of the gene, avoiding difficulties with synonynms and making it easier to cross-reference. However, when working with a dataset, it is more convenient to use the gene symbols as these are easier to remember. Thus, a common procedure is to replace the stable identifiers in the row names with gene symbols. However, this is not straightforward as the gene symbols may not exist (NAs) or may be duplicated. To assist this process, scuttle provides the uniquifyFeatureNames() function that emit gene symbols if they are unique; append the identifier, if they are duplicated; and replace the symbol with the identifier if the former is missing. # Original row names are Ensembl IDs. sce.ens <- ZeiselBrainData(ensembl=TRUE) head(rownames(sce.ens))  ## [1] "ENSMUSG00000029669" "ENSMUSG00000046982" "ENSMUSG00000039735" ## [4] "ENSMUSG00000033453" "ENSMUSG00000046798" "ENSMUSG00000034009" # Replacing with guaranteed unique and non-missing symbols: rownames(sce.ens) <- uniquifyFeatureNames( rownames(sce.ens), rowData(sce.ens)$originalName
)
head(rownames(sce.ens)) 
## [1] "Tspan12"  "Tshz1"    "Fnbp1l"   "Adamts15" "Cldn12"   "Rxfp1"

## 6.3 Creating a data.frame

The makePerCellDF() and makePerFeatureDF() functions create data.frames from the SingleCellExperiment object. In the makePerCellDF() case, each row of the output data.frame corresponds to a cell and each column represents the expression of a specified feature across cells, a field of the column metadata, or reduced dimensions (if any are available).

out <- makePerCellDF(sce, features="Tspan12")
colnames(out)
##  [1] "Tspan12"                 "tissue"
##  [3] "group.."                 "total.mRNA.mol"
##  [5] "well"                    "sex"
##  [7] "age"                     "diameter"
##  [9] "cell_id"                 "level1class"
## [11] "level2class"             "sum"
## [13] "detected"                "subsets_Mito_sum"
## [15] "subsets_Mito_detected"   "subsets_Mito_percent"
## [17] "altexps_ERCC_sum"        "altexps_ERCC_detected"
## [19] "altexps_ERCC_percent"    "altexps_repeat_sum"
## [21] "altexps_repeat_detected" "altexps_repeat_percent"
## [23] "total"                   "sizeFactor"

In the makePerFeatureDF() case, each row of the output data.frame corresponds to a gene and each column represents the expression profile of a specified cell or the values of a row metadata field.

out2 <- makePerFeatureDF(sce, cells=c("1772063062_D05",
"1772063061_D01", "1772060240_F02", "1772062114_F05"))
colnames(out2)
## [1] "1772063062_D05" "1772063061_D01" "1772060240_F02" "1772062114_F05"
## [5] "featureType"

The aim is to enable the data in a SingleCellExperiment to be easilybe used in functions like model.matrix() or in ggplot(), without requiring users to manually extract the desired fields from the SingleCellExperiment to construct their own data.frame.

# Session information

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets
## [8] methods   base
##
## other attached packages:
##  [1] ensembldb_2.14.0            AnnotationFilter_1.14.0
##  [3] GenomicFeatures_1.42.1      AnnotationDbi_1.52.0
##  [5] scuttle_1.0.4               scRNAseq_2.4.0
##  [7] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0
##  [9] Biobase_2.50.0              GenomicRanges_1.42.0
## [11] GenomeInfoDb_1.26.2         IRanges_2.24.1
## [13] S4Vectors_0.28.1            BiocGenerics_0.36.0
## [15] MatrixGenerics_1.2.0        matrixStats_0.57.0
## [17] BiocStyle_2.18.1
##
## loaded via a namespace (and not attached):
##  [1] httr_1.4.2                    bit64_4.0.5
##  [3] AnnotationHub_2.22.0          DelayedMatrixStats_1.12.1
##  [5] shiny_1.5.0                   assertthat_0.2.1
##  [7] askpass_1.1                   interactiveDisplayBase_1.28.0
##  [9] BiocManager_1.30.10           BiocFileCache_1.14.0
## [11] blob_1.2.1                    Rsamtools_2.6.0
## [13] GenomeInfoDbData_1.2.4        yaml_2.2.1
## [15] progress_1.2.2                BiocVersion_3.12.0
## [17] pillar_1.4.7                  RSQLite_2.2.1
## [19] lattice_0.20-41               beachmat_2.6.3
## [21] glue_1.4.2                    digest_0.6.27
## [23] promises_1.1.1                XVector_0.30.0
## [25] htmltools_0.5.0               httpuv_1.5.4
## [27] Matrix_1.2-18                 XML_3.99-0.5
## [29] pkgconfig_2.0.3               biomaRt_2.46.0
## [31] bookdown_0.21                 zlibbioc_1.36.0
## [33] purrr_0.3.4                   xtable_1.8-4
## [35] later_1.1.0.1                 BiocParallel_1.24.1
## [37] openssl_1.4.3                 tibble_3.0.4
## [39] generics_0.1.0                ellipsis_0.3.1
## [41] withr_2.3.0                   lazyeval_0.2.2
## [43] magrittr_2.0.1                crayon_1.3.4
## [45] mime_0.9                      memoise_1.1.0
## [47] evaluate_0.14                 xml2_1.3.2
## [49] prettyunits_1.1.1             tools_4.0.3
## [51] hms_0.5.3                     lifecycle_0.2.0
## [53] stringr_1.4.0                 DelayedArray_0.16.0
## [55] Biostrings_2.58.0             compiler_4.0.3
## [57] rlang_0.4.9                   grid_4.0.3
## [59] RCurl_1.98-1.2                rappdirs_0.3.1
## [61] bitops_1.0-6                  rmarkdown_2.6
## [63] ExperimentHub_1.16.0          DBI_1.1.0
## [65] curl_4.3                      R6_2.5.0
## [67] GenomicAlignments_1.26.0      rtracklayer_1.50.0
## [69] knitr_1.30                    dplyr_1.0.2
## [71] fastmap_1.0.1                 bit_4.0.4
## [73] ProtGenerics_1.22.0           stringi_1.5.3
## [75] Rcpp_1.0.5                    vctrs_0.3.6
## [77] sparseMatrixStats_1.2.0       dbplyr_2.0.0
## [79] tidyselect_1.1.0              xfun_0.19