Chapter 19 Single-nuclei RNA-seq processing
Single-nuclei RNA-seq (snRNA-seq) provides another strategy for performing single-cell transcriptomics where individual nuclei instead of cells are captured and sequenced. The major advantage of snRNA-seq over scRNA-seq is that the former does not require the preservation of cellular integrity during sample preparation, especially dissociation. We only need to extract nuclei in an intact state, meaning that snRNA-seq can be applied to cell types, tissues and samples that are not amenable to dissociation and later processing. The cost of this flexibility is the loss of transcripts that are primarily located in the cytoplasm, potentially limiting the availability of biological signal for genes with little nuclear localization.
The computational analysis of snRNA-seq data is very similar to that of scRNA-seq data. We have a matrix of (UMI) counts for genes by cells that requires quality control, normalization and so on. (Technically, the columns correspond to nuclei but we will use these two terms interchangeably in this chapter.) In fact, the biggest difference in processing occurs in the construction of the count matrix itself, where intronic regions must be included in the annotation for each gene to account for the increased abundance of unspliced transcripts. The rest of the analysis only requires a few minor adjustments to account for the loss of cytoplasmic transcripts. We demonstrate using a dataset from Wu et al. (2019) involving snRNA-seq on healthy and fibrotic mouse kidneys.
## class: SingleCellExperiment
## dim: 18249 8231
## metadata(0):
## assays(1): counts
## rownames(18249): mt-Cytb mt-Nd6 ... Gm44613 Gm38304
## rowData names(0):
## colnames(8231): sNuc-10x_AAACCTGAGTCCGGTC sNuc-10x_AAACCTGCACAGACAG ...
##   UUO_TTGCCGTCACAAGACG UUO_TTTGTCATCTGCTGTC
## colData names(2): Technology Status
## reducedDimNames(0):
## altExpNames(0):
19.2 Quality control for stripped nuclei
The loss of the cytoplasm means that the stripped nuclei should not contain any mitochondrial transcripts. The mitochondrial proportion therefore becomes an excellent QC metric for the efficacy of the stripping process. Unlike in scRNA-seq, there is no need to worry about variations in mitochondrial content due to genuine biology: the presence of any mitochondrial counts in a library indicates that the removal of the cytoplasm was not complete, possibly introducing irrelevant heterogeneity in downstream analyses.
##    Mode   FALSE    TRUE
## logical    2264    5967
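As a minimal sketch of the arithmetic behind this metric, the following base R code computes the mitochondrial percentage for each library in a small made-up count matrix. The counts, the barcode names and the `GeneA`/`GeneB` labels are hypothetical; only the `mt-` prefix convention for mouse mitochondrial genes is taken from the dataset above. On real data, a dedicated QC function would normally be applied to the full `SingleCellExperiment` instead.

```r
# Toy genes-by-nuclei count matrix; all values are invented for illustration.
counts <- matrix(
    c(0,  3, 0,    # mt-Cytb
      0,  1, 0,    # mt-Nd6
      5,  6, 2,    # GeneA (hypothetical)
      5, 10, 8),   # GeneB (hypothetical)
    nrow=4, byrow=TRUE,
    dimnames=list(
        c("mt-Cytb", "mt-Nd6", "GeneA", "GeneB"),
        c("nuc1", "nuc2", "nuc3")
    )
)

# Mitochondrial genes are identified by their "mt-" prefix.
is.mito <- grepl("^mt-", rownames(counts))

# Percentage of each library's counts assigned to mitochondrial genes.
mito.percent <- colSums(counts[is.mito,,drop=FALSE]) / colSums(counts) * 100

# Tabulate libraries with (FALSE) and without (TRUE) mitochondrial counts.
zero.mito <- table(mito.percent == 0)
```

Here only `nuc2` carries mitochondrial counts, so it would be the one candidate for an incompletely stripped nucleus.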
We apply a simple filter to remove libraries corresponding to incompletely stripped nuclei. The outlier-based approach described in Section 6 can be used here, but some caution is required in low-coverage experiments where a majority of libraries have zero mitochondrial counts. In such cases, the MAD may also be zero, so that the threshold collapses onto the median and all libraries with very low but non-zero mitochondrial counts are removed. This is typically too conservative as such transcripts may be present due to sporadic ambient contamination rather than incomplete stripping.
##            low_lib_size          low_n_features high_subsets_Mt_percent
##                       0                       0                    2264
##                 discard
##                    2264
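The collapse of the threshold can be reproduced with a few lines of base R. This is only an illustration on hypothetical percentages, assuming the conventional threshold of three MADs above the median; it is not the code that produced the output above.

```r
# Hypothetical mitochondrial percentages: most libraries have none at all.
mito.percent <- c(0, 0, 0, 0, 0, 0, 0, 0.05, 0.1, 5)

med <- median(mito.percent)
# The MAD is the (scaled) median absolute deviation from the median.
# More than half of the values equal the median, so it is exactly zero.
dev <- mad(mito.percent)

# Upper threshold at three MADs above the median, which is also zero here.
threshold <- med + 3 * dev

# Every library with a non-zero percentage is flagged, however small.
discard <- mito.percent > threshold
```

Both the median and the MAD are zero, so the libraries at 0.05% and 0.1% are discarded along with the one at 5%.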
Instead, we enforce a minimum difference between the threshold and the median in isOutlier() (Figure 19.1). We arbitrarily choose +0.5% here, which takes precedence over the outlier-based threshold if the latter is too low. In this manner, we avoid discarding libraries with a very modest amount of contamination; the same code will automatically fall back to the outlier-based threshold in datasets where the stripping was systematically less effective.
##            low_lib_size          low_n_features high_subsets_Mt_percent
##                       0                       0                      42
##                 discard
##                      42
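The logic of the minimum difference can be sketched in base R as follows. The helper `upperThreshold()` and both input vectors are hypothetical and only mimic the idea described above, assuming a threshold of three MADs above the median; they are not the isOutlier() implementation.

```r
# Hypothetical helper: upper threshold with a floor on its distance from the median.
upperThreshold <- function(x, nmads=3, min.diff=0.5) {
    median(x) + max(nmads * mad(x), min.diff)
}

# Well-stripped dataset: the MAD is zero, so the +0.5% floor takes precedence.
good <- c(0, 0, 0, 0, 0, 0, 0, 0.05, 0.1, 5)
good.threshold <- upperThreshold(good)
good.discard <- good > good.threshold

# Poorly stripped dataset: the MAD-based gap exceeds 0.5, so it is used as-is.
poor <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 40)
poor.threshold <- upperThreshold(poor)
poor.discard <- poor > poor.threshold
```

In the first vector only the library at 5% is discarded, while those at 0.05% and 0.1% are retained. In the second, the threshold is determined entirely by the median and MAD, which is the fallback behaviour described above.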
19.4 Tricks with ambient contamination
The expected absence of genuine mitochondrial expression can also be exploited to estimate the level of ambient contamination (Section 14.4). We demonstrate on mouse brain snRNA-seq data from 10X Genomics (Zheng et al. 2017), using the raw count matrix prior to any filtering for nuclei-containing barcodes.
library(DropletTestFiles)
raw.path <- getTestFile("tenx-2.0.1-nuclei_900/1.0.0/raw.tar.gz")
out.path <- file.path(tempdir(), "nuclei")
untar(raw.path, exdir=out.path)

library(DropletUtils)
fname <- file.path(out.path, "raw_gene_bc_matrices/mm10")
sce.brain <- read10xCounts(fname, col.names=TRUE)
sce.brain
## class: SingleCellExperiment
## dim: 27998 737280
## metadata(1): Samples
## assays(1): counts
## rownames(27998): ENSMUSG00000051951 ENSMUSG00000089699 ...
##   ENSMUSG00000096730 ENSMUSG00000095742
## rowData names(2): ID Symbol
## colnames(737280): AAACCTGAGAAACCAT-1 AAACCTGAGAAACCGC-1 ...
##   TTTGTCATCTTTAGTC-1 TTTGTCATCTTTCCTC-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## altExpNames(0):
We call non-empty droplets using emptyDrops() as previously described (Section 15.2).
##    Mode   FALSE    TRUE    NA's
## logical    2324    1712  733244
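The large number of `NA` entries in this summary matters when the calls are used to subset the count matrix. The base R sketch below uses invented FDR values and a made-up matrix to show why the comparison is wrapped in `which()` before indexing. It assumes, as we understand the emptyDrops() output, that `NA` marks barcodes that were never tested because their total counts were too low.

```r
# Hypothetical FDR values; NA marks barcodes that were never tested.
fdr <- c(0.0001, NA, 0.5, NA, 0.0005, NA)

# Toy genes-by-barcodes count matrix with invented values.
counts <- matrix(1:12, nrow=2,
    dimnames=list(c("GeneA", "GeneB"), paste0("bc", 1:6)))

# The comparison propagates missing values: TRUE NA FALSE NA TRUE NA.
is.nucleus <- fdr <= 0.001

# which() keeps only the positions that are TRUE, silently dropping NA.
keep <- which(is.nucleus)

# Subset to the barcodes called as nuclei.
nuclei.counts <- counts[, keep, drop=FALSE]
```

Only `bc1` and `bc5` are retained; the untested barcodes are excluded along with the one that failed the threshold.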
If our libraries are of high quality, we can assume that any mitochondrial “expression” is due to contamination from the ambient solution. We then use the controlAmbience() function to estimate the proportion of ambient contamination for each gene, allowing us to mark potentially problematic genes in the DE results (Figure 19.4). In fact, we can use this information even earlier to remove these genes during dimensionality reduction and clustering. This is not generally possible for scRNA-seq as any notable contaminating transcripts may originate from a subpopulation that actually expresses that gene and thus cannot be blindly removed.
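# Ambient profile estimated from the low-count (presumably empty) barcodes.
ambient <- estimateAmbience(counts(sce.brain), round=FALSE, good.turing=FALSE)

# Pooled counts across barcodes called as nuclei; e.out is the emptyDrops() output.
nuclei <- rowSums(counts(sce.brain)[,which(e.out$FDR <= 0.001)])

# Mitochondrial genes act as negative controls for the contamination estimate.
is.mito <- grepl("^mt-", rowData(sce.brain)$Symbol)
contam <- controlAmbience(nuclei, ambient, features=is.mito, mode="proportion")

plot(log10(nuclei+1), contam*100, col=ifelse(is.mito, "red", "grey"), pch=16,
    xlab="Log-nuclei expression", ylab="Contamination (%)")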
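The reasoning behind a control-based estimate can be sketched in base R on a toy example. This illustrates the idea under stated assumptions and is not the DropletUtils implementation: all counts and the `GeneA`/`GeneB` labels are hypothetical, and we assume that every mitochondrial count in the nuclei is ambient in origin, so the ambient profile can be rescaled to match them.

```r
# Hypothetical pooled counts; the mt-* genes serve as negative controls.
ambient <- c("mt-Cytb"=100, "mt-Nd6"=50, GeneA=200, GeneB=10)
nuclei  <- c("mt-Cytb"=10,  "mt-Nd6"=5,  GeneA=40,  GeneB=500)
is.ctrl <- grepl("^mt-", names(nuclei))

# Scale the ambient profile so that it accounts for all control counts in the nuclei.
scaling <- sum(nuclei[is.ctrl]) / sum(ambient[is.ctrl])
expected.ambient <- ambient * scaling

# Proportion of each gene's counts attributable to the ambient solution, capped at 1.
contam <- pmin(1, expected.ambient / nuclei)
```

In this toy example, half of the `GeneA` counts could be explained by contamination, whereas `GeneB` is almost entirely nuclear in origin. Genes like `GeneA` are the ones that would be marked as potentially problematic.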
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS: /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so

locale:
LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=C
LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
DropletUtils_1.10.0 DropletTestFiles_1.0.0 batchelor_1.6.0 bluster_1.0.0
scran_1.18.0 scater_1.18.0 ggplot2_3.3.2 scuttle_1.0.0
scRNAseq_2.4.0 SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0 Biobase_2.50.0
GenomicRanges_1.42.0 GenomeInfoDb_1.26.0 IRanges_2.24.0 S4Vectors_0.28.0
BiocGenerics_0.36.0 MatrixGenerics_1.2.0 matrixStats_0.57.0 BiocStyle_2.18.0
rebook_1.0.0

loaded via a namespace (and not attached):
AnnotationHub_2.22.0 BiocFileCache_1.14.0 igraph_1.2.6 lazyeval_0.2.2
BiocParallel_1.24.0 digest_0.6.27 ensembldb_2.14.0 htmltools_0.5.0
viridis_0.5.1 magrittr_1.5 memoise_1.1.0 limma_3.46.0
Biostrings_2.58.0 R.utils_2.10.1 askpass_1.1 prettyunits_1.1.1
colorspace_1.4-1 blob_1.2.1 rappdirs_0.3.1 xfun_0.19
dplyr_1.0.2 callr_3.5.1 crayon_1.3.4 RCurl_1.98-1.2
graph_1.68.0 glue_1.4.2 gtable_0.3.0 zlibbioc_1.36.0
XVector_0.30.0 DelayedArray_0.16.0 BiocSingular_1.6.0 Rhdf5lib_1.12.0
HDF5Array_1.18.0 scales_1.1.1 DBI_1.1.0 edgeR_3.32.0
Rcpp_1.0.5 viridisLite_0.3.0 xtable_1.8-4 progress_1.2.2
dqrng_0.2.1 bit_4.0.4 rsvd_1.0.3 ResidualMatrix_1.0.0
httr_1.4.2 ellipsis_0.3.1 R.methodsS3_1.8.1 pkgconfig_2.0.3
XML_3.99-0.5 farver_2.0.3 CodeDepends_0.6.5 dbplyr_1.4.4
locfit_1.5-9.4 tidyselect_1.1.0 labeling_0.4.2 rlang_0.4.8
later_1.1.0.1 AnnotationDbi_1.52.0 munsell_0.5.0 BiocVersion_3.12.0
tools_4.0.3 generics_0.1.0 RSQLite_2.2.1 ExperimentHub_1.16.0
evaluate_0.14 stringr_1.4.0 fastmap_1.0.1 yaml_2.2.1
processx_3.4.4 knitr_1.30 bit64_4.0.5 purrr_0.3.4
AnnotationFilter_1.14.0 sparseMatrixStats_1.2.0 mime_0.9 R.oo_1.24.0
xml2_1.3.2 biomaRt_2.46.0 compiler_4.0.3 beeswarm_0.2.3
curl_4.3 interactiveDisplayBase_1.28.0 tibble_3.0.4 statmod_1.4.35
stringi_1.5.3 highr_0.8 ps_1.4.0 GenomicFeatures_1.42.0
lattice_0.20-41 ProtGenerics_1.22.0 Matrix_1.2-18 vctrs_0.3.4
rhdf5filters_1.2.0 pillar_1.4.6 lifecycle_0.2.0 BiocManager_1.30.10
BiocNeighbors_1.8.0 cowplot_1.1.0 bitops_1.0-6 irlba_2.3.3
httpuv_1.5.4 rtracklayer_1.50.0 R6_2.5.0 bookdown_0.21
promises_1.1.1 gridExtra_2.3 vipor_0.4.5 codetools_0.2-16
assertthat_0.2.1 rhdf5_2.34.0 openssl_1.4.3 withr_2.3.0
GenomicAlignments_1.26.0 Rsamtools_2.6.0 GenomeInfoDbData_1.2.4 hms_0.5.3
grid_4.0.3 beachmat_2.6.0 rmarkdown_2.5 DelayedMatrixStats_1.12.0
Rtsne_0.15 shiny_1.5.0 ggbeeswarm_0.6.0
Bakken, T. E., R. D. Hodge, J. A. Miller, Z. Yao, T. N. Nguyen, B. Aevermann, E. Barkan, et al. 2018. “Single-nucleus and single-cell transcriptomes compared in matched cortical cell types.” PLoS ONE 13 (12): e0209648.
Wu, H., Y. Kirita, E. L. Donnelly, and B. D. Humphreys. 2019. “Advantages of Single-Nucleus over Single-Cell RNA Sequencing of Adult Kidney: Rare Cell Types and Novel Cell States Revealed in Fibrosis.” J. Am. Soc. Nephrol. 30 (1): 23–32.
Zheng, G. X., J. M. Terry, P. Belgrader, P. Ryvkin, Z. W. Bent, R. Wilson, S. B. Ziraldo, et al. 2017. “Massively parallel digital transcriptional profiling of single cells.” Nat Commun 8 (January): 14049.