Chapter 25 Lun 416B cell line (Smart-seq2)

25.1 Introduction

The A. T. L. Lun et al. (2017) dataset contains two 96-well plates of 416B cells (an immortalized mouse myeloid progenitor cell line), processed using the Smart-seq2 protocol (Picelli et al. 2014). A constant amount of spike-in RNA from the External RNA Controls Consortium (ERCC) was also added to each cell’s lysate prior to library preparation. High-throughput sequencing was performed and the expression of each gene was quantified by counting the total number of reads mapped to its exonic regions. Similarly, the quantity of each spike-in transcript was measured by counting the number of reads mapped to the spike-in reference sequences.

25.3 Quality control

We save an unfiltered copy of the SingleCellExperiment for later use.

Technically, we do not need to use the mitochondrial proportions as we already have the spike-in proportions (which serve a similar purpose) for this dataset. However, it probably doesn’t do any harm to include it anyway.

Distribution of each QC metric across cells in the 416B dataset, stratified by the plate of origin. Each point represents a cell and is colored according to whether that cell was discarded.

Figure 25.1: Distribution of each QC metric across cells in the 416B dataset, stratified by the plate of origin. Each point represents a cell and is colored according to whether that cell was discarded.

Percentage of mitochondrial reads in each cell in the 416B dataset, compared to the total count (left) or the percentage of spike-in reads (right). Each point represents a cell and is colored according to whether that cell was discarded.

Figure 25.2: Percentage of mitochondrial reads in each cell in the 416B dataset, compared to the total count (left) or the percentage of spike-in reads (right). Each point represents a cell and is colored according to whether that cell was discarded.

We also examine the number of cells removed for each reason.

##              low_lib_size            low_n_features   high_subsets_Mt_percent 
##                         5                         0                         2 
## high_altexps_ERCC_percent                   discard 
##                         2                         7

25.4 Normalization

No pre-clustering is performed here, as the dataset is small and all cells are derived from the same cell line anyway.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.347   0.711   0.921   1.000   1.152   3.604

We see that the induced cells have size factors that are systematically shifted from the uninduced cells, consistent with the presence of a composition bias (Figure 25.3).

Relationship between the library size factors and the deconvolution size factors in the 416B dataset. Each cell is colored according to its oncogene induction status.

Figure 25.3: Relationship between the library size factors and the deconvolution size factors in the 416B dataset. Each cell is colored according to its oncogene induction status.

25.5 Variance modelling

We block on the plate of origin to minimize plate effects, and then we take the top 10% of genes with the largest biological components.

Per-gene variance as a function of the mean for the log-expression values in the 416B dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the spike-in transcripts (red). This was performed separately for each plate.

Figure 25.4: Per-gene variance as a function of the mean for the log-expression values in the 416B dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the spike-in transcripts (red). This was performed separately for each plate.

25.6 Batch correction

The composition of cells is expected to be the same across the two plates, hence the use of removeBatchEffect() rather than more complex methods. For larger datasets, consider using regressBatches() from the batchelor package.

25.7 Dimensionality reduction

We do not expect a great deal of heterogeneity in this dataset, so we only request 10 PCs. We use an exact SVD to avoid warnings from irlba about handling small datasets.

25.8 Clustering

We compare the clusters to the plate of origin. Each cluster is comprised of cells from both batches, indicating that the clustering is not driven by a batch effect.

##        Plate
## Cluster 20160113 20160325
##       1       40       38
##       2       37       32
##       3       10       14
##       4        6        8

We compare the clusters to the oncogene induction status. We observe differences in in the composition of each cluster, consistent with a biological effect of oncogene induction.

##        Oncogene
## Cluster induced CBFB-MYH11 oncogene expression wild type phenotype
##       1                                     78                   0
##       2                                      0                  69
##       3                                      1                  23
##       4                                     14                   0
Obligatory $t$-SNE plot of the 416B dataset, where each point represents a cell and is colored according to the assigned cluster.

Figure 25.5: Obligatory \(t\)-SNE plot of the 416B dataset, where each point represents a cell and is colored according to the assigned cluster.

Most cells have relatively small positive widths in Figure 25.6, indicating that the separation between clusters is weak. This may be symptomatic of over-clustering where clusters that are clearly defined on oncogene induction status are further split into subsets that are less well separated. Nonetheless, we will proceed with the current clustering scheme as it provides reasonable partitions for further characterization of heterogeneity.

Silhouette plot for the hierarchical clustering of the 416B dataset. Each bar represents the silhouette width for a cell and is colored according to the assigned cluster (if positive) or the closest cluster (if negative).

Figure 25.6: Silhouette plot for the hierarchical clustering of the 416B dataset. Each bar represents the silhouette width for a cell and is colored according to the assigned cluster (if positive) or the closest cluster (if negative).

25.9 Interpretation

## DataFrame with 10 rows and 7 columns
##             Top     p.value         FDR summary.logFC   logFC.2   logFC.3
##       <integer>   <numeric>   <numeric>     <numeric> <numeric> <numeric>
## Ccna2         1 9.85422e-67 4.59246e-62      -7.13310  -7.13310  -2.20632
## Cdca8         1 1.01449e-41 1.52514e-38      -7.26175  -6.00378  -2.03841
## Pirb          1 4.16555e-33 1.95516e-30       5.87820   5.28149   5.87820
## Cks1b         2 2.98233e-40 3.23229e-37      -6.43381  -6.43381  -4.15385
## Aurkb         2 2.41436e-64 5.62593e-60      -6.94063  -6.94063  -1.65534
## Myh11         2 1.28865e-46 3.75353e-43       4.38182   4.38182   4.29290
## Mcm6          3 1.15877e-28 3.69887e-26      -5.44558  -5.44558  -5.82130
## Cdca3         3 5.02047e-45 1.23144e-41      -6.22179  -6.22179  -2.10502
## Top2a         3 7.25965e-61 1.12776e-56      -7.07811  -7.07811  -2.39123
## Mcm2          4 1.50854e-33 7.98908e-31      -5.54197  -5.54197  -6.09178
##          logFC.4
##        <numeric>
## Ccna2 -7.3451052
## Cdca8 -7.2617478
## Pirb   0.0352849
## Cks1b -6.4385323
## Aurkb -6.4162126
## Myh11  0.9410499
## Mcm6  -3.5804973
## Cdca3 -7.0539510
## Top2a -6.8297343
## Mcm2  -3.8238103

We visualize the expression profiles of the top candidates in Figure 25.7 to verify that the DE signature is robust. Most of the top markers have strong and consistent up- or downregulation in cells of cluster 1 compared to some or all of the other clusters. A cursory examination of the heatmap indicates that cluster 1 contains oncogene-induced cells with strong downregulation of DNA replication and cell cycle genes. This is consistent with the potential induction of senescence as an anti-tumorigenic response (Wajapeyee et al. 2010).

Heatmap of the top marker genes for cluster 1 in the 416B dataset, stratified by cluster. The plate of origin and oncogene induction status are also shown for each cell.

Figure 25.7: Heatmap of the top marker genes for cluster 1 in the 416B dataset, stratified by cluster. The plate of origin and oncogene induction status are also shown for each cell.

Session Info

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] cluster_2.1.0               dynamicTreeCut_1.63-1      
 [3] limma_3.46.0                scran_1.18.0               
 [5] scater_1.18.0               ggplot2_3.3.2              
 [7] AnnotationHub_2.22.0        BiocFileCache_1.14.0       
 [9] dbplyr_1.4.4                ensembldb_2.14.0           
[11] AnnotationFilter_1.14.0     GenomicFeatures_1.42.0     
[13] AnnotationDbi_1.52.0        scRNAseq_2.4.0             
[15] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0
[17] Biobase_2.50.0              GenomicRanges_1.42.0       
[19] GenomeInfoDb_1.26.0         IRanges_2.24.0             
[21] S4Vectors_0.28.0            BiocGenerics_0.36.0        
[23] MatrixGenerics_1.2.0        matrixStats_0.57.0         
[25] BiocStyle_2.18.0            rebook_1.0.0               

loaded via a namespace (and not attached):
  [1] igraph_1.2.6                  lazyeval_0.2.2               
  [3] BiocParallel_1.24.0           digest_0.6.27                
  [5] htmltools_0.5.0               viridis_0.5.1                
  [7] magrittr_1.5                  memoise_1.1.0                
  [9] Biostrings_2.58.0             askpass_1.1                  
 [11] prettyunits_1.1.1             colorspace_1.4-1             
 [13] blob_1.2.1                    rappdirs_0.3.1               
 [15] xfun_0.19                     dplyr_1.0.2                  
 [17] callr_3.5.1                   crayon_1.3.4                 
 [19] RCurl_1.98-1.2                graph_1.68.0                 
 [21] glue_1.4.2                    gtable_0.3.0                 
 [23] zlibbioc_1.36.0               XVector_0.30.0               
 [25] DelayedArray_0.16.0           BiocSingular_1.6.0           
 [27] scales_1.1.1                  pheatmap_1.0.12              
 [29] DBI_1.1.0                     edgeR_3.32.0                 
 [31] Rcpp_1.0.5                    viridisLite_0.3.0            
 [33] xtable_1.8-4                  progress_1.2.2               
 [35] dqrng_0.2.1                   bit_4.0.4                    
 [37] rsvd_1.0.3                    httr_1.4.2                   
 [39] RColorBrewer_1.1-2            ellipsis_0.3.1               
 [41] pkgconfig_2.0.3               XML_3.99-0.5                 
 [43] farver_2.0.3                  scuttle_1.0.0                
 [45] CodeDepends_0.6.5             locfit_1.5-9.4               
 [47] tidyselect_1.1.0              labeling_0.4.2               
 [49] rlang_0.4.8                   later_1.1.0.1                
 [51] munsell_0.5.0                 BiocVersion_3.12.0           
 [53] tools_4.0.3                   generics_0.1.0               
 [55] RSQLite_2.2.1                 ExperimentHub_1.16.0         
 [57] evaluate_0.14                 stringr_1.4.0                
 [59] fastmap_1.0.1                 yaml_2.2.1                   
 [61] processx_3.4.4                knitr_1.30                   
 [63] bit64_4.0.5                   purrr_0.3.4                  
 [65] sparseMatrixStats_1.2.0       mime_0.9                     
 [67] xml2_1.3.2                    biomaRt_2.46.0               
 [69] compiler_4.0.3                beeswarm_0.2.3               
 [71] curl_4.3                      interactiveDisplayBase_1.28.0
 [73] tibble_3.0.4                  statmod_1.4.35               
 [75] stringi_1.5.3                 highr_0.8                    
 [77] ps_1.4.0                      lattice_0.20-41              
 [79] bluster_1.0.0                 ProtGenerics_1.22.0          
 [81] Matrix_1.2-18                 vctrs_0.3.4                  
 [83] pillar_1.4.6                  lifecycle_0.2.0              
 [85] BiocManager_1.30.10           BiocNeighbors_1.8.0          
 [87] cowplot_1.1.0                 bitops_1.0-6                 
 [89] irlba_2.3.3                   httpuv_1.5.4                 
 [91] rtracklayer_1.50.0            R6_2.5.0                     
 [93] bookdown_0.21                 promises_1.1.1               
 [95] gridExtra_2.3                 vipor_0.4.5                  
 [97] codetools_0.2-16              assertthat_0.2.1             
 [99] openssl_1.4.3                 withr_2.3.0                  
[101] GenomicAlignments_1.26.0      Rsamtools_2.6.0              
[103] GenomeInfoDbData_1.2.4        hms_0.5.3                    
[105] grid_4.0.3                    beachmat_2.6.0               
[107] rmarkdown_2.5                 DelayedMatrixStats_1.12.0    
[109] Rtsne_0.15                    shiny_1.5.0                  
[111] ggbeeswarm_0.6.0             

Bibliography

Lun, A. T. L., F. J. Calero-Nieto, L. Haim-Vilmovsky, B. Gottgens, and J. C. Marioni. 2017. “Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data.” Genome Res. 27 (11): 1795–1806.

Picelli, S., O. R. Faridani, A. K. Bjorklund, G. Winberg, S. Sagasser, and R. Sandberg. 2014. “Full-length RNA-seq from single cells using Smart-seq2.” Nat Protoc 9 (1): 171–81.

Wajapeyee, N., S. Z. Wang, R. W. Serra, P. D. Solomon, A. Nagarajan, X. Zhu, and M. R. Green. 2010. “Senescence induction in human fibroblasts and hematopoietic progenitors by leukemogenic fusion proteins.” Blood 115 (24): 5057–60.