1 Introduction

The MouseGastrulationData package provides convenient access to the single-cell RNA sequencing (scRNA-seq) datasets from Pijuan-Sala et al. (2019), and additional data generated in similar systems. This study focuses on mouse gastrulation and organogenesis, providing transcriptomic profiles at single-cell resolution across several stages of early development. Datasets are provided as count matrices with additional feature- and sample-level metadata after processing. Raw sequencing data can be acquired from ArrayExpress accession E-MTAB-6967.

2 Installation

The package may be installed from Bioconductor. Bioconductor packages can be accessed using the BiocManager package.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("MouseGastrulationData")

BiocManager also supports installation of the development version of the package from GitHub.

BiocManager::install("MarioniLab/MouseGastrulationData")

To use the package, load it in the typical way.

library(MouseGastrulationData)

3 Processing overview

Detailed methods are described in the methods section that accompanies the paper, and the analysis code is available in the corresponding GitHub repository. Briefly, whole embryos were dissociated at timepoints between embryonic days (E) 6.5 and 8.5 of development. Libraries were generated using the 10x Genomics Chromium platform (v1 chemistry) and sequenced on the Illumina HiSeq 2500. The computational analysis involved a number of steps:

  • Demultiplexing, read alignment and feature quantification were performed with Cell Ranger using Ensembl 92 genome annotation.
  • Swapped molecules were excluded using the swappedDrops() function from DropletUtils (Griffiths et al. 2018).
  • Cell-containing droplets were called using the emptyDrops() function from DropletUtils (Lun et al. 2019).
  • Called cells with aberrant transcriptional features (e.g., high mitochondrial gene content) were filtered out.
  • Size factors were computed using the computeSumFactors() function from scran (Lun, Bach, and Marioni 2016).
  • Putative doublets were identified and excluded using the doubletCells() function from scran.
  • Cytoplasm-stripped nuclei were also excluded.
  • Batch correction was performed in the principal component space with fastMNN() from scran (Haghverdi et al. 2018).
  • Clusters were identified using a recursive strategy with buildSNNGraph() (from scran) and cluster_louvain() (from igraph), and were annotated and merged into interpretable units by hand.
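As an illustration of the quality-control step in the list above, the following base R sketch filters cells by the fraction of counts that come from mitochondrial genes. It is a toy example under stated assumptions, not the code used in the study: the count matrix is invented, the 10% cut-off is hypothetical, and mitochondrial genes are recognised by the `mt-` prefix of their mouse gene symbols.

```r
# Toy count matrix (invented values): genes in rows, cells in columns.
counts <- matrix(
    c( 1,  2, 40, 57,   # cellA
      30, 25, 20, 25,   # cellB
       0,  3, 50, 47),  # cellC
    nrow = 4,
    dimnames = list(
        c("mt-Nd1", "mt-Co1", "Sox17", "Xkr4"),
        c("cellA", "cellB", "cellC")
    )
)

# Mouse mitochondrial gene symbols start with "mt-".
is_mito <- grepl("^mt-", rownames(counts))

# Fraction of each cell's counts assigned to mitochondrial genes.
mito_frac <- colSums(counts[is_mito, , drop = FALSE]) / colSums(counts)

# Hypothetical threshold, chosen only for this illustration.
max_mito_frac <- 0.1
keep <- mito_frac < max_mito_frac

filtered <- counts[, keep, drop = FALSE]
```

Here `cellB` is dropped because more than half of its counts are mitochondrial, while the other two cells are retained.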

4 Atlas data format

The data accessible via this package are stored in subsets according to the different 10x samples that were generated. For the embryo atlas, the exported object AtlasSampleMetadata provides metadata for each of the samples. Descriptions of the contents of each column can be accessed using ?AtlasSampleMetadata.

head(AtlasSampleMetadata, n = 3)
##   sample stage pool_index seq_batch ncells
## 1      1  E6.5          1         1    360
## 2      2  E7.5          2         1    356
## 3      3  E7.5          3         1    458

All data access functions allow you to select the particular samples you would like to access. By loading only the samples that you are interested in for your particular analysis, you will save time when downloading and loading the data, and also reduce memory consumption on your machine.
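One way to choose samples is to filter the sample metadata by stage and pass the resulting sample numbers to a data access function. The sketch below uses a small stand-in data frame, copied from the three rows printed above, so that it runs without downloading anything. The final call is left as a comment and assumes that the `samples` argument accepts a vector of sample numbers.

```r
# Stand-in for AtlasSampleMetadata: the three rows printed above.
# With the package loaded, use the exported AtlasSampleMetadata object instead.
sample_meta <- data.frame(
    sample = 1:3,
    stage = c("E6.5", "E7.5", "E7.5"),
    pool_index = 1:3,
    seq_batch = c(1L, 1L, 1L),
    ncells = c(360L, 356L, 458L),
    stringsAsFactors = FALSE
)

# Pick the samples from a single stage and total their cell counts.
is_e75 <- sample_meta$stage == "E7.5"
e75_samples <- sample_meta$sample[is_e75]
e75_ncells <- sum(sample_meta$ncells[is_e75])

# Assumption: the samples argument accepts a vector of sample numbers.
# sce_e75 <- EmbryoAtlasData(samples = e75_samples)
```

Checking the total of `ncells` before loading gives a rough idea of how large the resulting object will be.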

4.1 Processed data access

The package provides the dataset in the form of a SingleCellExperiment object. This section details how you can interact with the object. We load in only one of the samples from the atlas to reduce memory consumption when compiling this vignette.

sce <- EmbryoAtlasData(samples = 21)
sce
## class: SingleCellExperiment 
## dim: 29452 4651 
## metadata(0):
## assays(1): counts
## rownames(29452): ENSMUSG00000051951 ENSMUSG00000089699 ...
##   ENSMUSG00000096730 ENSMUSG00000095742
## rowData names(2): ENSEMBL SYMBOL
## colnames(4651): cell_52466 cell_52467 ... cell_57115 cell_57116
## colData names(16): cell barcode ... celltype colour
## reducedDimNames(2): pca.corrected umap
## spikeNames(0):
## altExpNames(0):

We use the counts() function to retrieve the count matrix. The counts are stored as a sparse matrix, as implemented in the Matrix package.

counts(sce)[6:9, 1:3]
## 4 x 3 sparse Matrix of class "dgTMatrix"
##                    cell_52466 cell_52467 cell_52468
## ENSMUSG00000104328          .          .          .
## ENSMUSG00000033845          6          8         10
## ENSMUSG00000025903          .          .          .
## ENSMUSG00000104217          .          .          .

Size factors for normalisation are present in the object and are accessed with the sizeFactors() function.

head(sizeFactors(sce))
## [1] 0.8845695 1.4688375 1.2512019 0.8287969 1.3668086 0.9247460

After running scater’s normalize() function on the SingleCellExperiment object, normalised or log-transformed counts can be accessed using normcounts() and logcounts(). These are not demonstrated in this vignette to avoid a dependency on scater.
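The arithmetic behind size-factor normalisation can still be sketched in base R. The example below takes the counts for ENSMUSG00000033845 and the first three size factors from the outputs shown above, divides each count by its cell's size factor, and log-transforms the result. The log2 transform with a pseudo-count of 1 is an assumption made for this illustration. scater may also rescale the size factors before using them, so its output need not match these values exactly.

```r
# Values copied from the outputs shown above: counts for gene
# ENSMUSG00000033845 in the first three cells of sample 21,
# and the size factors of those same cells.
raw_counts <- c(cell_52466 = 6, cell_52467 = 8, cell_52468 = 10)
size_factors <- c(0.8845695, 1.4688375, 1.2512019)

# Scale each cell's count by its size factor.
norm_counts <- raw_counts / size_factors

# Log-transform with a pseudo-count of 1 (assumed choice for this sketch).
log_counts <- log2(norm_counts + 1)
```

Note that the second cell has a higher raw count than the first but a lower normalised value, because its size factor is larger.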

The MGI symbol and Ensembl gene ID for each gene are stored in the rowData of the SingleCellExperiment object.

head(rowData(sce))
## DataFrame with 6 rows and 2 columns
##                               ENSEMBL      SYMBOL
##                           <character> <character>
## ENSMUSG00000051951 ENSMUSG00000051951        Xkr4
## ENSMUSG00000089699 ENSMUSG00000089699      Gm1992
## ENSMUSG00000102343 ENSMUSG00000102343     Gm37381
## ENSMUSG00000025900 ENSMUSG00000025900         Rp1
## ENSMUSG00000025902 ENSMUSG00000025902       Sox17
## ENSMUSG00000104328 ENSMUSG00000104328     Gm37323

The colData contains cell-specific attributes. The meaning of each field is detailed in the function documentation (?EmbryoAtlasData).

head(colData(sce))
## DataFrame with 6 rows and 16 columns
##                   cell        barcode    sample      pool              stage
##            <character>    <character> <integer> <integer>        <character>
## cell_52466  cell_52466 AAACATACACGGAG        21        17 mixed_gastrulation
## cell_52467  cell_52467 AAACATACCCAACA        21        17 mixed_gastrulation
## cell_52468  cell_52468 AAACATACTTGCGA        21        17 mixed_gastrulation
## cell_52469  cell_52469 AAACATTGATCGGT        21        17 mixed_gastrulation
## cell_52470  cell_52470 AAACATTGCTTATC        21        17 mixed_gastrulation
## cell_52471  cell_52471 AAACATTGGTTCGA        21        17 mixed_gastrulation
##            sequencing.batch     theiler       doub.density   doublet
##                   <integer> <character>          <numeric> <logical>
## cell_52466                2      TS9-10 0.0315539094479002     FALSE
## cell_52467                2      TS9-10    0.1362418810328     FALSE
## cell_52468                2      TS9-10  0.746897566083627     FALSE
## cell_52469                2      TS9-10  0.270453175239345     FALSE
## cell_52470                2      TS9-10  0.222603910205652     FALSE
## cell_52471                2      TS9-10  0.326151882589801     FALSE
##              cluster cluster.sub cluster.stage cluster.theiler  stripped
##            <integer>   <integer>     <integer>       <integer> <logical>
## cell_52466        14           2             5               5     FALSE
## cell_52467         3           6            12              12     FALSE
## cell_52468         2           3             3               3     FALSE
## cell_52469         1           3             1               1     FALSE
## cell_52470        19           1             5               5     FALSE
## cell_52471         5           1             4               4     FALSE
##                                  celltype      colour
##                               <character> <character>
## cell_52466            Blood progenitors 2      c9a997
## cell_52467                   ExE ectoderm      989898
## cell_52468                       Epiblast      635547
## cell_52469           Rostral neurectoderm      65A83E
## cell_52470 Haematoendothelial progenitors      FBBE92
## cell_52471               Nascent mesoderm      C594BF

Batch-corrected PCA representations of the data are available via the reducedDim() function, in the pca.corrected slot. This representation contains NA values for cells that are doublets or cytoplasm-stripped nuclei.
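Because of these NA values, it can help to drop incomplete rows before passing the coordinates to downstream functions. The sketch below shows one way to do this on a toy matrix that stands in for `reducedDim(sce, "pca.corrected")`. The values and cell names are invented for illustration.

```r
# Toy stand-in for reducedDim(sce, "pca.corrected"): four cells, two components.
# All-NA rows mimic cells flagged as doublets or stripped nuclei.
pca <- matrix(
    c(0.5, NA, -1.2, NA,   # PC1
      1.1, NA,  0.3, NA),  # PC2
    ncol = 2,
    dimnames = list(paste0("cell_", 1:4), c("PC1", "PC2"))
)

# Keep only the cells that have a complete set of coordinates.
has_coords <- stats::complete.cases(pca)
pca_clean <- pca[has_coords, , drop = FALSE]
```

The logical vector `has_coords` can also be used to subset the columns of the full object so that the cells and their coordinates stay aligned.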

A vector of celltype colours (as used in the paper) is also provided in the exported object EmbryoCelltypeColours. Its use is shown below.

# Exclude technical artefacts (doublets and cytoplasm-stripped nuclei)
singlets <- which(!(colData(sce)$doublet | colData(sce)$stripped))
plot(
    x = reducedDim(sce, "umap")[singlets, 1],
    y = reducedDim(sce, "umap")[singlets, 2],
    col = EmbryoCelltypeColours[colData(sce)$celltype[singlets]],
    pch = 19,
    xaxt = "n", yaxt = "n",
    xlab = "UMAP1", ylab = "UMAP2"
)