In this vignette, we will cover some of the possible sources for publicly available single-cell data, how to format it for and process it with MCIA and how to run a basic analysis in order to annotate the data and use the results as metadata for exploring the decomposition results.
To install the development version from GitHub:
# devel version
# install.packages("devtools")
devtools::install_github("Muunraker/nipalsMCIA", ref = "devel",
force = TRUE, build_vignettes = TRUE) # devel version
To install the released Bioconductor version:
# release version
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("nipalsMCIA")
# note that the TENxPBMCData package is not included in this list as you may
# decide to pull data from another source or use our provided objects
library(BiocFileCache)
library(dplyr)
library(ggplot2)
library(ggpubr)
library(nipalsMCIA)
library(piggyback)
library(Seurat)
# NIPALS starts with a random vector
set.seed(42)
# if you would like to save any of the data loaded/created locally
path_data <- file.path("..", "data")
To manage the data used in this vignette, we will be using BiocFileCache
:
# as suggested in: https://bioconductor.org/packages/release/bioc/vignettes/BiocFileCache/inst/doc/BiocFileCache.html#cache-to-manage-package-data
# since we only use the cache in this vignette, we don't need wrapper functions
# initialize the cache in a temporary directory
bfc <- BiocFileCache::BiocFileCache()
# set up or load the unique resource ids
rid <- BiocFileCache::bfcquery(bfc, query = "vignette_2", "rname")$rid
# rid <- bfcrid(bfc) # we only have one rname, so you could also just use this
# if the data is not in the cache, download and add it
if (length(rid) == 0) {
# specify `tag = ` to use a different release other than latest
urls <-
piggyback::pb_download_url(repo = "Muunraker/nipalsMCIA", tag = "latest")
# you could add all of the urls to rid at once, but it would throw this error
# when it comes to downloading them:
# "Error in parse_url(url) : length(url) == 1 is not TRUE"
for (url in urls) {
# the rids are numbered in the order they are added e.g. BFC1, BFC2, ...
message(paste("Downloading file", url))
rid <- names(BiocFileCache::bfcadd(bfc, rname = "vignette_2", url))
# check if the cached data needs updating and download it if needed
if (!isFALSE(BiocFileCache::bfcneedsupdate(bfc, rid))) {
# this will overwrite existing files without asking
BiocFileCache::bfcdownload(bfc, rid, ask = FALSE)
}
}
}
# return the paths to the cached files
# BiocFileCache::bfcrpath(bfc, rid = rid)
The temporary directory being used is:
# you can also run tools::R_user_dir("BiocFileCache", which = "cache")
BiocFileCache::getBFCOption("CACHE")
## [1] "/home/biocbuild/.cache/R/BiocFileCache"
The lines for the data_sc_sce.Rda are dotted because we do not provide this object below; however, the code describing how to generate it is still included.
We will be using the 10x Genomics “pbmc5k-CITEseq” dataset, which was published in May 2019 and processed using Cell Ranger 3.0.2. It contains 5,247 detected cells from PBMCs and 33,538 genes along with 32 cell surface markers. We chose this dataset due to it containing both gene expression (GEX) and cell surface protein (ADT) data as well as being relatively recent, publicly available and from a widely used platform.
The following code details several ways in which to load in this dataset. You may prefer to use data from other sources outside of 10x Genomics, in which case you will have to format it to work with nipalsMCIA.
We have the objects described below available in data files which you can download and load in here if you do not want to run the following sections. Note: the processed object that you will need in order to run MCIA (data_blocks_sc.Rda) is currently commented out as we do not actively use it in this vignette and instead read in the results directly to save on rendering time and memory. You can still download it if you choose to as we provide the file.
# list all of the currently available files in the latest release
piggyback::pb_list(repo = "Muunraker/nipalsMCIA", tag = "latest")
Get the file paths from the cache (for clarity, we actually read in each file in the following sections instead of loading them in here):
# files needed for running MCIA
path_metadata_sc <-
BiocFileCache::bfcquery(bfc, query = "metadata_sc")$rpath
# MCIA results
# path_data_blocks <- bfcquery(bfc, query = "data_blocks")$rpath
path_mcia_results <-
BiocFileCache::bfcquery(bfc, query = "mcia_results_sc")$rpath
# marker genes for cell type annotation with Seurat
path_marker_genes <-
BiocFileCache::bfcquery(bfc, query = "marker_genes")$rpath
# the Seurat data's metric summary file from 10x Genomics
path_metrics_summary <-
BiocFileCache::bfcquery(bfc, query = "metrics_summary")$rpath
# Seurat objects in different stages of processing
path_obj_raw <-
BiocFileCache::bfcquery(bfc, query = "CITEseq_raw")$rpath
path_obj_annotated <-
BiocFileCache::bfcquery(bfc, query = "CITEseq_annotated")$rpath
There are several Bioconductor packages which provide single-cell data for users as part of the package. The TENxPBMCData contains 9 different publicly available datasets from 10x Genomics (stored as SingleCellExperiment classes), including the one that we will be analyzing.
Note that the CITE-Seq information is being stored as an “alternative experiment” within the SingleCellExperiment object. Both the GEX ("mainExpName: Gene Expression"
) and ADT ("altExpNames(1): Antibody Capture"
) data were stored in the DelayedMatrix
format since they are so large. In this object, the rows represent the features and the columns represent the cells.
# read in the data as a SingleCellExperiment object
tenx_pbmc5k <- TENxPBMCData::TENxPBMCData(dataset = "pbmc5k-CITEseq")
# examine the data
tenx_pbmc5k
## class: SingleCellExperiment
## dim: 33538 5247
## metadata(0):
## assays(1): counts
## rownames(33538): ENSG00000243485 ENSG00000237613 ... ENSG00000277475 ENSG00000268674
## rowData names(4): ENSEMBL_ID Symbol_TENx Type Symbol
## colnames: NULL
## colData names(11): Sample Barcode ... Individual Date_published
## reducedDimNames(0):
## mainExpName: Gene Expression
## altExpNames(1): Antibody Capture
counts(tenx_pbmc5k)
## <33538 x 5247> sparse matrix of class DelayedMatrix and type "integer":
## [, 1] [, 2] [, 3] [, 4] ... [, 5244] [, 5245] [, 5246] [, 5247]
## ENSG00000243485 0 0 0 0 . 0 0 0 0
## ENSG00000237613 0 0 0 0 . 0 0 0 0
## ENSG00000186092 0 0 0 0 . 0 0 0 0
## ENSG00000238009 0 0 0 0 . 0 0 0 0
## ENSG00000239945 0 0 0 0 . 0 0 0 0
## ... . . . . . . . . .
## ENSG00000277856 0 0 0 0 . 0 0 0 0
## ENSG00000275063 0 0 0 0 . 0 0 0 0
## ENSG00000271254 0 0 0 0 . 0 0 0 0
## ENSG00000277475 0 0 0 0 . 0 0 0 0
## ENSG00000268674 0 0 0 0 . 0 0 0 0
counts(altExp(tenx_pbmc5k))
## <32 x 5247> sparse matrix of class DelayedMatrix and type "integer":
## [, 1] [, 2] [, 3] [, 4] ... [, 5244] [, 5245] [, 5246] [, 5247]
## CD3 25 959 942 802 . 402 401 6 1773
## CD4 164 720 1647 1666 . 1417 1 46 1903
## CD8a 16 8 21 5 . 8 222 3 9
## CD11b 3011 12 11 11 . 15 7 1027 9
## CD14 696 12 13 9 . 9 17 382 8
## ... . . . . . . . . .
## HLA-DR 573 15 11 19 . 6 40 184 32
## TIGIT 10 3 3 3 . 2 15 1 12
## IgG1 4 4 2 4 . 1 0 2 4
## IgG2a 1 3 0 6 . 4 0 4 2
## IgG2b 6 2 4 8 . 0 0 2 5
# examine the metadata:
head(colData(tenx_pbmc5k), n = 3)
## DataFrame with 6 rows and 11 columns
## Sample Barcode Sequence Library Cell_ranger_version Tissue_status Barcode_type
## <character> <character> <character> <integer> <character> <character> <character>
## 1 pbmc5k-CITEseq AAACCCAAGAGACAAG-1 AAACCCAAGAGACAAG 1 v3.0.2 NA Chromium
## 2 pbmc5k-CITEseq AAACCCAAGGCCTAGA-1 AAACCCAAGGCCTAGA 1 v3.0.2 NA Chromium
## 3 pbmc5k-CITEseq AAACCCAGTCGTGCCA-1 AAACCCAGTCGTGCCA 1 v3.0.2 NA Chromium
## Chemistry Sequence_platform Individual Date_published
## <character> <character> <character> <character>
## 1 Chromium_v3 NovaSeq HealthyDonor 2019-05-29
## 2 Chromium_v3 NovaSeq HealthyDonor 2019-05-29
## 3 Chromium_v3 NovaSeq HealthyDonor 2019-05-29
head(rowData(tenx_pbmc5k), n = 3)
## DataFrame with 6 rows and 4 columns
## ENSEMBL_ID Symbol_TENx Type Symbol
## <character> <character> <character> <character>
## ENSG00000243485 ENSG00000243485 MIR1302-2HG Gene Expression NA
## ENSG00000237613 ENSG00000237613 FAM138A Gene Expression FAM138A
## ENSG00000186092 ENSG00000186092 OR4F5 Gene Expression OR4F5
metadata(tenx_pbmc5k)
## list()
# change the gene names from Ensembl IDs to the 10x genes
rownames(tenx_pbmc5k) <- rowData(tenx_pbmc5k)$Symbol_TENx
In order to run MCIA, the format of this data must be slightly modified:
# set up the list
data_blocks_sc_sce <- list()
data_blocks_sc_sce$mrna <- data.frame(as.matrix(counts(tenx_pbmc5k)))
data_blocks_sc_sce$adt <- data.frame(as.matrix(counts(altExp(tenx_pbmc5k))))
summary(data_blocks_sc_sce)
## Length Class Mode
## mrna 5247 data.frame list
## adt 5247 data.frame list
# convert to a Seurat object (using `as.Seurat` won't work here)
obj_sce <- CreateSeuratObject(counts = data_blocks_sc_sce$mrna, # assay = "RNA"
project = "pbmc5k_CITEseq")
obj_sce[["ADT"]] <- CreateAssayObject(counts = data_blocks_sc_sce$adt)
# name the cells with their barcodes
obj_sce <- RenameCells(object = obj_sce,
new.names = colData(tenx_pbmc5k)$Sequence)
# add metadata from the SingleCellExperiment object
obj_sce <- AddMetaData(object = obj_sce,
metadata = as.data.frame(colData(tenx_pbmc5k),
row.names = Cells(obj_sce)))
# this object will be slightly different than from the Seurat one down below
# e.g. 5297 rows vs. 4193 rows (since QC wasn't done) and different metadata
head(obj_sce[[]], n = 3)
## orig.ident nCount_RNA nFeature_RNA nCount_ADT nFeature_ADT Sample Barcode
## AAACCCAAGAGACAAG SeuratProject 7375 2363 5178 31 pbmc5k-CITEseq AAACCCAAGAGACAAG-1
## AAACCCAAGGCCTAGA SeuratProject 3772 1259 2893 29 pbmc5k-CITEseq AAACCCAAGGCCTAGA-1
## AAACCCAGTCGTGCCA SeuratProject 4902 1578 3635 29 pbmc5k-CITEseq AAACCCAGTCGTGCCA-1
## Sequence Library Cell_ranger_version Tissue_status Barcode_type Chemistry Sequence_platform
## AAACCCAAGAGACAAG AAACCCAAGAGACAAG 1 v3.0.2 <NA> Chromium Chromium_v3 NovaSeq
## AAACCCAAGGCCTAGA AAACCCAAGGCCTAGA 1 v3.0.2 <NA> Chromium Chromium_v3 NovaSeq
## AAACCCAGTCGTGCCA AAACCCAGTCGTGCCA 1 v3.0.2 <NA> Chromium Chromium_v3 NovaSeq
## Individual Date_published
## AAACCCAAGAGACAAG HealthyDonor 2019-05-29
## AAACCCAAGGCCTAGA HealthyDonor 2019-05-29
## AAACCCAGTCGTGCCA HealthyDonor 2019-05-29
# save the data locally if desired
save(data_blocks_sc_sce, obj_sce,
file = file.path(path_data, "data_sc_sce.Rda"))
The original dataset can be found on the 10x Genomics Datasets website and an explanation of the file types for this version can be found here. You can download them all to a directory of your choosing with their suggested terminal commands:
# Input Files
curl -O https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_fastqs.tar
curl -O https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_feature_ref.csv
# Output Files
curl -O https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_possorted_genome_bam.bam
curl -O https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_possorted_genome_bam.bam.bai
curl -O https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_molecule_info.h5
curl -O https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_filtered_feature_bc_matrix.h5
curl -O https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_filtered_feature_bc_matrix.tar.gz
curl -O https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_raw_feature_bc_matrix.h5
curl -O https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_raw_feature_bc_matrix.tar.gz
curl -O https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_analysis.tar.gz
curl -O https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_metrics_summary.csv
curl -O https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_web_summary.html
curl -O https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_cloupe.cloupe
If you would like to view some basic information about the data, you can open the web_summary.html in your favorite web browser. It will show the estimated number of cells, mean reads per cell, median genes per cell and a variety of other sample metrics. These metrics are also available in the metrics_summary.csv file.
You will need the filtered_feature_bc_matrix directory for the following analysis; it can be extracted with tar -xvzf 5k_pbmc_protein_v3_filtered_feature_bc_matrix.tar.gz
from within the relevant data directory.
# load the data (change the file path as needed)
data <- Seurat::Read10X(data.dir = file.path(path_data, "tenx_pbmc5k_CITEseq",
"filtered_feature_bc_matrix"),
strip.suffix = TRUE) # remove the "-1"s from barcodes
## 10X data contains more than one type and is being returned as a list
## containing matrices of each type.
# set minimum cells and/or features here if you'd like
obj <- Seurat::CreateSeuratObject(counts = data$`Gene Expression`,
project = "pbmc5k_CITEseq")
obj[["ADT"]] <- Seurat::CreateAssayObject(counts = data$`Antibody Capture`)
## Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')
# check the assays
Seurat::Assays(object = obj)
## "RNA" "ADT"
# list out the CITE-Seq surface protein markers
rownames(obj[["ADT"]])
## [1] "CD3-TotalSeqB" "CD4-TotalSeqB" "CD8a-TotalSeqB"
## [4] "CD11b-TotalSeqB" "CD14-TotalSeqB" "CD15-TotalSeqB"
## [7] "CD16-TotalSeqB" "CD19-TotalSeqB" "CD20-TotalSeqB"
## [10] "CD25-TotalSeqB" "CD27-TotalSeqB" "CD28-TotalSeqB"
## [13] "CD34-TotalSeqB" "CD45RA-TotalSeqB" "CD45RO-TotalSeqB"
## [16] "CD56-TotalSeqB" "CD62L-TotalSeqB" "CD69-TotalSeqB"
## [19] "CD80-TotalSeqB" "CD86-TotalSeqB" "CD127-TotalSeqB"
## [22] "CD137-TotalSeqB" "CD197-TotalSeqB" "CD274-TotalSeqB"
## [25] "CD278-TotalSeqB" "CD335-TotalSeqB" "PD-1-TotalSeqB"
## [28] "HLA-DR-TotalSeqB" "TIGIT-TotalSeqB" "IgG1-control-TotalSeqB"
## [31] "IgG2a-control-TotalSeqB" "IgG2b-control-TotalSeqB"
# save the data locally if desired
saveRDS(obj, file.path(path_data, "tenx_pbmc5k_CITEseq_raw.rds"))
You can also load the Seurat object directly, such as in the Read in and process the data subsection in the Deep dive: Seurat analysis section later in the vignette.
Note that the “-1”s have been removed from the cell barcodes.
# read in the annotated cells
metadata_sc <- read.csv(file = path_metadata_sc[1], header = TRUE, row.names = 1)
# examples
metadata_sc %>% slice_sample(n = 5)
This will run on cells with the top 2000 variable features (as defined in the later analysis).
# load the object set up for running MCIA [10x Genomics & Seurat]
# (if you save it in the following code block or download it from a release)
load(file = path_data_blocks)
# "largest_sv" results in a more balanced contribution
# from the blocks than the default "unit_var"
set.seed(42)
# convert data_blocks_sc to an MAE object using the SingleCellExperiment class
data_blocks_sc_mae <-
MultiAssayExperiment::MultiAssayExperiment(lapply(data_blocks_sc, function(x)
SingleCellExperiment::SingleCellExperiment(t(as.matrix(x)))),
colData = metadata_sc)
mcia_results_sc <- nipals_multiblock(data_blocks = data_blocks_sc_mae,
col_preproc_method = "colprofile",
block_preproc_method = "largest_sv",
num_PCs = 10, tol = 1e-9,
deflationMethod = "global",
plots = "none")
## Performing column-level pre-processing...
## Column pre-processing completed.
## Performing block-level preprocessing...
## Block pre-processing completed.
## Computing order 1 scores
## Computing order 2 scores
## Computing order 3 scores
## Computing order 4 scores
## Computing order 5 scores
## Computing order 6 scores
## Computing order 7 scores
## Computing order 8 scores
## Computing order 9 scores
## Computing order 10 scores
# saveRDS(mcia_results_sc, file = file.path(path_data, "mcia_results_sc.Rds"))
# load the results of the previous block (if already run and saved)
mcia_results_sc <- readRDS(file = path_mcia_results[1])
mcia_results_sc
## NipalsResult Object with properties:
## > Dataset dimensions: 4193 x 2032
## > Number of blocks: 2
## > Order of scores: 10
## > Column preprocessing: colprofile
## > Block preprocessing: largest_sv
## > Block names and sizes:
## mrna adt
## 2000 32
This data comes from only one subject, so we will use the annotated cell types for the metadata (see the Deep dive: Seurat analysis section for details on their origin).
# for the projection plot
# technically you could just do color_pal_params = list(option = "D"), but saving
# the colors is useful for other plots like in the Seurat section
meta_colors_sc <- get_metadata_colors(mcia_results = mcia_results_sc,
color_col = "CellType",
color_pal = scales::viridis_pal,
color_pal_params = list(option = "D"))
# for other plots
colors_omics_sc <- get_colors(mcia_results = mcia_results_sc)
global_scores_eigenvalues_plot(mcia_results = mcia_results_sc)
projection_plot(mcia_results = mcia_results_sc,
projection = "global", orders = c(1, 2),
color_col = "CellType", color_pal = meta_colors_sc,
legend_loc = "bottomright")
suppressMessages(global_scores_heatmap(mcia_results = mcia_results_sc,
color_col = "CellType",
color_pal = meta_colors_sc))