1 A CITE-seq example

The purpose of this vignette is to illustrate the feasibility of reflecting the material in the online tutorial for scvi-tools 0.20.0 in Bioconductor. The authors of the tutorial describe it as producing

a joint latent representation of cells, denoised data for both protein and RNA

Additional tasks include

integrat[ing] datasets, and comput[ing] differential expression of RNA and protein

The integration concerns the simultaneous analysis of two datasets from 10x genomcs.

In this vignette we carry out the bulk of the tutorial activities using R and Bioconductor, reaching to scvi-tools python code via basilisk.

1.1 Retrieval of PBMC data

The following chunk will acquire (and cache, using BiocFileCache) a preprocessed version of the 10k and 5k combined CITE-seq experiments from the scvi-tools data repository.

adref = getCiteseq5k10kPbmcs()
## AnnData object with n_obs × n_vars = 10849 × 4000
##     obs: 'n_genes', 'percent_mito', 'n_counts', 'batch', '_scvi_labels', '_scvi_batch'
##     var: 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches'
##     uns: '_scvi_manager_uuid', '_scvi_uuid', 'hvg', 'log1p'
##     obsm: 'protein_expression'
##     layers: 'counts'

1.2 Retrieval of fitted VAE

The totalVI variational autoencoder was fit with these data. A fitted version is retrieved and cached using

vae = getCiteseqTutvae()

This is an instance of an S3 class, python.builtin.object, defined in the reticulate package.

## [1] "scvi.model._totalvi.TOTALVI"               
## [2] "scvi.model.base._rnamixin.RNASeqMixin"     
## [3] "scvi.model.base._vaemixin.VAEMixin"        
## [4] "scvi.model.base._archesmixin.ArchesMixin"  
## [5] "scvi.model.base._base_model.BaseModelClass"
## [6] "scvi.autotune._types.TunableMixin"         
## [7] "python.builtin.object"

Some fields of interest that are directly available from the instance include an indicator of the trained state, the general parameters used to train, and the “anndata” (annotated data) object that includes the input counts and various results of preprocessing:

## [1] TRUE
## TotalVI Model with the following params: 
## n_latent: 20, gene_dispersion: gene, protein_dispersion: protein, gene_likelihood: nb, latent_distribution: normal
## AnnData object with n_obs × n_vars = 10849 × 4000
##     obs: 'n_genes', 'percent_mito', 'n_counts', 'batch', '_scvi_labels', '_scvi_batch'
##     var: 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches'
##     uns: '_scvi_manager_uuid', '_scvi_uuid', 'hvg', 'log1p'
##     obsm: 'protein_expression'
##     layers: 'counts'

The structure of the VAE is reported using


This is quite voluminous and is provided in an appendix.

1.3 Trace of (negative) ELBO values

The negative “evidence lower bound” (ELBO) is a criterion that is minimized in order to produce a fitted autoencoder. The scvi-tools totalVAE elgorithm creates a nonlinear projection of the inputs to a 20-dimensional latent space, and a decoder that transforms object positions in the latent space to positions in the space of observations that are close to the original input positions.

The negative ELBO values are computed for samples from the training data and for “left out” validation samples. Details on the validation assessment would seem to be part of pytorch lightning. More investigation of scvi-tools code and doc are in order.

h = vae$history
npts = nrow(h$elbo_train)
plot(seq_len(npts), as.numeric(h$elbo_train[[1]]), ylim=c(1200,1400), 
  type="l", col="blue", main="Negative ELBO over training epochs",
  ylab="neg. ELBO", xlab="epoch")
graphics::legend(300, 1360, lty=1, col=c("blue", "orange"), legend=c("training", "validation"))
graphics::lines(seq_len(npts), as.numeric(h$elbo_validation[[1]]), type="l", col="orange")

1.4 Normalized quantities

On a CPU, the following can take a long time.

NE = vae$get_normalized_expression(n_samples=25L, 
    transform_batch=c("PBMC10k", "PBMC5k")

We provide the totalVI-based denoised quantities in

denoised = getTotalVINormalized5k10k()
vapply(denoised, dim, integer(2))
##      rna_nmlzd prot_nmlzd
## [1,]     10849      10849
## [2,]      4000         14

Note that these have features as columns, samples (cells) as rows.

## [1] "AL645608.8" "HES4"       "ISG15"      "TTLL10"     "TNFRSF18"  
## [6] "TNFRSF4"
## [1] "CD3_TotalSeqB"  "CD4_TotalSeqB"  "CD8a_TotalSeqB" "CD14_TotalSeqB"
## [5] "CD15_TotalSeqB" "CD16_TotalSeqB"

1.5 UMAP projection of Leiden clustering in the totalVI latent space

We have stored a fully loaded anndata instance for retrieval to inspect the latent space and clustering produced by the tutorial notebook procedure. The images produced here do not agree exactly with what I see in the colab pages for 0.20.0. The process was run in Jetstream2, not in colab.

full = getTotalVI5k10kAdata()
# class distribution
cllabs = full$obs$leiden_totalVI
blabs = full$obs$batch
## cllabs
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2287 1899 1138  866  787  660  637  587  461  375  334  261  260  208   59   19 
##   16 
##   11
um = full$obsm$get("X_umap") 
dd = data.frame(umap1=um[,1], umap2=um[,2], clust=factor(cllabs), batch=blabs)
ggplot(dd, aes(x=umap1, y=umap2, colour=clust)) + geom_point(size=.05) +
   guides(color = guide_legend(override.aes = list(size = 4)))