miloR 2.2.0
library(miloR)
library(SingleCellExperiment)
library(scater)
library(scran)
library(dplyr)
library(patchwork)
library(scRNAseq)
library(scuttle)
library(irlba)
library(BiocParallel)
library(ggplot2)
library(sparseMatrixStats)
Our first implementation of Milo used a negative binomial GLM to perform hypothesis testing and identify differentially abundant neighbourhoods. GLMs are incredibly powerful, but they have certain limitations. Notably, they assume that the dependent variable (nhood counts) are (conditionally) independently and identically distributed - that means there can’t be any relationship between the individual counts e.g. they can’t be derived from the same individual. This creates a dependency between count observations in the same nhood and can lead to inflated type I error rates. This is especially problematic when considering genetic analyses and organisms of the same species share a genetic ancestry, which in humans only goes back ~100,000 years. In the simplest example, imagine we performed single-cell experiments on individuals from multiple families, and from each family we included siblings and parents. Within a family the individuals would share on average 50% of their DNA, so the observations for DA testing wouldn’t be independent. For more distantly related individuals the relationships are smaller, but can be non-trivial, particularly as sample sizes increase. It’s not just genetics that leads to dependencies, multiple measurements from the same individual, e.g. multiple time points or from different organs, can also introduce dependency between observations.
We have opted to use GLMM to address this problem as they are very powerful and can explicitly account for fairly arbitrary dependencies, as long as they can be encoded either as a multi-level factor variable (sometimes referred to as clusters in the mixed effect model literature) or by an nXn matrix.
For the purpose of demonstrating how to use Milo in GLMM mode I’ll use a data set KotliarovPBMC
from the scRNAseq
package. These data are derived from SLE patients with
variable treatment responses. This should allow us to model the batching as a random effect variable while testing for differential abundance between the high and low drug
responders.
We will use the KotliarovPBMCData
data set as there are multiple groups that we can compare.
# uncomment the row below to allow multi-processing and comment out the SerialParam line.
# bpparam <- MulticoreParam(workers=4)
bpparam <- SerialParam()
register(bpparam)
pbmc.sce <- KotliarovPBMCData(mode="rna", ensembl=TRUE) # download the PBMC data from Kotliarov _et al._
## see ?scRNAseq and browseVignettes('scRNAseq') for documentation
## loading from cache
## loading from cache
## require("ensembldb")
## Warning: Unable to map 11979 of 32738 requested IDs.
## see ?scRNAseq and browseVignettes('scRNAseq') for documentation
## loading from cache
# downsample cells to reduce compute time
prop.cells <- 0.75
n.cells <- floor(ncol(pbmc.sce) * prop.cells)
set.seed(42)
keep.cells <- sample(colnames(pbmc.sce), size=n.cells)
pbmc.sce <- pbmc.sce[, colnames(pbmc.sce) %in% keep.cells]
# downsample the number of samples
colData(pbmc.sce)$ObsID <- paste(colData(pbmc.sce)$tenx_lane, colData(pbmc.sce)$sampleid, sep="_")
n.samps <- 80
keep.samps <- sample(unique(colData(pbmc.sce)$ObsID), size=n.samps)
keep.cells <- rownames(colData(pbmc.sce))[colData(pbmc.sce)$ObsID %in% keep.samps]
pbmc.sce <- pbmc.sce[, colnames(pbmc.sce) %in% keep.cells]
pbmc.sce
## class: SingleCellExperiment
## dim: 20759 28105
## metadata(0):
## assays(1): counts
## rownames(20759): ENSG00000284557 ENSG00000237613 ... ENSG00000274412
## ENSG00000283767
## rowData names(1): originalName
## colnames(28105): AAACCTGAGAGCCCAA_H1B1ln1 AAACCTGCAGTATCTG_H1B1ln1 ...
## TTTGTCATCGGTTCGG_H1B2ln6 TTTGTCATCTACCTGC_H1B2ln6
## colData names(25): nGene nUMI ... timepoint ObsID
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
set.seed(42)
# remove sparser cells
drop.cells <- colSums(counts(pbmc.sce)) < 1000
pbmc.sce <- computePooledFactors(pbmc.sce[, !drop.cells], BPPARAM=bpparam)
pbmc.sce <- logNormCounts(pbmc.sce)
pbmc.hvg <- modelGeneVar(pbmc.sce)
pbmc.hvg$FDR[is.na(pbmc.hvg$FDR)] <- 1
pbmc.sce <- runPCA(pbmc.sce, subset_row=rownames(pbmc.sce)[pbmc.hvg$FDR < 0.1], scale=TRUE, ncomponents=50, assay.type="logcounts", name="PCA", BPPARAM=bpparam)
pbmc.sce
## class: SingleCellExperiment
## dim: 20759 24712
## metadata(0):
## assays(2): counts logcounts
## rownames(20759): ENSG00000284557 ENSG00000237613 ... ENSG00000274412
## ENSG00000283767
## rowData names(1): originalName
## colnames(24712): AAACCTGAGAGCCCAA_H1B1ln1 AAACCTGCAGTATCTG_H1B1ln1 ...
## TTTGTCATCGAGAACG_H1B2ln6 TTTGTCATCTACCTGC_H1B2ln6
## colData names(26): nGene nUMI ... ObsID sizeFactor
## reducedDimNames(1): PCA
## mainExpName: NULL
## altExpNames(0):
set.seed(42)
pbmc.sce <- runUMAP(pbmc.sce, n_neighbors=30, pca=50, BPPARAM=bpparam) # add a UMAP for plotting results later
pbmc.milo <- Milo(pbmc.sce) # from the SCE object
reducedDim(pbmc.milo, "UMAP") <- reducedDim(pbmc.sce, "UMAP")
plotUMAP(pbmc.milo, colour_by="adjmfc.time") + plotUMAP(pbmc.milo, colour_by="sampleid")