Contents

Latent Embedding Multivariate Regression (LEMUR)

The goal of lemur is to simplify analysis the of multi-condition single-cell data. If you have collected a single-cell RNA-seq dataset with more than one condition, lemur predicts for each cell and gene how much the expression would change if the cell had been in the other condition. Furthermore, lemur finds neighborhoods of cells that show consistent differential expression. The results are statistically validated using a pseudo-bulk differential expression test on hold-out data using glmGamPoi or edgeR.

lemur implements a novel framework to disentangle the effects of known covariates, latent cell states, and their interactions. At the core, is a combination of matrix factorization and regression analysis implemented as geodesic regression on Grassmann manifolds. We call this latent embedding multivariate regression. For more details see our preprint.

Schematic of the matrix decomposition at the core of LEMUR

Installation

You can install lemur directly from Bioconductor (available since version 3.18). Just paste the following snippet into your R console:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("lemur")

Alternatively, you can install the package from Github using devtools:

devtools::install_github("const-ae/lemur")

lemur depends on recent features from glmGamPoi, so make sure that packageVersion("glmGamPoi") is larger than 1.12.0.

A word of caution

This package is being actively developed, and I am still making breaking changes. I would be delighted if you decide to try out the package. Please do open an issue if you think you found a bug, have an idea for a cool feature, or have any questions about how LEMUR works. Consider this an alpha release with the goal to gather feedback, but be aware that code written against the current version of lemur might not work in the future.

Quick start

library("lemur")
library("SingleCellExperiment")

fit <- lemur(sce, design = ~ patient_id + condition, n_embedding = 15)
fit <- align_harmony(fit)   # This step is optional
fit <- test_de(fit, contrast = cond(condition = "ctrl") - cond(condition = "panobinostat"))
nei <- find_de_neighborhoods(fit, group_by = vars(patient_id, condition))

A worked through example

We will demonstrate lemur using a dataset published by Zhao et al. (2021). The data consist of tumor biopsies from five glioblastomas which were treated with the drug panobinostat and with a control. Accordingly, we will analyze ten samples (patient-treatment combinations) using a paired experimental design.

We start by loading some required packages.

library("tidyverse")
library("SingleCellExperiment")
library("lemur")
set.seed(42)

We use a reduced-size version of the glioblastoma data that ships with the lemur package.

data(glioblastoma_example_data)
glioblastoma_example_data
#> class: SingleCellExperiment 
#> dim: 300 5000 
#> metadata(0):
#> assays(2): counts logcounts
#> rownames(300): ENSG00000210082 ENSG00000118785 ... ENSG00000167468
#>   ENSG00000139289
#> rowData names(6): gene_id symbol ... strand. source
#> colnames(5000): CGCCAGAGCGCA AGCTTTACTGCG ... TGAACAGTGCGT TGACCGGAATGC
#> colData names(10): patient_id treatment_id ... sample_id id
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

Initially, the data separates by the known covariates patient_id and condition.

orig_umap <- uwot::umap(as.matrix(t(logcounts(glioblastoma_example_data))))

as_tibble(colData(glioblastoma_example_data)) %>%
  mutate(umap = orig_umap) %>%
  ggplot(aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = patient_id, shape = condition), size = 0.5) +
    labs(title = "UMAP of logcounts")

We fit the LEMUR model by calling lemur(). We provide the experimental design using a formula. The elements of the formula can refer to columns of the colData of the SingleCellExperiment object.

We also set the number of latent dimensions (n_embedding), which has a similar interpretation as the number of dimensions in PCA.

The test_fraction argument sets the fraction of cells which are exclusively used to test for differential expression and not for inferring the LEMUR parameters. It balances the sensitivity to detect subtle patterns in the latent space against the power to detect differentially expressed genes.

fit <- lemur(glioblastoma_example_data, design = ~ patient_id + condition, 
             n_embedding = 15, test_fraction = 0.5)
#> Storing 50% of the data (2500 cells) as test data.
#> Regress out global effects using linear method.
#> Find base point for differential embedding
#> Fit differential embedding model
#> Initial error: 1.78e+06
#> ---Fit Grassmann linear model
#> Final error: 1.11e+06

fit
#> class: lemur_fit 
#> dim: 300 5000 
#> metadata(9): n_embedding design ... use_assay row_mask
#> assays(2): counts logcounts
#> rownames(300): ENSG00000210082 ENSG00000118785 ... ENSG00000167468
#>   ENSG00000139289
#> rowData names(6): gene_id symbol ... strand. source
#> colnames(5000): CGCCAGAGCGCA AGCTTTACTGCG ... TGAACAGTGCGT TGACCGGAATGC
#> colData names(10): patient_id treatment_id ... sample_id id
#> reducedDimNames(2): linearFit embedding
#> mainExpName: NULL
#> altExpNames(0):

The lemur() function returns a lemur_fit object which extends SingleCellExperiment. It supports subsetting and all the usual data accessor methods (e.g., nrow, assay, colData, rowData). In addition, lemur overloads the $ operator to allow easy access to additional fields produced by the LEMUR model. For example, the low-dimensional embedding can be accessed using fit$embedding:

Optionally, we can further align corresponding cells using manually annotated cell types (align_by_grouping) or an automated alignment procedure (e.g., align_harmony). This ensures that corresponding cells are close to each other in the fit$embedding.

fit <- align_harmony(fit)
#> Select cells that are considered close with 'harmony'

I will make a UMAP of the fit$embedding. This is similar to working on the integrated PCA space in a traditional single-cell analysis.

umap <- uwot::umap(t(fit$embedding))

as_tibble(fit$colData) %>%
  mutate(umap = umap) %>%
  ggplot(aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = patient_id), size = 0.5) +
    facet_wrap(vars(condition)) +
    labs(title = "UMAP of latent space from LEMUR")

Next, we will predict the effect of the panobinostat treatment for each gene and cell. The test_de function takes a lemur_fit object and returns the object with a new assay "DE". This assay contains the predicted log fold change between the conditions specified in contrast. Note that lemur implements a special notation for contrasts. Instead of providing a contrast vector or design matrix column names, you provide for each condition the levels, and lemur automatically forms the contrast vector. This makes the contrast more readable.

fit <- test_de(fit, contrast = cond(condition = "panobinostat") - cond(condition = "ctrl"))

We can pick any gene and show the differential expression pattern on the UMAP plot:

sel_gene <- "ENSG00000172020" # is GAP43

tibble(umap = umap) %>%
  mutate(de = assay(fit, "DE")[sel_gene,]) %>%
  ggplot(aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = de)) +
    scale_color_gradient2() +
    labs(title = "Differential expression on UMAP plot")

Alternatively, we can use the matrix of differential expression values (assay(fit, "DE")) to guide the selection of cell neighborhoods that show consistent differential expression. find_de_neighborhoods validates the results with a pseudobulked diferential expression test. For this it uses the fit$test_data which was put aside in the first lemur() call. In addition, find_de_neighborhoods assess if the difference between the conditions is significantly larger for the cells inside the neighborhood than the cells outside the neighborhood (see columns starting with did, short for difference-in-difference).

The group_by argument determines how the pseudobulk samples are formed. It specifies the columns in the fit$colData that are used to define a sample and is inspired by the group_by function in dplyr. Typically, you provide the covariates that were used for the experimental design plus the sample id (in this case patient_id).

neighborhoods <- find_de_neighborhoods(fit, group_by = vars(patient_id, condition))
#> Find optimal neighborhood using zscore.
#> Validate neighborhoods using test data
#> Form pseudobulk (summing counts)
#> Calculate size factors for each gene
#> Fit glmGamPoi model on pseudobulk data
#> Fit diff-in-diff effect

as_tibble(neighborhoods) %>%
  left_join(as_tibble(rowData(fit)[,1:2]), by = c("name" = "gene_id")) %>%
  relocate(symbol, .before = "name") %>%
  arrange(pval) %>%
  dplyr::select(symbol, neighborhood, name, n_cells, pval, adj_pval, lfc, did_lfc) 
#> # A tibble: 300 × 8
#>    symbol neighborhood  name            n_cells     pval adj_pval    lfc did_lfc
#>    <chr>  <I<list>>     <chr>             <int>    <dbl>    <dbl>  <dbl>   <dbl>
#>  1 MT1X   <chr [2,473]> ENSG00000187193    2473  5.54e-6  0.00166  3.21  -0.761 
#>  2 PMP2   <chr [3,933]> ENSG00000147588    3933  4.71e-5  0.00707 -1.34   0.760 
#>  3 NEAT1  <chr [3,703]> ENSG00000245532    3703  2.70e-4  0.0174   1.91  -0.641 
#>  4 POLR2L <chr [3,801]> ENSG00000177700    3801  3.16e-4  0.0174   1.24  -0.199 
#>  5 CALM1  <chr [3,980]> ENSG00000198668    3980  3.45e-4  0.0174   0.826 -0.0884
#>  6 MT1E   <chr [1,666]> ENSG00000169715    1666  3.47e-4  0.0174   3.21  -1.21  
#>  7 ATP5G3 <chr [3,982]> ENSG00000154518    3982  6.81e-4  0.0292   0.710 -0.325 
#>  8 SKP1   <chr [3,262]> ENSG00000113558    3262  8.46e-4  0.0317   0.745 -0.419 
#>  9 PLIN2  <chr [2,619]> ENSG00000147872    2619  1.09e-3  0.0345  -1.62   1.52  
#> 10 NUCKS1 <chr [3,909]> ENSG00000069275    3909  1.24e-3  0.0345  -1.05  -0.0675
#> # ℹ 290 more rows

To continue, we investigate one gene for which the neighborhood shows a significant differential expression pattern: here we choose a CXCL8 (also known as interleukin 8), an important inflammation signalling molecule. We see that it is upregulated by panobinostat in a subset of cells (blue). We chose this gene because it (1) had a significant change between panobinostat and negative control condition (adj_pval column) and (2) showed much larger differential expression for the cells inside the neighborhood than for the cells outside (did_lfc column).

sel_gene <- "ENSG00000169429" # is CXCL8

tibble(umap = umap) %>%
  mutate(de = assay(fit, "DE")[sel_gene,]) %>%
  ggplot(aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = de)) +
    scale_color_gradient2() +
    labs(title = "Differential expression on UMAP plot")

To plot the boundaries of the differential expression neighborhood, we create a helper dataframe and use the geom_density2d function from ggplot2. To avoid the cutting of the boundary to the extremes of the cell coordinates, add lims to the plot with an appropriately large limit.

neighborhood_coordinates <- neighborhoods %>%
  dplyr::filter(name == sel_gene) %>%
  unnest(c(neighborhood)) %>%
  dplyr::rename(cell_id = neighborhood) %>%
  left_join(tibble(cell_id = rownames(umap), umap), by = "cell_id") %>%
  dplyr::select(name, cell_id, umap)

tibble(umap = umap) %>%
  mutate(de = assay(fit, "DE")[sel_gene,]) %>%
  ggplot(aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = de)) +
    scale_color_gradient2() +
    geom_density2d(data = neighborhood_coordinates, breaks = 0.5, 
                   contour_var = "ndensity", color = "black") +
    labs(title = "Differential expression with neighborhood boundary")

To summarize the results, we make a volcano plot of the differential expression results to better understand the expression differences across all genes.

neighborhoods %>%
  drop_na() %>%
  ggplot(aes(x = lfc, y = -log10(pval))) +
    geom_point(aes(color  = adj_pval < 0.1)) +
    labs(title = "Volcano plot of the neighborhoods")


neighborhoods %>%
  drop_na() %>%
  ggplot(aes(x = n_cells, y = -log10(pval))) +
    geom_point(aes(color  = adj_pval < 0.1)) +
    labs(title = "Neighborhood size vs neighborhood significance")

This analysis was conducted without using any cell type information. Often, additional cell type information is available or can be annotated manually. Here, we can for example distinguish the tumor cells from cells of the microenvironment, because the tumors had a chromosome 10 deletion and chromosome 7 duplication. We build a simple classifier to distinguish the cells accordingly. (This is just to illustrate the process; for a real analysis, we would use more sophisticated methods.)

tumor_label_df <- tibble(cell_id = colnames(fit),
       chr7_total_expr = colMeans(logcounts(fit)[rowData(fit)$chromosome == "7",]),
       chr10_total_expr = colMeans(logcounts(fit)[rowData(fit)$chromosome == "10",])) %>%
  mutate(is_tumor = chr7_total_expr > 0.8 & chr10_total_expr < 2.5)


ggplot(tumor_label_df, aes(x = chr10_total_expr, y = chr7_total_expr)) +
    geom_point(aes(color = is_tumor), size = 0.5) +
    geom_hline(yintercept = 0.8) +
    geom_vline(xintercept = 2.5) +
    labs(title = "Simple gating strategy to find tumor cells")


tibble(umap = umap) %>%
  mutate(is_tumor = tumor_label_df$is_tumor) %>%
  ggplot(aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = is_tumor), size = 0.5) +
    labs(title = "The tumor cells are enriched in parts of the big blob") +
    facet_wrap(vars(is_tumor))