Contents

1 Introduction

Convex Analysis of Mixtures (CAM) is a fully unsupervised computational method to analyze tissue samples composed of unknown numbers and varying proportions of distinct subpopulations (Wang et al. 2016). CAM assumes that the measured expression level is the weighted sum of each subpopulation’s expression, where the contribution from a single subpopulation is proportional to the abundance and specific expression of that subpopulation. This linear mixing model can be formulated as \(\mathbf{X'=AS'}\). CAM can identify molecular markers directly from the original mixed expression matrix, \(\mathbf{X}\), and further estimate the constituent proportion matrix, \(\mathbf{A}\), as well as subpopulation-specific expression profile matrix, \(\mathbf{S}\).

debCAM is an R package developed for tissue heterogeneity characterization by CAM algorithm. It provides basic functions to perform unsupervised deconvolution on mixture expression profiles by CAM and some auxiliary functions to help understand the subpopulation-specific results. debCAM also implements functions to perform supervised deconvolution based on prior knowledge of molecular markers, S matrix or A matrix. Semi-supervised deconvolution can also be achieved by combining molecular markers from CAM and from prior knowledge to analyze mixture expressions.

2 Quick Start

The function CAM() includes all necessary steps to decompose a matrix of mixture expression profiles. There are some optional steps upstream of CAM() that downsample the matrix for reducing running time. Each step in CAM() can also be performed separately if you prefer a more flexible workflow. More details will be introduced in sections below.

Starting your analysis by CAM(), you need to specify the range of possible subpopulation numbers and the percentage of low/high-expressed molecules to be removed. Typically, 30% ~ 50% low-expressed genes can be removed from gene expression data. Much less low-expressed proteins are removed, e.g. 0% ~ 10%, due to a limited number of proteins in proteomics data. The removal of high-expressed molecules has much less impact on results, usually set to be 0% ~ 10%.

rCAM <- CAM(data, K = 2:5, thres.low = 0.30, thres.high = 0.95)

3 Unsupervised Deconvolution

3.1 Datatypes and Input Format

Theoretically, debCAM accepts any molecular expression data types as long as the expressions follow the linear mixing model. We have validated the feasibility of CAM in gene expression data (microarray, RNAseq), proteomics data and DNA methylation data. Requirements for the input expression data:

  • be in non-log linear space with non-negative numerical values (i.e.>=0).
  • be already processed by normalization and batch effect removal.
  • no missing values; the molecules containing missing values should be removed prior to CAM.
  • no all-zero expressed molecules; otherwise, all-zero expressed molecules will be removed internally.

The input expression data should be stored in a matrix. Data frame, SummarizedExperiment or ExpressionSet object is also accepted and will be internally coerced into a matrix format before analysis. Each column in the matrix should be a tissue sample. Each row should be a probe/gene/protein/etc. Row names should be provided so that CAM can return the names of detected markers. Otherwise, rows will be automatically named by 1,2,3,…

3.2 CAM Workflow

We use a data set downsampled from GSE19830 as an example to show CAM workflow. This data set provides gene expression profiles of pure tissues (brain, liver, lung) and their biological mixtures with different proportions.

library(debCAM)

data(ratMix3) 
#ratMix3$X: X matrix containing mixture expression profiles to be analyzed
#ratMix3$A: ground truth A matrix containing proportions
#ratMix3$S: ground truth S matrix containing subpopulation-specific expression profiles

data <- ratMix3$X
#10000 genes * 21 tissues
#meet the input data requirements

3.2.1 Analysis by CAM

Unsupervised deconvolution can be achieved by the function CAM() with a simple setting as Section 2 introduced. Other critical parameters are dim.rdc (reduced data dimension) and cluster.num (number of clusters). Increasing them will bring more time complexity. We can also specify cores for parallel computing configured by BiocParallel. cores = 0 will disable parallel computing. No cores argument will invoke one core for each element in K. Setting the seed of the random number generator prior to CAM can generate reproducible results.

set.seed(111) # set seed for internal clustering to generate reproducible results
rCAM <- CAM(data, K = 2:5, thres.low = 0.30, thres.high = 0.95)
#CAM return three sub results: 
#rCAM@PrepResult contains details corresponding to data preprocessing.
#rCAM@MGResult contains details corresponding to marker gene clusters detection.
#rCAM@ASestResult contains details corresponding to A and S matrix estimation.

3.2.2 Model Selection by MDL Curves

We used MDL, a widely-adopted and consistent information theoretic criterion, to guide model selection. The underlying subpopulation number can be decided by minimizing the total description code length:

plot(MDL(rCAM), data.term = TRUE)

3.2.3 Estimated A and S matrix

The A and S matrix estimated by CAM with a fixed subpopulation number, e.g. K = 3, can be obtained by

Aest <- Amat(rCAM, 3)
Sest <- Smat(rCAM, 3)

3.2.4 Detected Marker Genes

The marker genes detected by CAM and used for A matrix estimation can be obtained by

MGlist <- MGsforA(rCAM, K = 3) #for three subpopulations

Data preprocessing has filtered many genes, among which there are also some biologically-meaningful marker genes. So we need to check each gene again to find all the possible markers. Two statistics based on the subpopulation-specific expressions are exploited to identify marker genes with certain thresholds. The first is OVE-FC (one versus everyone - fold change) (Yu et al. 2010). The second is the lower confidence bound of bootstrapped OVE-FC at \(\alpha\) level.

MGstat <- MGstatistic(data, Aest, boot.alpha = 0.05, nboot = 1000)
MGlist.FC <- lapply(seq_len(3), function(x) 
    rownames(MGstat)[MGstat$idx == x & MGstat$OVE.FC > 10])
MGlist.FCboot <- lapply(seq_len(3), function(x) 
    rownames(MGstat)[MGstat$idx == x & MGstat$OVE.FC.alpha > 10])

The thresholds above are set arbitrary and will impact the number of resulted markers significantly. Each subpopulation can also have different threshold. To make threshold setting more easier, we can also set thresholds to be quantile of fold changes and margin-of-errors of one subpopulation’s input markers. Margin-of-error is the distance between one gene and the simplex defined by column vectors of A matrix (Wang et al. 2016). Margin-of-error threshold can be relaxed to a value larger than the maximum.

MGlist.re <- reselectMG(data, MGlist, fc.thres='q0.2', err.thres='q1.2')
#q0.2: 0.2-quantile
#q1.2: 1-quantile (maximum) times 1.2

3.2.5 (Optional) Re-estimation

It is optional to re-estimate A and S matrix based on the new marker list and/or apply Alternating Least Square (ALS) method to further reduce mean squared error. Note that allowing too many iterations of ALS may bring the risk of a significant deviation from initial values. The constraint for methylation data, \(\mathbf{S\in[0,1]}\), will be imposed during re-estimation.

rre <- redoASest(data, MGlist.re, maxIter = 2, methy = FALSE)
#rre$Aest: re-estimated A matrix
#rre$Sest: re-estimated S matrix

3.2.6 Simplex Plot - 2D

Fundamental to the success of CAM is that the scatter simplex of mixed expressions is a rotated and compressed version of the scatter simplex of pure expressions, where the marker genes are located at each vertex. simplexplot() function can show the scatter simplex and detected marker genes in a 2D plot. The vertices in the high-dimensional simplex will still locate at extreme points of the low-dimensional simplex.

layout(matrix(c(1,2,3,4), 2, 2, byrow = TRUE))
simplexplot(data, Aest, MGlist, main=c('Initially detected markers'))
simplexplot(data, Aest, MGlist.FC, main=c('fc > 10'))
simplexplot(data, Aest, MGlist.FCboot, 
            main=c(expression(bold(paste('fc(bootstrap,',alpha,'=0.05) > 10')))))
simplexplot(data, Aest, MGlist.re, main=c("fc >= 'q0.2', error <= 'q1.2'"))

The colors and the vertex order displayed in 2D plot can be changed as

simplexplot(data, Aest, MGlist.FCboot, 
            data.extra=rbind(t(ratMix3$A),t(Aest)),
            corner.order = c(2,1,3), col = "blue", 
            mg.col = c("red","orange","green"),
            ex.col = "black", ex.pch = c(19,19,19,17,17,17))
legend("bottomright", cex=1.2, inset=.01, c("Ground Truth","CAM Estimation"),
       pch=c(19,17), col="black")

3.2.7 Simplex Plot - 3D

We can also observe the convex cone and simplex of mixture expressions in 3D space by PCA. Note that PCA cannot guarantee that vertices are still preserved as extreme points of the dimension-reduced simplex.

Code to show convex cone:

library(rgl)
Xp <- data %*% t(PCAmat(rCAM))
plot3d(Xp[, 1:3], col='gray', size=3, 
       xlim=range(Xp[,1]), ylim=range(Xp[,2:3]), zlim=range(Xp[,2:3]))
abclines3d(0,0,0, a=diag(3), col="black")
for(i in seq_along(MGlist)){
    points3d(Xp[MGlist[[i]], 1:3], col= rainbow(3)[i], size = 8)
}

Code to show simplex:

library(rgl)
clear3d()
Xproj <- XWProj(data, PCAmat(rCAM))
Xp <- Xproj[,-1]
plot3d(Xp[, 1:3], col='gray', size=3, 
       xlim=range(Xp[,1:3]), ylim=range(Xp[,1:3]), zlim=range(Xp[,1:3]))
abclines3d(0,0,0, a=diag(3), col="black")
for(i in seq_along(MGlist)){
    points3d(Xp[MGlist[[i]], 1:3], col= rainbow(3)[i], size = 8)
}

3.2.8 Validation with Ground Truth

If the ground truth A and S matrix are available, the estimation from CAM can be evaluated:

cor(Aest, ratMix3$A)
#>           Liver      Brain       Lung
#> [1,] -0.3963187  0.9839210 -0.4898373
#> [2,]  0.9869012 -0.5991347 -0.4865204
#> [3,] -0.4769864 -0.5245692  0.9856413
cor(Sest, ratMix3$S)
#>          Liver     Brain      Lung
#> [1,] 0.5047310 0.9752287 0.5648873
#> [2,] 0.9732946 0.2578350 0.3488423
#> [3,] 0.4680238 0.5741921 0.9935207

Considering the presence of many co-expressed genes (housekeeping genes) may dominate the correlation coefficients between ground truth and estimated expression profiles, it is better to assess correlation coefficients over marker genes only.

unlist(lapply(seq_len(3), function(k) {
    k.match <- which.max(cor(Aest[,k], ratMix3$A));
    mgk <- MGlist.FCboot[[k]];
    corr <- cor(Sest[mgk, k], ratMix3$S[mgk, k.match]);
    names(corr) <- colnames(ratMix3$A)[k.match];
    corr}))
#>     Brain     Liver      Lung 
#> 0.9982228 0.9950878 0.9984602

3.3 Alternative CAM Workflow

There major steps in CAM (data preprocessing, marker gene cluster detection, and matrix decomposition) can also be performed separately as a more flexible choice.

set.seed(111)

#Data preprocession
rPrep <- CAMPrep(data, thres.low = 0.30, thres.high = 0.95)
#> outlier cluster number: 0
#> convex hull cluster number: 48

#Marker gene cluster detection with a fixed K
rMGC <- CAMMGCluster(3, rPrep)

#A and S matrix estimation
rASest <- CAMASest(rMGC, rPrep, data)

#Obtain A and S matrix
Aest <- Amat(rASest)
Sest <- Smat(rASest)

#obtain marker gene list detected by CAM and used for A estimation
MGlist <- MGsforA(PrepResult = rPrep, MGResult = rMGC)

#obtain a full list of marker genes 
MGstat <- MGstatistic(data, Aest, boot.alpha = 0.05, nboot = 1000)
MGlist.FC <- lapply(seq_len(3), function(x) 
    rownames(MGstat)[MGstat$idx == x & MGstat$OVE.FC > 10])
MGlist.FCboot <- lapply(seq_len(3), function(x) 
    rownames(MGstat)[MGstat$idx == x & MGstat$OVE.FC.alpha > 10])
MGlist.re <- reselectMG(data, MGlist, fc.thres='q0.2', err.thres='q1.2')

3.4 Optional Sample Clustering

We have implemented PCA in CAM()/CAMPrep() to reduce data dimensions. Sample clustering, as another data dimension reduction method, is optional prior to CAM()/CAMPrep().

#clustering
library(apcluster)
apres <- apclusterK(negDistMat(r=2), t(data),  K = 10) 
#> Trying p = -2427050 
#>    Number of clusters: 10 
#> Trying p = -15246196 
#>    Number of clusters: 7 
#> Trying p = -8124448 (bisection step no. 1 )
#>    Number of clusters: 7 
#> Trying p = -4563575 (bisection step no. 2 )
#>    Number of clusters: 8 
#> Trying p = -2783138 (bisection step no. 3 )
#>    Number of clusters: 9 
#> 
#> Number of clusters: 9 for p = -2783138
#You can also use apcluster(), but need to make sure the number of clusters is large than potential subpopulation number.

data.clusterMean <- lapply(slot(apres,"clusters"), 
                           function(x) rowMeans(data[, x, drop = FALSE]))
data.clusterMean <- do.call(cbind, data.clusterMean)

set.seed(111)
rCAM <- CAM(data.clusterMean, K = 2:5, thres.low = 0.30, thres.high = 0.95)
# or rPrep <- CAMPrep(data.clusterMean, thres.low = 0.30, thres.high = 0.95)

We can still follow the workflow in 3.2 or 3.3 to obtain marker gene list and estimated S matrix. However, the estimated A matrix is for the new data composed of cluster centers. The A matrix for the original data can be obtained by

Sest <- Smat(rCAM,3)
MGlist <- MGsforA(rCAM, K = 3)
Aest <- AfromMarkers(data, MGlist)
#(Optional) alternative re-estimation of A and S matrix
rre <- redoASest(data, MGlist, maxIter = 10) 

3.5 Case Study

3.5.1 GSE11058

GSE11058 run four immune cell lines on microarrays as well as their mixtures of various relative proportions. The code chunk below shows how to use CAM to blindly separate the four mixtures into four pure cell lines. Note that this data set contains 54613 probe/probesets. We can reduce running time by downsampling probe/probesets or remove more low-expressed genes (e.g. 70%).

#download data and phenotypes
library(GEOquery)
gsm<- getGEO('GSE11058')
pheno <- pData(phenoData(gsm[[1]]))$characteristics_ch1
mat <- exprs(gsm[[1]])
mat <- mat[-grep("^AFFX", rownames(mat)),]
mat.aggre <- sapply(unique(pheno), function(x) rowMeans(mat[,pheno == x]))
data <- mat.aggre[,5:8]

#running CAM
set.seed(111)
rCAM <- CAM(data, K = 4, thres.low = 0.70, thres.high = 0.95)
Aest <- Amat(rCAM, 4)
Aest
#>           [,1]      [,2]      [,3]       [,4]
#> [1,] 0.3883538 0.2389400 0.1039654 0.26874071
#> [2,] 0.2171408 0.4881505 0.2075359 0.08717281
#> [3,] 0.3480010 0.1810097 0.4342633 0.03672609
#> [4,] 0.3795312 0.3204192 0.2752069 0.02484273

#Use ground truth A to validate CAM-estimated A matrix
Atrue <- matrix(c(2.50, 0.50, 0.10, 0.02,
              1.25, 3.17, 4.95, 3.33,
              2.50, 4.75, 1.65, 3.33,
              3.75, 1.58, 3.30, 3.33), 4,4,
              dimnames = list(c("MixA", "MixB", "MixC","MixD"),
                               c("Jurkat", "IM-9", "Raji", "THP-1")))
Atrue <- Atrue / rowSums(Atrue)
Atrue
#>           Jurkat      IM-9      Raji     THP-1
#> MixA 0.250000000 0.1250000 0.2500000 0.3750000
#> MixB 0.050000000 0.3170000 0.4750000 0.1580000
#> MixC 0.010000000 0.4950000 0.1650000 0.3300000
#> MixD 0.001998002 0.3326673 0.3326673 0.3326673
cor(Aest, Atrue)
#>          Jurkat       IM-9       Raji       THP-1
#> [1,]  0.2958875 -0.2006075 -0.7497038  0.98630126
#> [2,] -0.1976556 -0.1508022  0.9935143 -0.88675357
#> [3,] -0.7918492  0.9725408 -0.4427615  0.03629895
#> [4,]  0.9981488 -0.8746620 -0.1050113  0.31145418

3.5.2 GSE41826 (methylation)

GSE41826 conducted a reconstitution experiment by mixing neuron and glial derived DNA from a single individual from 10% ~ 90%. The code chunk below shows how to use CAM to blindly separate 9 mixtures into neuron and glial specific CpG methylation quantification. Additional requirements of running CAM on methylation data:

  • Use beta-value, not M-value, as methylation quantification.
  • CpG sites in X or Y chromosomes must be removed if mixture tissues are from both males and females.
  • CpG sites needs to be downsampled since the huge number of CpG sites will slow the clustering process.
#download data
library(GEOquery)
gsm <- getGEO('GSE41826')
mixtureId <- unlist(lapply(paste0('Mix',seq_len(9)), 
    function(x) gsm[[1]]$geo_accession[gsm[[1]]$title==x]))
data <- gsm[[1]][,mixtureId]

#Remove CpG sites in sex chromosomes if tissues are from both males and females
#Not necessary in this example as mixtures are from the same subject
#gpl<- getGEO('GPL13534')
#annot<-dataTable(gpl)@table[,c('Name','CHR')]
#rownames(annot) <- annot$Name
#annot <- annot[rownames(data),]
#data <- data[annot$CHR != 'X' & annot$CHR != 'Y',]

#downsample CpG sites
featureId <- sample(seq_len(nrow(data)), 50000)

#Running CAM
#When input data has too many probes, lof has to be disabled due to space limit.
#When cluster.num is large, quick.select can be enable to increase the speed
rCAM <- CAM(data[featureId,], K = 2, thres.low = 0.10, thres.high = 0.60,
            cluster.num = 100, MG.num.thres = 10, 
            lof.thres = 0, quick.select = 20)
MGlist <- MGsforA(rCAM, K = 2)

#Identify markers from all CpG sites
MGlist.re <- reselectMG(data, MGlist, fc.thres='q0.2', err.thres='q1.2')

#re-esitmation with methylation constraint
rre <- redoASest(data, MGlist.re, maxIter = 20, methy = TRUE)

#Validation using ground truth A matrix
Atrue <- cbind(seq(0.1, 0.9, 0.1), seq(0.9, 0.1, -0.1))
cor(rre$Aest, Atrue)

We can also run CAM on unmethylated quantification, 1 - beta, to obtain quite similar results since the underlying linear mixing model is also applicable for unmethylated probe intensity.

rCAM <- CAM(1 - exprs(data[featureId,]),
            K = 2, thres.low = 0.10, thres.high = 0.60,
            cluster.num = 100, MG.num.thres = 10, 
            lof.thres = 0, quick.select = 20)

4 Supervised/Semi-supervised Deconvolution

4.1 Molecular markers are known

CAM algorithm can estimate A and S matrix based on blindly detected markers. Thus, we can also use part of CAM algorithm to estimate A and S matrix based on known markers.

The package provides AfromMarkers() to estimate A matrix from marker list.

Aest <- AfromMarkers(data, MGlist)
#MGlist is a list of vectors, each of which contains known markers for one subpopulation

Or we can use redoASest() to estimate A and S matrix from marker list and alternatively re-estimate two matrices by ALS to further reduce mean squared error. Note that allowing too many iterations of ALS may bring the risk of a significant deviation from initial values. The constraint for methylation data, \(\mathbf{S\in[0,1]}\), can be imposed during re-estimation.

rre <- redoASest(data, MGlist, maxIter = 10)
#MGlist is a list of vectors, each of which contains known markers for one subpopulation
#maxIter = 0: No re-estimation by ALS
#rre$Aest: estimated A matrix
#rre$Sest: estimated S matrix

4.2 S matrix is known

Many datasets provide expression profiles for purified cell lines or even every single cell, which can be treated as references of S matrix. Some methods use least squares techniques or support vector regression (Newman et al. 2015) to estimate A matrix based on known S matrix. debCAM will estimate A matrix by identified markers from known S matrix, which has better performance in terms of correlation coefficient with ground truth A matrix.

data <- ratMix3$X
S <- ratMix3$S #known S matrix

#Identify markers
pMGstat <- MGstatistic(S, c("Liver","Brain","Lung"))
pMGlist.FC <- lapply(c("Liver","Brain","Lung"), function(x) 
    rownames(pMGstat)[pMGstat$idx == x & pMGstat$OVE.FC > 10])

#Estimate A matrix
Aest <- AfromMarkers(data, pMGlist.FC)

#(Optional) alternative re-estimation
rre <- redoASest(data, pMGlist.FC, maxIter = 10) 

debCAM also supports estimating A matrix directly from known S matrix using least squares method. Marker identification from S matrix is still needed because only markers’ squared errors will be counted which facilitates (1) faster computational running time and (2) a greater signal-to-noise ratio owing to markers’ discriminatory power (Newman et al. 2015).

Aest <- redoASest(data, pMGlist.FC, S=S, maxIter = 0)$Aest

4.3 A matrix is known

With known A matrix, debCAM estimates S matrix using non-negative least squares (NNLS) from NMF, and further identify markers.

data <- ratMix3$X
A <- ratMix3$A #known A matrix

#Estimate S matrix
Sest <- t(NMF::.fcnnls(A, t(data))$coef)

#Identify markers
pMGstat <- MGstatistic(data, A)
pMGlist.FC <- lapply(unique(pMGstat$idx), function(x) 
    rownames(pMGstat)[pMGstat$idx == x & pMGstat$OVE.FC > 10])

#(Optional) alternative re-estimation
rre <- redoASest(data, pMGlist.FC, A=A,  maxIter = 10) 

4.4 Semi-supervised Deconvolution

When prior information of markers, S matrix and/or A matrix is available, semi-supervised deconvolution can also be performed by combining markers from prior information and markers identified by CAM. While supervised deconvolution cannot handle the underlying subpopulations without prior information, unsupervised deconvolution may miss the subpopulation without enough discrimination power. Therefore, semi-supervised deconvolution can take advantage of both methods.

Aest <- AfromMarkers(data, MGlist)
#MGlist is a list of vectors, each of which contains known markers and/or CAM-detected markers for one subpopulation

Or

rre <- redoASest(data, MGlist, maxIter = 10)
#MGlist is a list of vectors, each of which contains known markers and/or CAM-detected markers for one subpopulation
#maxIter = 0: No re-estimation by ALS
#rre$Aest: estimated A matrix
#rre$Sest: estimated S matrix

References

Newman, Aaron M., Chih Long Liu, Michael R. Green, Andrew J. Gentles, Weiguo Feng, Yue Xu, Chuong D. Hoang, Maximilian Diehn, and Ash A. Alizadeh. 2015. “Robust Enumeration of Cell Subsets from Tissue Expression Profiles.” Nat Meth 12 (5): 453–57. http://dx.doi.org/10.1038/nmeth.3337.

Wang, Niya, Eric P. Hoffman, Lulu Chen, Li Chen, Zhen Zhang, Chunyu Liu, Guoqiang Yu, David M. Herrington, Robert Clarke, and Yue Wang. 2016. “Mathematical Modelling of Transcriptional Heterogeneity Identifies Novel Markers and Subpopulations in Complex Tissues.” Scientific Reports 6: 18909. http://dx.doi.org/10.1038/srep18909.