Multiple co-inertia analysis (MCIA) is a member of the family of joint dimensionality reduction (jDR) methods that extend unsupervised dimension reduction techniques such as Principal Components Analysis (PCA) and Non-negative Matrix Factorization (NMF) to datasets with multiple data blocks (alternatively called views) (Cantini, 2021).
Here, we present a new implementation in R of MCIA, nipalsMCIA
,
that uses an extension of Non-linear Iterative Partial Least Squares (NIPALS)
to solve the MCIA optimization problem (Hanafi, 2011).
This implementation has several features, including speed-up over approaches that
employ the Singular Value Decomposition (SVD), several options for pre-processing
and deflation to customize algorithm performance, methodology to perform
out-of-sample global embedding, and analysis and visualization capabilities to
maximize result interpretation.
While there exist additional implementations of MCIA (e.g. mogsa, omicade4), ours is unique in providing a pipeline that incorporates pre-processing data options including those present in the original development of MCIA (including a theoretically grounded calculation of inertia, or total variance) with an iterative solver that shows speed-up for larger datasets, and is explicitly designed for simultaneous ease of use as a tool for multi-view data decomposition as well as a foundation for theoretical and computational development of MCIA and related methodology. A manuscript detailing our implementation is forthcoming.
In this vignette, we will cover the most important functions
within the nipalsMCIA
package as well as downstream analyses that can help
interpret the MCIA decomposition using a cancer data set from
Meng et al., 2016 that includes 21
subjects with three data blocks. The data blocks include mRNA levels
(12895 features), microRNA levels (537 features) and protein levels
(7016 features).
The nipals_multiblock
function performs MCIA using the NIPALS algorithm.
nipals_multiblock
outputs a decomposition that includes a low-dimensional
embedding of the data in the form of global scores, and the contributions of
the data blocks (block score weights) and features (global loadings) to these
same global scores. nipalsMCIA
provides several additional functions to
visualize, analyze, and interpret these results.
The nipals_multiblock
function accepts as input a MultiAssayExperiment
(MAE) object. Such objects represent a modern classed-based approach to
organizing multi-omics data in which each assay can be stored as an individual
experiment alongside relevant metadata for samples and experiments. If users
have a list of data blocks with matching sample names (and optional sample-level
metadata), we provide a simple conversion function (simple_mae.R
) to generate
an MAE object. For more sophisticated MAE object construction, please consult
the MAE documentation.
In the context of the NCI-60 data set and this vignette, we will show you the power of MCIA to find important relationships between mRNA, microRNA and proteins. More specifically, we will show you how to interpret the global factor scores in Part 1: Interpreting Global Factor Scores and global loadings in Part 2: Interpreting Global Loadings.
# devel version
# install.packages("devtools")
devtools::install_github("Muunraker/nipalsMCIA", ref = "devel",
force = TRUE, build_vignettes = TRUE) # devel version
# release version
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("nipalsMCIA")
library(ComplexHeatmap)
library(dplyr)
library(fgsea)
library(ggplot2)
library(ggpubr)
library(nipalsMCIA)
library(stringr)
# NIPALS starts with a random vector
set.seed(42)
The NCI-60 data set has been included with the nipalsMCIA
package and is
easily available as shown below:
# load the dataset which uses the name data_blocks
data(NCI60)
# examine the contents
data_blocks$miRNA[1:3, 1:3]
MI0000060_miRNA | MI0000061_miRNA | MI0000061.1_miRNA | |
---|---|---|---|
CNS.SF_268 | 11.91 | 6.71 | 13.11 |
CNS.SF_295 | 11.94 | 7.13 | 12.86 |
CNS.SF_539 | 11.50 | 5.79 | 12.10 |
data_blocks$mrna[1:3, 1:3]
5-HT3C2_1_mrna | A1BG-AS1_2_mrna | A2LD1_3_mrna | |
---|---|---|---|
CNS.SF_268 | 0.53 | 0.35 | -0.05 |
CNS.SF_295 | -0.42 | 0.54 | -1.04 |
CNS.SF_539 | 0.00 | 0.80 | 0.85 |
data_blocks$prot[1:3, 1:3]
STAU1_1_prot | NRAS_2_prot | HRAS_3_prot | |
---|---|---|---|
CNS.SF_268 | 5.712331 | 7.385177 | 5.758845 |
CNS.SF_295 | 0.000000 | 6.327175 | 0.000000 |
CNS.SF_539 | 0.000000 | 6.597432 | 0.000000 |
To convert data_blocks into an MAE object we provide the simple_mae()
function:
data_blocks_mae <- simple_mae(data_blocks, row_format = "sample")
We can compute the MCIA decomposition for \(r\) global factors. For our example, we take \(r=10\).
set.seed(42)
mcia_results <- nipals_multiblock(data_blocks_mae,
col_preproc_method = "colprofile",
num_PCs = 10, tol = 1e-12, plots = "none")
The result is an NipalsResult object containing several outputs from the decomposition:
slotNames(mcia_results)
## [1] "global_scores" "global_loadings" "block_score_weights"
## [4] "block_scores" "block_loadings" "eigvals"
## [7] "col_preproc_method" "block_preproc_method" "block_variances"
## [10] "metadata"
We describe the first two in more detail below, and will discuss several others in the remainder of the vignette. For additional details on the decomposition, see (Hanafi, 2011, Mattesich, 2022).
The global_scores
matrix is represented by \(\mathbf{F}\) with dimensions
\(n \times r\), where \(n\) is the number of samples and \(r\) is the number of
factors chosen by using the num_PCs = r
argument. Each column \(j\) of this matrix
represents the global scores for factor \(j\),
\[ \mathbf{F} = \begin{pmatrix} | & |& & |\\ \mathbf{f}^{(1)} &\mathbf{f}^{(2)} & \dots & \mathbf{f}^{(r)}\\ | & |& & | \end{pmatrix} \in \mathbb{R}^{n \times r} \]
This matrix encodes a low-dimensional representation of the data set, with the \(i\)-th row representing a set of \(r\)-dimensional coordinates for the \(i\)-th sample.
The global_loadings
matrix is represented by \(\mathbf{A}\) that is \(p \times r\),
where \(p\) is the number of features across all omics and \(r\) is as before.
Each column \(j\) of this matrix represents the global loadings for factor \(j\),
i.e.
\[ \mathbf{A} = \begin{pmatrix} | & |& & |\\ \mathbf{a}^{(1)} &\mathbf{a}^{(2)} & \dots & \mathbf{a}^{(r)}\\ | & |& & | \end{pmatrix} \in \mathbb{R}^{p \times r} \]
This matrix encodes the contribution (loading
) of each feature to the global
score.
The remainder of this vignette will be broken down into two sections, Part 1: Interpreting Global Factor Scores and Part 2: Interpreting Global Loadings where we show how to interpret \(\mathbf{F}\) and \(\mathbf{A}\), respectively.
In the introduction we showed how to calculate the MCIA decomposition using
nipals_multiblock()
but used the parameter plots = "none"
to avoid the
default plotting behavior of this function. By default, this function will
generate two plots which help establish an initial intuition for the MCIA
decomposition. Here we will re-run nipals_multiblock()
with the default plots
parameter (all
):
set.seed(42)
mcia_results <- nipals_multiblock(data_blocks_mae,
col_preproc_method = "colprofile",
num_PCs = 10, tol = 1e-12)
The first plot visualizes each sample using factor 1 and 2 as a lower dimensional representation (factor plot).
The second plot is a scree plot of normalized singular values corresponding to the variance explained by each factor.
For clustering, it is useful to look at global factor scores without block
factor scores. The projection_plot()
function can be used to generate such a
plot using projection = "global"
.
projection_plot(mcia_results, projection = "global", orders = c(1, 2))
In addition, scores can be colored by a meaningful label such as cancer type
which is highly relevant to NCI-60. To do so, the colData slot of the associated
MAE object must be loaded with sample-level metadata prior to invoking
projection_plot()
. The sample metadata is composed of row names corresponding to
the primary sample names of the MAE object, and columns contain different
metadata (e.g. age, disease status, etc). For instance, each of the 21 samples
in the NCI-60 dataset represents a cell line with one of three cancer types:
CNS, Leukemia, or Melanoma. We have provided this metadata as part of the
data(NCI60)
dataset and we next show how it can be included in the resulting MAE
object using the colData
parameter in simple_mae()
:
# preview of metadata
head(metadata_NCI60)
cancerType | |
---|---|
CNS.SF_268 | CNS |
CNS.SF_295 | CNS |
CNS.SF_539 | CNS |
CNS.SNB_19 | CNS |
CNS.SNB_75 | CNS |
CNS.U251 | CNS |
# loading of mae with metadata
data_blocks_mae <- simple_mae(data_blocks, row_format = "sample",
colData = metadata_NCI60)
We now rerun nipals_multiblock()
using the updated MAE object, where the
colData
is passed to the metadata
slot of the NipalsResult
instance, and
then visualize the global factor scores using the projection_plot()
function.
# adding metadata as part of the nipals_multiblock() function
set.seed(42)
mcia_results <- nipals_multiblock(data_blocks_mae,
col_preproc_method = "colprofile",
plots = "none",
num_PCs = 10, tol = 1e-12)
The color_col
argument of projection_plot()
can then be used to determine
which column of metadata
is used for coloring the individual data points, in
this case cancerType
. color_pal
is used to assign a color palette and
requires a vector of colors (i.e. c("blue", "red", "green")
). To help create
this vector we also provide get_metadata_colors()
, a helper function (used
below) that can be used with a scales::<function>
to return an appropriate
vector of colors. Note: colors are applied by lexicographically sorting the list
of unique metadata values then assigning the first color to the first value,
second with second and so on.
# meta_colors = c("blue", "grey", "yellow") can use color names
# meta_colors = c("#00204DFF", "#7C7B78FF", "#FFEA46FF") can use hex codes
meta_colors <- get_metadata_colors(mcia_results, color_col = 1,
color_pal_params = list(option = "E"))
projection_plot(mcia_results, projection = "global", orders = c(1, 2),
color_col = "cancerType", color_pal = meta_colors,
legend_loc = "bottomleft")
Using this plot one can observe that global factor scores for factor 1 and 2 can separate samples into their cancer types.
A heatmap can be used to cluster samples based on global scores across all factors
using global_scores_heatmap()
. The samples can be colored by the associated metadata
using color_cor
+ color_pal
as shown below.
global_scores_heatmap(mcia_results = mcia_results,
color_col = "cancerType", color_pal = meta_colors)