This document is a presentation of the R implementation of the tool CIMICE. It shows the main features of this software and how it is built as a modular pipeline, with the goal of making it easy to change and update.
CIMICE is a tool in the field of tumor phylogenetics and its goal is to build a Markov Chain (called Cancer Progression Markov Chain, CPMC) in order to model tumor subtypes evolution. The input of CIMICE is a Mutational Matrix, so a boolean matrix representing altered genes in a collection of samples. These samples are assumed to be obtained with single-cell DNA analysis techniques and the tool is specifically written to use the peculiarities of this data for the CMPC construction.
CIMICE data processing and analysis can be divided in four section:
These steps will be presented in the following sections.
This implementation of CIMICE is built as a single library on its own:
library(CIMICE)
and it requires the following libraries:
# Dataframe manipulation
library(dplyr)
# Plot display
library(ggplot2)
# Improved string operations
library(glue)
# Dataframe manipulation
library(tidyr)
# Graph data management
library(igraph)
# Remove transitive edges on a graph
# library(relations)
# Interactive graph visualization
library(networkD3)
# Interactive graph visualization
library(visNetwork)
# Correlation plot visualization
library(ggcorrplot)
# Functional R programming
library(purrr)
# Graph Visualization
library(ggraph)
# sparse matrices
library(Matrix)
CIMICE requires a boolean dataframe as input, structured as follows:
It is possible to load this information from a file. The default input format for CIMICE is the “CAPRI/CAPRESE” TRONCO format:
This is a scheme of CIMICE’s input format:
s\g gene_1 gene_2 ... gene_n
sample_1 1 0 ... 0
...
sample_m 1 1 ... 1
and this an example on how to load a dataset from the file system:
# Read input dataset in CAPRI/CAPRESE format
dataset.big <- read_CAPRI(system.file("extdata", "example.CAPRI", package = "CIMICE", mustWork = TRUE))
sfmF | ulaF | dapB | yibT | phnL | ispA | |
---|---|---|---|---|---|---|
GCF_001281685.1_ASM128168v1_genomic.fna | 1 | 1 | 0 | 0 | 0 | 0 |
GCF_001281725.1_ASM128172v1_genomic.fna | 0 | 0 | 0 | 0 | 0 | 0 |
GCF_001281755.1_ASM128175v1_genomic.fna | 0 | 0 | 0 | 0 | 0 | 0 |
GCF_001281775.1_ASM128177v1_genomic.fna | 0 | 0 | 0 | 0 | 0 | 0 |
GCF_001281795.1_ASM128179v1_genomic.fna | 0 | 1 | 0 | 0 | 0 | 0 |
GCF_001281815.1_ASM128181v1_genomic.fna | 0 | 0 | 0 | 0 | 0 | 0 |
## [1] "ncol: 999 - nrow: 160"
Another option is to define directly the dataframe in R. This is made easy by the functions make_dataset
and update_df
, used as follows:
# genes
dataset <- make_dataset(A,B,C,D) %>%
# samples
update_df("S1", 0, 0, 0, 1) %>%
update_df("S2", 1, 0, 0, 0) %>%
update_df("S3", 1, 0, 0, 0) %>%
update_df("S4", 1, 0, 0, 1) %>%
update_df("S5", 1, 1, 0, 1) %>%
update_df("S6", 1, 1, 0, 1) %>%
update_df("S7", 1, 0, 1, 1) %>%
update_df("S8", 1, 1, 0, 1)
with the following outcome:
## 8 x 4 Matrix of class "dgeMatrix"
## A B C D
## S1 0 0 0 1
## S2 1 0 0 0
## S3 1 0 0 0
## S4 1 0 0 1
## S5 1 1 0 1
## S6 1 1 0 1
## S7 1 0 1 1
## S8 1 1 0 1
In the case your data is composed by samples with associated frequencies it is possible to use an alternative format that we call “CAPRIpop”:
s/g gene_1 gene_2 ... gene_n freq
sample_1 1 0 ... 0 freq_s1
...
sample_m 1 1 ... 1 freq_sm
where the freq
column is mandatory and sample must not be repeated. Frequencies
in the freq
column will be automatically normalized. This format is meant
to be used with the functions quick_run(dataset, mode="CAPRIpop")
for the
full analysis and dataset_preprocessing_population(...)
for the preprocessing
stage only. The subsequent operations remain otherwise equal to those
of the default format.
Another option is to compute a mutational matrix directly from a MAF file, which can be done as follows:
# path to MAF file
read_MAF(system.file("extdata", "paac_jhu_2014_500.maf", package = "CIMICE", mustWork = TRUE))[1:5,1:5]
## -Reading
## -Validating
## -Silent variants: 49
## -Summarizing
## --Possible FLAGS among top ten genes:
## HMCN1
## -Processing clinical data
## --Missing clinical data
## -Finished in 0.096s elapsed (0.118s cpu)
## 5 x 5 sparse Matrix of class "dgCMatrix"
## CSMD2 C2orf61 DCHS2 DPYS GPR158
## ACINAR28 1 1 . . .
## ACINAR27 1 . . . .
## ACINAR25 1 . . . .
## ACINAR02 . . 1 . 1
## ACINAR13 . . 1 . .
This implementation of CIMICE includes simple functions to quickly analyze the distributions of mutations among genes and samples.
The following code displays an histogram showing the distribution of the number of mutations hitting a gene:
gene_mutations_hist(dataset.big)
And this does the same but from the samples point of view:
sample_mutations_hist(dataset.big, binwidth = 10)
In case of huge dataset, it could be necessary to focus only on a subset of the input samples or genes. The following procedures aim to provide an easy way to do so when the goal is to use the most (or least) mutated samples or genes.
Keeps the first \(n\) (=100) most mutated genes:
select_genes_on_mutations(dataset.big, 100)
eptA | yohC | yedK | yeeO | narU | rsxC | |
---|---|---|---|---|---|---|
GCF_001281685.1_ASM128168v1_genomic.fna | 1 | 1 | 1 | 1 | 1 | 1 |
GCF_001281725.1_ASM128172v1_genomic.fna | 1 | 1 | 1 | 0 | 0 | 0 |
GCF_001281755.1_ASM128175v1_genomic.fna | 1 | 1 | 1 | 1 | 1 | 1 |
GCF_001281775.1_ASM128177v1_genomic.fna | 1 | 1 | 1 | 1 | 1 | 0 |
GCF_001281795.1_ASM128179v1_genomic.fna | 1 | 1 | 1 | 1 | 1 | 1 |
GCF_001281815.1_ASM128181v1_genomic.fna | 1 | 1 | 1 | 1 | 1 | 1 |
## [1] "ncol: 100 - nrow: 160"
Keeps the first \(n\) (=100) least mutated samples:
select_samples_on_mutations(dataset.big, 100, desc = FALSE)
sfmF | ulaF | dapB | yibT | phnL | ispA | |
---|---|---|---|---|---|---|
GCF_001281725.1_ASM128172v1_genomic.fna | 0 | 0 | 0 | 0 | 0 | 0 |
GCF_001281855.1_ASM128185v1_genomic.fna | 0 | 0 | 0 | 0 | 0 | 0 |
GCF_001281815.1_ASM128181v1_genomic.fna | 0 | 0 | 0 | 0 | 0 | 0 |
GCF_001281775.1_ASM128177v1_genomic.fna | 0 | 0 | 0 | 0 | 0 | 0 |
GCF_001607735.1_ASM160773v1_genomic.fna | 0 | 0 | 0 | 0 | 0 | 0 |
GCF_001297965.1_ASM129796v1_genomic.fna | 0 | 0 | 0 | 0 | 0 | 0 |
## [1] "ncol: 999 - nrow: 100"
It is easy to combine these selections by using the pipe operator %>%
:
select_samples_on_mutations(dataset.big , 100, desc = FALSE) %>% select_genes_on_mutations(100)
eptA | yohC | yedK | argK | yeeO | narU | |
---|---|---|---|---|---|---|
GCF_001281725.1_ASM128172v1_genomic.fna | 1 | 1 | 1 | 1 | 0 | 0 |
GCF_001281855.1_ASM128185v1_genomic.fna | 1 | 1 | 1 | 1 | 0 | 0 |
GCF_001281815.1_ASM128181v1_genomic.fna | 1 | 1 | 1 | 1 | 1 | 1 |
GCF_001281775.1_ASM128177v1_genomic.fna | 1 | 1 | 1 | 1 | 1 | 1 |
GCF_001607735.1_ASM160773v1_genomic.fna | 1 | 1 | 1 | 1 | 1 | 1 |
GCF_001297965.1_ASM129796v1_genomic.fna | 1 | 1 | 1 | 1 | 1 | 1 |
## [1] "ncol: 100 - nrow: 100"
It may be of interest to show correlations among gene or sample mutations. The library corrplots
provides an easy way to do so by preparing an
heatmap based on the correlation matrix. We can show these plots by using the following comands:
gene mutations correlation:
corrplot_genes(dataset)
sample mutations correlation:
corrplot_samples(dataset)