One cause of diseases like cancer is the dysregulation of signalling pathways: the interaction between two or more genes changes, and with it the behaviour of cells in the malignant tissue.
The estimation of causal effects from observational data has previously been used to elucidate gene interactions. We extend this notion to compute Differential Causal Effects (DCE). We compare the causal effects between two conditions, such as a malignant tissue (e.g., from a tumor) and a healthy tissue to detect differences in the gene interactions.
However, computing causal effects solely from given observations is difficult, because it requires reconstructing the gene network beforehand. To overcome this issue, we use prior knowledge from literature. This largely improves performance and makes the estimation of DCEs more accurate.
Overall, we can detect pathways which play a prominent role in tumorigenesis. We can even pinpoint specific interactions in the pathway that make a large contribution to the rise of the disease.
You can learn more about the theory in our publication.
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("dce")
Load the dce package and other required libraries.
# fix "object 'guide_edge_colourbar' of mode 'function' was not found"
# when building vignettes
# (see also https://github.com/thomasp85/ggraph/issues/75)
library(ggraph)
library(curatedTCGAData)
library(TCGAutils)
library(SummarizedExperiment)
library(tidyverse)
library(cowplot)
library(graph)
library(dce)
set.seed(42)
To demonstrate the basic idea of Differential Causal Effects (DCEs), we first artificially create a wild-type network by setting up its adjacency matrix. The specified edge weights describe the direct causal effects, and total causal effects are defined accordingly (Pearl 2010). In this way, the detected dysregulations are endowed with a causal interpretation and spurious correlations are ignored. This can be achieved by using valid adjustment sets, assuming that the underlying network indeed models causal relationships accurately. In a biological setting, these networks correspond, for example, to a KEGG pathway (Kanehisa et al. 2004) in a healthy cell. Here, the edge weights correspond to proteins facilitating or inhibiting each other's expression levels.
graph_wt <- matrix(c(0, 0, 0, 1, 0, 0, 1, 1, 0), 3, 3)
rownames(graph_wt) <- colnames(graph_wt) <- c("A", "B", "C")
graph_wt
## A B C
## A 0 1 1
## B 0 0 1
## C 0 0 0
In case of a disease, these pathways can become dysregulated. This can be expressed by a change in edge weights.
graph_mt <- graph_wt
graph_mt["A", "B"] <- 2.5 # dysregulation happens here!
graph_mt
## A B C
## A 0 2.5 1
## B 0 0.0 1
## C 0 0.0 0
cowplot::plot_grid(
plot_network(graph_wt, edgescale_limits = c(-3, 3)),
plot_network(graph_mt, edgescale_limits = c(-3, 3)),
labels = c("WT", "MT")
)
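To make the relation between direct and total causal effects concrete, the following minimal sketch computes total effects for both toy networks. It assumes a linear structural equation model, in which the total effect of one node on another is the sum over all directed paths of the product of edge weights; for a DAG with weighted adjacency matrix \(B\) this equals \((I - B)^{-1} - I\). The helper `total_effects` and the names `adj_wt`, `adj_mt` are hypothetical, introduced here for illustration only, and are not part of dce.

```r
# Toy adjacency matrices as above: entry [i, j] is the direct effect of i on j.
nodes <- c("A", "B", "C")
adj_wt <- matrix(c(0, 0, 0, 1, 0, 0, 1, 1, 0), 3, 3, dimnames = list(nodes, nodes))
adj_mt <- adj_wt
adj_mt["A", "B"] <- 2.5

# Hypothetical helper (not a dce function): total effects under a linear SEM.
# For a DAG, B is nilpotent, so (I - B)^{-1} - I = B + B^2 + ...,
# i.e. the sum over all directed paths of the product of edge weights.
total_effects <- function(adj) {
  p <- nrow(adj)
  effects <- solve(diag(p) - adj) - diag(p)
  dimnames(effects) <- dimnames(adj)
  effects
}

# Difference in total effects between the two conditions.
dce_truth <- total_effects(adj_mt) - total_effects(adj_wt)
dce_truth
```

Under this assumption, the entries for A -> B and A -> C both equal 1.5, because the path A -> B -> C carries the change forward, while B -> C is unchanged at 0.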
By computing the counts based on the edge weights (root nodes are randomly initialized), we can generate synthetic expression data for each node in both networks. Both X_wt and X_mt then induce causal effects as defined in their respective adjacency matrices.
X_wt <- simulate_data(graph_wt)
X_mt <- simulate_data(graph_mt)
X_wt %>%
head
## A B C
## [1,] 1117 398 802
## [2,] 964 244 501
## [3,] 963 246 469
## [4,] 1204 502 1032
## [5,] 848 125 251
## [6,] 1163 455 885
Given the network topology (without edge weights!) and expression data from both WT and MT conditions, we can estimate the difference in causal effects for each edge between the two conditions. These are the aforementioned Differential Causal Effects (DCEs).
res <- dce(graph_wt, X_wt, X_mt)
res %>%
as.data.frame %>%
drop_na
## source target dce dce_stderr dce_pvalue
## 1 A B 1.5443343 0.03062434 3.373144e-114
## 2 A C 1.5234426 0.04449333 1.207257e-84
## 3 B C 0.1152442 0.17882577 5.200452e-01
Visualizing the result shows that we can recover the dysregulation of the edge from A to B.
Note that since we are computing total causal effects, the causal effect of A on C has changed as well.
plot(res) +
ggtitle("Differential Causal Effects between WT and MT condition")
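One way to see why the estimates come out this way is to estimate a differential effect with an ordinary regression. The sketch below is an illustration under simplifying assumptions, not the dce implementation: it simulates Gaussian data from a linear model with the toy edge weights (rather than count data), pools both conditions, and reads the differential effect off the interaction between the source node and a condition indicator, adjusting for the parents of the source node. All names (`simulate_linear`, `toy_wt`, `toy_mt`, `mutant`) are hypothetical.

```r
set.seed(1)
n <- 2000

# Hypothetical helper: Gaussian linear model on the toy network A -> B -> C, A -> C.
# Only the weight of the edge A -> B differs between conditions.
simulate_linear <- function(n, w_ab) {
  A <- rnorm(n)
  B <- w_ab * A + rnorm(n)
  C <- 1 * A + 1 * B + rnorm(n)
  data.frame(A = A, B = B, C = C)
}

toy_wt <- simulate_linear(n, w_ab = 1)
toy_mt <- simulate_linear(n, w_ab = 2.5)

# Pool both conditions and mark the mutant samples with an indicator.
pooled <- rbind(cbind(toy_wt, mutant = 0), cbind(toy_mt, mutant = 1))

# The interaction coefficient estimates the difference in total effect.
# A has no parents, so no adjustment is needed for edges starting at A.
dce_ab <- unname(coef(lm(B ~ A * mutant, data = pooled))["A:mutant"])
dce_ac <- unname(coef(lm(C ~ A * mutant, data = pooled))["A:mutant"])
# For B -> C, adjust for A (the parent of B), also interacted with the condition.
dce_bc <- unname(coef(lm(C ~ B * mutant + A * mutant, data = pooled))["B:mutant"])

c(A_B = dce_ab, A_C = dce_ac, B_C = dce_bc)
```

With 2000 samples per condition, the three interaction coefficients should land close to 1.5, 1.5 and 0, mirroring the pattern in the dce output above.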
To get a better feeling for the behavior of dce, we will look at DCE estimates for a larger pathway. In particular, we create a \(20\) node wild-type (WT) network with edge probability \(0.3\) as well as a dysregulated mutated (MT) network.
set.seed(1337)
# create wild-type and mutant networks
graph_wt <- create_random_DAG(20, 0.3)
graph_mt <- resample_edge_weights(graph_wt)
cowplot::plot_grid(
plot_network(as(graph_wt, "matrix"), labelsize = 0, arrow_size = 0.01),
plot_network(as(graph_mt, "matrix"), labelsize = 0, arrow_size = 0.01),
labels = c("WT", "MT")
)
Next, we simulate gene expression data and compute DCEs.
# simulate gene expression data for both networks
X_wt <- simulate_data(graph_wt)
X_mt <- simulate_data(graph_mt)
# compute DCEs
res <- dce::dce(graph_wt, X_wt, X_mt)
df_dce <- res %>%
as.data.frame %>%
drop_na %>%
arrange(dce_pvalue)
Finally, we compare the estimated to the ground truth DCEs.
# compute ground truth DCEs
dce_gt <- trueEffects(graph_mt) - trueEffects(graph_wt)
dce_gt_ind <- which(dce_gt != 0, arr.ind = TRUE)
# create plot
data.frame(
source = paste0("n", dce_gt_ind[, "row"]),
target = paste0("n", dce_gt_ind[, "col"]),
dce_ground_truth = dce_gt[dce_gt != 0]
) %>%
inner_join(df_dce, by = c("source", "target")) %>%
rename(dce_estimate = dce) %>%
ggplot(aes(x = dce_ground_truth, y = dce_estimate)) +
geom_abline(color = "gray") +
geom_point() +
xlab("DCE (ground truth)") +
ylab("DCE (estimate)") +
theme_minimal()
We observe that dce recovers DCEs across the whole range of magnitudes, from small to large: the points lie close to the diagonal.
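The plotting code above pairs `which(..., arr.ind = TRUE)` with the logical indexing `dce_gt[dce_gt != 0]`. Both traverse the matrix in column-major order, so sources, targets and values line up. This can be checked on a small difference matrix; the values and object names below are made up for illustration.

```r
# Hypothetical 3-node difference matrix with distinct non-zero entries.
nodes <- c("A", "B", "C")
toy_gt <- matrix(0, 3, 3, dimnames = list(nodes, nodes))
toy_gt["A", "B"] <- 1.5
toy_gt["A", "C"] <- -0.5
toy_gt["B", "C"] <- 2

# Row/column positions of the non-zero entries, in column-major order.
toy_ind <- which(toy_gt != 0, arr.ind = TRUE)

toy_table <- data.frame(
  source = nodes[toy_ind[, "row"]],
  target = nodes[toy_ind[, "col"]],
  dce_ground_truth = toy_gt[toy_gt != 0]
)
toy_table

# Each value matches the matrix entry at its (source, target) position.
aligned <- all(toy_gt[toy_ind] == toy_table$dce_ground_truth)
aligned
```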
Pathway dysregulations are a common cancer hallmark (Hanahan and Weinberg 2011). It is thus of interest to investigate how the causal effect magnitudes in relevant pathways vary between normal and tumor samples.
As a showcase, we download breast cancer (BRCA) RNA transcriptomics profiling data from TCGA (Tomczak, Czerwińska, and Wiznerowicz 2015).
brca <- curatedTCGAData(
diseaseCode = "BRCA",
assays = c("RNASeq2*"),
version = "2.0.1",
dry.run = FALSE
)
## Querying and downloading: BRCA_RNASeq2Gene-20160128
## see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
## loading from cache
## Querying and downloading: BRCA_RNASeq2GeneNorm-20160128
## see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
## loading from cache
## Querying and downloading: BRCA_colData-20160128
## see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
## loading from cache
## Querying and downloading: BRCA_metadata-20160128
## see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
## loading from cache
## Querying and downloading: BRCA_sampleMap-20160128
## see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
## loading from cache
## harmonizing input:
## removing 13161 sampleMap rows not in names(experiments)
## removing 5 colData rownames not in sampleMap 'primary'
This will retrieve all available samples for the requested data sets. These samples can be classified according to their sample type, which is encoded by a two-digit code.
sampleTables(brca)
## $`BRCA_RNASeq2Gene-20160128`
##
## 01 06 11
## 1093 7 112
##
## $`BRCA_RNASeq2GeneNorm-20160128`
##
## 01 06 11
## 1093 7 112
data(sampleTypes, package = "TCGAutils")
sampleTypes %>%
dplyr::filter(Code %in% c("01", "06", "11"))
## Code Definition Short.Letter.Code
## 1 01 Primary Solid Tumor TP
## 2 06 Metastatic TM
## 3 11 Solid Tissue Normal NT
We can extract Primary Solid Tumor and matched Solid Tissue Normal samples.
# split assays
brca_split <- TCGAsplitAssays(brca, c("01", "11"))
# only retain matching samples
brca_matched <- as(brca_split, "MatchedAssayExperiment")
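# note: per the sample type table above, code 01 is Primary Solid Tumor and
# code 11 is Solid Tissue Normal, so brca_wt holds the tumor samples and
# brca_mt the matched normal samples
brca_wt <- assay(brca_matched, "01_BRCA_RNASeq2GeneNorm-20160128")
brca_mt <- assay(brca_matched, "11_BRCA_RNASeq2GeneNorm-20160128")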
KEGG (Kanehisa et al. 2004) provides the breast cancer related pathway hsa05224.
It can be easily retrieved using dce.
pathways <- get_pathways(pathway_list = list(kegg = c("Breast cancer")))
brca_pathway <- pathways[[1]]$graph
Luckily, it shares all genes with the cancer data set.
shared_genes <- intersect(nodes(brca_pathway), rownames(brca_wt))
glue::glue(
"Covered nodes: {length(shared_genes)}/{length(nodes(brca_pathway))}"
)
## Covered nodes: 145/145
We can now estimate the differences in causal effects between matched tumor and normal samples on a breast cancer specific pathway.
res <- dce::dce(brca_pathway, t(brca_wt), t(brca_mt))
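We can now inspect the edges with the largest absolute DCEs. Several of the largest estimates below come with large standard errors and non-significant p-values, so magnitude alone should not be over-interpreted.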
res %>%
as.data.frame %>%
drop_na %>%
arrange(desc(abs(dce))) %>%
head
## source target dce dce_stderr dce_pvalue
## 1 WNT8A FZD4 -4342.2001 2876.9003 1.326486e-01
## 2 FGF3 FGFR1 -1553.3334 1322.3358 2.413890e-01
## 3 WNT8B FZD4 -1540.1078 235.1719 4.038543e-10
## 4 WNT7A FZD4 -1334.3963 598.5797 2.680676e-02
## 5 FGF21 FGFR1 -917.1445 1128.4631 4.172471e-01
## 6 WNT8A FZD7 817.6733 954.3545 3.924979e-01
plot(
res,
nodesize = 20, labelsize = 1,
node_border_size = 0.05, arrow_size = 0.01,
use_symlog = TRUE,
shadowtext = TRUE
)
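Because many edges are tested at once, one may want to correct the p-values for multiple testing before interpreting individual edges. The sketch below applies the Benjamini-Hochberg procedure from base R (`p.adjust`) to the six rows printed above. This is only illustrative: in a real analysis the correction would be applied to the full `dce_pvalue` column rather than to a pre-selected subset, and the column name `dce_padj` is a hypothetical addition.

```r
# p-values copied from the six rows printed above
top_edges <- data.frame(
  source = c("WNT8A", "FGF3", "WNT8B", "WNT7A", "FGF21", "WNT8A"),
  target = c("FZD4", "FGFR1", "FZD4", "FZD4", "FGFR1", "FZD7"),
  dce_pvalue = c(
    1.326486e-01, 2.413890e-01, 4.038543e-10,
    2.680676e-02, 4.172471e-01, 3.924979e-01
  )
)

# Benjamini-Hochberg adjustment (controls the false discovery rate)
top_edges$dce_padj <- p.adjust(top_edges$dce_pvalue, method = "BH")

# sort by adjusted p-value
top_edges[order(top_edges$dce_padj), ]
```

Among these six rows, only WNT8B -> FZD4 stays below 0.05 after adjustment; WNT7A -> FZD4 rises to about 0.08.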
We illustrate here how dce can help to adjust for some special types of unobserved confounding, such as batch effects. One needs a relatively large data set in order to detect confounding well.
We first generate the unconfounded data:
set.seed(1)
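# tiny constant; also added to p-values further below to avoid log(0)
epsilon <- 1e-100
network_size <- 50
graph_wt <- as(create_random_DAG(network_size, prob = .2), "matrix")
# give the edge n1 -> n2 a negligible (but non-zero) weight in the WT network
graph_wt["n1", "n2"] <- epsilon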
graph_mt <- graph_wt
graph_mt["n1", "n2"] <- 2
cowplot::plot_grid(
plot_network(graph_wt, edgescale_limits = c(-2, 2)),
plot_network(graph_mt, edgescale_limits = c(-2, 2)),
labels = c("WT", "MT")
)
truth <- trueEffects(graph_mt) - trueEffects(graph_wt)
plot_network(graph_wt, value_matrix = truth, edgescale_limits = c(-2, 2))
X_wt <- simulate_data(n = 100, graph_wt)
X_mt <- simulate_data(n = 100, graph_mt)
For the unconfounded data, dce estimates the differential causal effects well:
res <- dce(graph_mt, X_wt, X_mt, deconfounding = FALSE)
qplot(truth[truth != 0], res$dce[truth != 0]) +
geom_abline(color = "red", linetype = "dashed") +
xlab("true DCE") +
ylab("estimated DCE")
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 34 rows containing missing values or values outside the scale range
## (`geom_point()`).
plot_network(graph_wt, value_matrix = -log(res$dce_pvalue)) +
ggtitle("-log(p-values) for DCEs between WT and MT condition")
On the other hand, if our data come from two different batches, where the gene expression of each gene is shifted by some amount depending on the batch, then dce will have many false positive findings:
batch <- sample(c(0, 1), replace = TRUE, nrow(X_wt))
bX_wt <- apply(X_wt, 2, function(x) x + max(x) * runif(1) * batch)
bX_mt <- apply(X_mt, 2, function(x) x + max(x) * runif(1) * batch)
res_without_deconf <- dce(graph_mt, bX_wt, bX_mt, deconfounding = FALSE)
cowplot::plot_grid(
plot_network(
graph_wt,
value_matrix = -log(res_without_deconf$dce_pvalue + epsilon)
) +
ggtitle("-log(p-values) without deconfounding"),
qplot(truth[truth != 0], res_without_deconf$dce[truth != 0]) +
geom_abline(color = "red", linetype = "dashed") +
xlab("true DCE") +
ylab("estimated DCE"),
nrow = 1
)
## Warning: Removed 34 rows containing missing values or values outside the scale range
## (`geom_point()`).
However, the performance is improved when the confounding adjustment is used:
res_with_deconf <- dce(graph_mt, bX_wt, bX_mt, deconfounding = 1)
cowplot::plot_grid(
plot_network(
graph_wt,
value_matrix = -log(res_with_deconf$dce_pvalue + epsilon)
) +
ggtitle("-log(p-values) with deconfounding"),
qplot(truth[truth != 0], res_with_deconf$dce[truth != 0]) +
geom_abline(color = "red", linetype = "dashed") +
xlab("true DCE") +
ylab("estimated DCE"),
nrow = 1
)
## Warning: Removed 34 rows containing missing values or values outside the scale range
## (`geom_point()`).
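The intuition behind adjusting for an unobserved batch effect can be sketched with a generic technique: a shift shared by all genes shows up as a dominant principal component, and removing that component removes most of the spurious dependence. The example below is a self-contained illustration with simulated Gaussian data and made-up names; it is not a description of what dce does internally when `deconfounding` is set.

```r
set.seed(1)
n <- 500
p <- 20

# independent "genes": no true dependence between any pair of columns
clean <- matrix(rnorm(n * p), n, p)

# unobserved batch label shifts every gene by the same amount (3 units, arbitrary)
batch <- sample(c(0, 1), n, replace = TRUE)
confounded <- clean + 3 * batch

# spurious correlation between two unrelated genes
cor_before <- cor(confounded[, 1], confounded[, 2])

# remove the leading principal component of the centered data
centered <- scale(confounded, center = TRUE, scale = FALSE)
decomp <- svd(centered)
adjusted <- centered - decomp$d[1] * tcrossprod(decomp$u[, 1], decomp$v[, 1])

cor_after <- cor(adjusted[, 1], adjusted[, 2])
c(before = cor_before, after = cor_after)
```

In this simulation the correlation between two unrelated genes is large before the adjustment and close to zero afterwards. Note that removing a component also discards any genuine signal aligned with it.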
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /media/volume/teran2_disk/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] dce_1.13.0 graph_1.83.0
## [3] cowplot_1.1.3 lubridate_1.9.3
## [5] forcats_1.0.0 stringr_1.5.1
## [7] dplyr_1.1.4 purrr_1.0.2
## [9] readr_2.1.5 tidyr_1.3.1
## [11] tibble_3.2.1 tidyverse_2.0.0
## [13] TCGAutils_1.25.1 curatedTCGAData_1.27.0
## [15] MultiAssayExperiment_1.31.5 SummarizedExperiment_1.35.4
## [17] Biobase_2.65.1 GenomicRanges_1.57.2
## [19] GenomeInfoDb_1.41.2 IRanges_2.39.2
## [21] S4Vectors_0.43.2 BiocGenerics_0.51.3
## [23] MatrixGenerics_1.17.0 matrixStats_1.4.1
## [25] ggraph_2.2.1 ggplot2_3.5.1
## [27] BiocStyle_2.33.1
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-9 httr_1.4.7
## [3] GenomicDataCommons_1.29.7 prabclus_2.3-4
## [5] Rgraphviz_2.49.1 numDeriv_2016.8-1.1
## [7] tools_4.4.1 utf8_1.2.4
## [9] R6_2.5.1 vegan_2.6-8
## [11] mgcv_1.9-1 sn_2.1.1
## [13] permute_0.9-7 withr_3.0.1
## [15] graphite_1.51.0 gridExtra_2.3
## [17] flexclust_1.4-2 cli_3.6.3
## [19] sandwich_3.1-1 labeling_0.4.3
## [21] sass_0.4.9 diptest_0.77-1
## [23] mvtnorm_1.3-1 robustbase_0.99-4-1
## [25] proxy_0.4-27 Rsamtools_2.21.2
## [27] FMStable_0.1-4 Linnorm_2.29.0
## [29] plotrix_3.8-4 limma_3.61.12
## [31] RSQLite_2.3.7 generics_0.1.3
## [33] BiocIO_1.15.2 gtools_3.9.5
## [35] wesanderson_0.3.7 Matrix_1.7-1
## [37] fansi_1.0.6 logger_0.3.0
## [39] abind_1.4-8 lifecycle_1.0.4
## [41] multcomp_1.4-26 yaml_2.3.10
## [43] edgeR_4.3.20 mathjaxr_1.6-0
## [45] SparseArray_1.5.45 BiocFileCache_2.13.2
## [47] Rtsne_0.17 grid_4.4.1
## [49] blob_1.2.4 promises_1.3.0
## [51] gdata_3.0.0 ppcor_1.1
## [53] bdsmatrix_1.3-7 ExperimentHub_2.13.1
## [55] crayon_1.5.3 lattice_0.22-6
## [57] GenomicFeatures_1.57.1 chromote_0.3.1
## [59] KEGGREST_1.45.1 magick_2.8.5
## [61] pillar_1.9.0 knitr_1.48
## [63] rjson_0.2.23 fpc_2.2-13
## [65] corpcor_1.6.10 codetools_0.2-20
## [67] mutoss_0.1-13 glue_1.8.0
## [69] RcppArmadillo_14.0.2-1 data.table_1.16.2
## [71] vctrs_0.6.5 png_0.1-8
## [73] Rdpack_2.6.1 mnem_1.21.0
## [75] gtable_0.3.5 kernlab_0.9-33
## [77] assertthat_0.2.1 amap_0.8-20
## [79] cachem_1.1.0 xfun_0.48
## [81] mime_0.12 rbibutils_2.3
## [83] S4Arrays_1.5.11 RcppEigen_0.3.4.0.2
## [85] tidygraph_1.3.1 survival_3.7-0
## [87] tinytex_0.53 fastICA_1.2-5.1
## [89] statmod_1.5.0 TH.data_1.1-2
## [91] tsne_0.1-3.1 nlme_3.1-166
## [93] naturalsort_0.1.3 bit64_4.5.2
## [95] gmodels_2.19.1 filelock_1.0.3
## [97] bslib_0.8.0 colorspace_2.1-1
## [99] DBI_1.2.3 nnet_7.3-19
## [101] mnormt_2.1.1 tidyselect_1.2.1
## [103] processx_3.8.4 bit_4.5.0
## [105] compiler_4.4.1 curl_5.2.3
## [107] rvest_1.0.4 expm_1.0-0
## [109] xml2_1.3.6 TFisher_0.2.0
## [111] ggdendro_0.2.0 DelayedArray_0.31.14
## [113] shadowtext_0.1.4 bookdown_0.41
## [115] rtracklayer_1.65.0 harmonicmeanp_3.0.1
## [117] sfsmisc_1.1-19 scales_1.3.0
## [119] DEoptimR_1.1-3 RBGL_1.81.0
## [121] rappdirs_0.3.3 apcluster_1.4.13
## [123] digest_0.6.37 snowfall_1.84-6.3
## [125] rmarkdown_2.28 XVector_0.45.0
## [127] htmltools_0.5.8.1 pkgconfig_2.0.3
## [129] highr_0.11 dbplyr_2.5.0
## [131] fastmap_1.2.0 rlang_1.1.4
## [133] UCSC.utils_1.1.0 farver_2.1.2
## [135] jquerylib_0.1.4 zoo_1.8-12
## [137] jsonlite_1.8.9 BiocParallel_1.39.0
## [139] mclust_6.1.1 RCurl_1.98-1.16
## [141] magrittr_2.0.3 modeltools_0.2-23
## [143] GenomeInfoDbData_1.2.13 munsell_0.5.1
## [145] Rcpp_1.0.13 viridis_0.6.5
## [147] stringi_1.8.4 zlibbioc_1.51.2
## [149] MASS_7.3-61 plyr_1.8.9
## [151] AnnotationHub_3.13.3 org.Hs.eg.db_3.20.0
## [153] flexmix_2.3-19 parallel_4.4.1
## [155] ggrepel_0.9.6 Biostrings_2.73.2
## [157] graphlayouts_1.2.0 splines_4.4.1
## [159] multtest_2.61.0 hms_1.1.3
## [161] locfit_1.5-9.10 qqconf_1.3.2
## [163] ps_1.8.0 igraph_2.1.1
## [165] fastcluster_1.2.6 reshape2_1.4.4
## [167] BiocVersion_3.20.0 XML_3.99-0.17
## [169] evaluate_1.0.1 metap_1.11
## [171] pcalg_2.7-12 BiocManager_1.30.25
## [173] tzdb_0.4.0 tweenr_2.0.3
## [175] polyclip_1.10-7 clue_0.3-65
## [177] BiocBaseUtils_1.7.3 ggforce_0.4.2
## [179] restfulr_0.0.15 e1071_1.7-16
## [181] later_1.3.2 viridisLite_0.4.2
## [183] class_7.3-22 snow_0.4-4
## [185] websocket_1.4.2 ggm_2.5.1
## [187] memoise_2.0.1 AnnotationDbi_1.67.0
## [189] GenomicAlignments_1.41.0 ellipse_0.5.0
## [191] cluster_2.1.6 timechange_0.3.0
Hanahan, Douglas, and Robert A Weinberg. 2011. “Hallmarks of Cancer: The Next Generation.” Cell 144 (5): 646–74.
Kanehisa, Minoru, Susumu Goto, Shuichi Kawashima, Yasushi Okuno, and Masahiro Hattori. 2004. “The KEGG Resource for Deciphering the Genome.” Nucleic Acids Research 32 (suppl_1): D277–D280.
Pearl, Judea. 2010. “Causal Inference.” Causality: Objectives and Assessment, 39–58.
Tomczak, Katarzyna, Patrycja Czerwińska, and Maciej Wiznerowicz. 2015. “The Cancer Genome Atlas (TCGA): An Immeasurable Source of Knowledge.” Contemporary Oncology 19 (1A): A68.